1 Syllabus
Reproducibility and open scientific practices are increasingly demanded of and needed by scientists and researchers in modern research environments. Our work often requires or involves a high level of hands-on collaboration on scientific projects throughout all stages of the research lifecycle. We face obstacles that we have no training for and little knowledge of how to address. Some obstacles are as simple as not having a shared way of writing code or managing files, which can impede effective collaboration. Others are complex, like documenting software dependencies or the steps in an analysis pipeline in a way that makes it easy to resolve issues and get the most recent set of results after collaborators have worked on a project.
Aside from the impact on collaboration, these obstacles can affect even research projects with just one primary researcher. Compounding this issue is that researchers’ main incentivised output, publications, doesn’t easily accommodate other forms of output that could help address these obstacles. For example, a research group’s workflow or procedures are valuable information that other groups could use and learn from, but “publishing” this type of information isn’t rewarded, so it is rarely done. Likewise, the code used in analyses is rarely shared, even though sharing it would improve the reproducibility of the research results and allow others to learn from it. Ultimately, all of this leads to less reliable and less reproducible scientific results. With this workshop, we aim to begin addressing this gap.
This workshop lasts 3 days and is split into the following sessions, listed in the schedule, which will be covered in order:
- Introduction to the workshop
- Dependency management for smoother collaboration
- Setting up automatic analysis pipelines
- Using the pipeline to build up research output
- Design your analysis: Build up one model first
- Extend your analysis to run many models
- Visualising the results of many models
1.1 Learning outcome and objectives
The overall aim of this workshop is to enable you to:
- Describe what an open, collaboration-friendly, and nearly-automated reproducible data analysis pipeline and workflow looks like.
- Design your code and analysis using simple principles and concepts that allow you to write more flexible and robust code that does more with less and that is friendlier to both you and your collaborators.
- Create an R project that follows these practices.
Broken down into specific objectives, we’ve designed the workshop to enable you to do the following in each session:
Dependency management for smoother collaboration
- Identify potential actions to take that streamline collaboration on a data analysis project.
- Explain what a “project-oriented” workflow is, what project-level R dependency management is, and why these concepts are important for collaborative and reproducible analyses.
- Describe the difference between workflow dependencies and build dependencies, and apply functions in the usethis R package to implement these dependency management concepts.
- Explain how following a style guide helps build a common approach to reading and writing R code, and thereby improves project-level collaboration.
- Use styler and RStudio’s canonical markdown mode to programmatically check and apply style guides to your project files.
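For a rough sense of what these dependency management and styling steps might look like in practice, here is a minimal sketch. It assumes the common convention of recording build dependencies under Imports and workflow dependencies under Suggests in the DESCRIPTION file; the package names are only illustrative examples, not the workshop's required list.

```r
# Build dependency: a package the analysis code itself needs to run
# (recorded under Imports in the DESCRIPTION file).
usethis::use_package("dplyr")

# Workflow dependency: a package used while developing the project,
# not by the analysis code (recorded under Suggests here).
usethis::use_package("styler", type = "Suggests")

# Programmatically apply the style guide to all R files in the project.
styler::style_dir()
```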
Setting up automatic analysis pipelines
- Describe the computational meaning of pipeline and how pipelines are often used in research.
- Explain why a well-designed pipeline can streamline collaboration, reduce time spent on an analysis, make the analysis steps explicit and easier to work with, and ultimately contribute to more fully reproducible research.
- Explain the difference between a “function-oriented” workflow and a “script-oriented” workflow, and why the function-based approach has multiple advantages from a time- and effort-efficiency point of view.
- Set up an analysis pipeline using targets that clearly defines each step of your analysis, from raw data to finished manuscript, and that makes updating your analysis by you or your collaborators as simple as running a single function.
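As a rough idea of what such a pipeline might look like, below is a minimal sketch of a `_targets.R` file. The file paths and the `clean_data()` and `fit_model()` functions are hypothetical placeholders, not the workshop's actual code.

```r
library(targets)
library(tarchetypes) # provides tar_quarto()

# Load the functions the pipeline steps call (hypothetical location).
tar_source("R/functions.R")

list(
  # Track the raw data file so changes to it trigger a rebuild.
  tar_target(raw_data_file, "data/raw-data.csv", format = "file"),
  tar_target(cleaned_data, clean_data(raw_data_file)),
  tar_target(model_results, fit_model(cleaned_data)),
  # Render the manuscript that uses the results above.
  tar_quarto(manuscript, "doc/manuscript.qmd")
)
```

With a file like this, running `targets::tar_make()` rebuilds only the steps whose code or upstream data have changed.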
Using the pipeline to build up research output
- Build an analysis pipeline using targets that clearly defines each step of your analysis, from raw data to finished manuscript, and that makes updating your analysis by you or your collaborators as simple as running a single function.
- Apply a design-driven approach to building the pipeline step by step, to help manage complexity and help you focus on testing and building the pipeline at each stage.
- Use the principle of “start at the end”, working backwards from your desired final outputs to the raw data, to help design, plan, and build your pipeline effectively.
Design your analysis: Build up one model first
- Design a rough outline of an analysis workflow where data flows through multiple functions to produce a desired output.
- Apply functional programming concepts to run statistical analyses that fit within the targets pipeline framework, regardless of what statistical method is used.
- Decompose a complex statistical analysis into smaller functions, where each function achieves a part of the larger output so that the function can later be used flexibly and scalably with any number of inputs.
- Use the broom package to extract the model results in a tidy form.
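To illustrate these ideas, the sketch below fits a single model and tidies its results. It uses the built-in mtcars data and an arbitrary linear model as stand-ins, not the workshop's actual data or analysis.

```r
library(broom)

# A small function that fits one model to a data frame, so it can
# later be reused on any subset of the data.
fit_model <- function(data) {
  lm(mpg ~ wt + hp, data = data)
}

# Build up one model first, then extract its results in tidy form.
model <- fit_model(mtcars)
tidy(model, conf.int = TRUE)
```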
Extend your analysis to run many models
- Recall principles of functional programming.
- Design analyses to make use of the principles of functional programming and the split-apply-combine technique.
- Apply these principles to write concise code that does many things, in particular by using the purrr package.
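For instance, a split-apply-combine step with purrr could look roughly like the sketch below, which reuses the hypothetical `fit_model()` function and mtcars stand-in data from the previous sketch: split the data by cylinder count, fit the same model to each subset, and combine the tidy results.

```r
library(purrr)
library(broom)

model_results <- mtcars |>
  # Split: one data frame per number of cylinders.
  split(mtcars$cyl) |>
  # Apply: fit the same model to each subset.
  map(fit_model) |>
  map(tidy, conf.int = TRUE) |>
  # Combine: stack the tidy results, keeping the group as a column.
  list_rbind(names_to = "cyl")

model_results
```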
Visualising the results of many models
- Use an effective way to visualise the results from multiple statistical models fitted to different subsets of data using the ggplot2 R package.
- Identify what a fully reproducible analysis pipeline looks like, from start to finish, and how this workflow might make a data analysis project easier for you and your collaborators to work on together.
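One possible way to plot such results, continuing from the hypothetical `model_results` data frame in the earlier sketch, is shown below; the exact plot used in the workshop may differ.

```r
library(ggplot2)

model_results |>
  # Drop the intercept so the coefficient estimates are comparable.
  dplyr::filter(term != "(Intercept)") |>
  ggplot(aes(x = estimate, y = term)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  # One panel per subset of the data (cylinder count here).
  facet_wrap(vars(cyl))
```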
And we will not learn:
- Any details about specific statistical methods or models (these are already covered by most university curricula). We cover how to run statistical methods, but not which methods to use for your data or project.
Tangibly, during the workshop you will:
- Style your code using styler to improve readability and consistency.
- Track package dependencies in the DESCRIPTION file as either workflow or build dependencies.
- Build automated pipelines with targets.
- Structure your analysis as a set of functions that piece together into a cohesive series of steps.
- Run multiple statistical analyses efficiently using purrr and functional programming.
- Manage and collaborate on your projects using Git and GitHub.
- Write reproducible reports using Quarto.
Because learning and coding are ultimately not just solo activities, during this workshop we also aim to provide opportunities to chat with fellow participants, learn about their work and how they do their analyses, and build networks of support and collaboration.
The specific software and technologies we will cover in this workshop are R, RStudio, Quarto, Git, and GitHub, while the specific R packages are styler, targets, and, for more advanced uses, purrr combined with dplyr.