6 Creating automatic analysis pipelines
Depending on how much time you’ve spent working on data analyses, you have probably experienced (many) times where you are working on a project and forget what code needs to be run first, what order other code needs to run in, and what pieces of code need to be re-run in order to update other results. Things get confusing quickly, even for fairly simple projects. This probably happens most often when you return to a project after a month or two and completely forget the state of the project and analysis.
This is where formal data analysis pipeline tools come in. By setting up your analysis as distinct steps, with clear inputs and outputs, and using a system that tracks those inputs and outputs, you can make things a lot easier for yourself and others. This session is about applying tools that build and manage these pipelines.
6.1 Learning objectives
The overall objective for this session is to:
- Identify and apply an approach to create an analysis pipeline that makes your analysis steps, from raw data to finished manuscript, explicitly defined so that updating it by either you or collaborators is as easy as running a single function.
More specific objectives are to:
- Describe the computational meaning of pipeline and how pipelines are often done in research. Explain why a well-designed pipeline can streamline your collaboration, reduce time spent doing an analysis, make your analysis steps explicit and easier to work on, and ultimately contribute to more fully reproducible research.
- Explain the difference between a “function-oriented” workflow vs a “script-oriented” workflow and why the function-based approach has multiple advantages from a time- and effort-efficiency point of view.
- Use the functions within targets to apply the concepts of building pipelines in your analysis project.
- Continue applying the concepts and functions used from the previous session.
6.2 Exercise: How do you re-run analyses when something changes?
Time: ~12 minutes.
We’ve all been in situations where something in our analysis needs to change. Maybe we forgot to remove a certain condition (like unrealistic BMI). Or maybe our supervisor suggests something we hadn’t considered in the analysis. Or maybe during peer review of our manuscript, a reviewer makes a suggestion that would improve the understanding of the paper. Whatever the situation, we inevitably need to re-run our analysis. And depending on what the change was, we might need to run the full analysis all over again. So what is your exact workflow when you need to re-run code and update your results? Assume it’s a change somewhere early in the data processing stage.
- Take about 1 minute to think about the workflow you use. Try to think of the exact steps you need to take, what exactly you do, and how long that usually takes.
- For 8 min, in your group share and discuss what you’ve thought. How do your experiences compare to each other?
- For the remaining time, each group will briefly share with everyone what they thought about and discussed.
6.3 What is a data analysis “pipeline”?
6.4 Using targets to manage the pipeline
There are a few packages to help build pipelines in R, but the most commonly used, well-designed, and maintained one is called targets. With this package, you specify outputs you want to create and targets will track them for you. So it will know which output depends on which other one and which ones need to be updated.
First, we need to install targets in the project environment. And since targets is a build dependency, we add it to the DESCRIPTION
file with:
Console
use_package("targets")
Now that it’s added to the project R library, let’s set up our project to start using it!
Console
targets::use_targets()
This will add several files:
.
├── _targets.R
├── run.R
└── run.sh
The most important file is the _targets.R
file. The other two files are used for other situations (like running on a Linux server) that we won’t cover in the course. Before we continue though, let’s commit these new files to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).
Let’s start writing code to create the three items we listed above in Figure 6.1: some descriptive statistics, a plot of the continuous lipid variables, and a report (R Markdown). Since we’ll use tidyverse, specifically dplyr, to calculate the summary statistics, we need to add it to our dependencies and install it in the project library. tidyverse is a special “meta”-package so we need to add it to the "depends"
section of the DESCRIPTION
file.
Console
use_package("tidyverse", "depends")
Commit the changes made to the DESCRIPTION
file in the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).
Now, let’s start doing some data analysis so that we can add to our pipeline later on. First, open up the doc/learning.qmd
file and create a new header and code chunk at the bottom of the file.
doc/learning.qmd
```{r setup}
library(tidyverse)
source(here::here("R/functions.R"))
lipidomics <- read_csv(here::here("data/lipidomics.csv"))
```
## Basic statistics
```{r basic-stats}
```
We want to calculate the mean and SD for each metabolite and then, to make it more readable, round the numbers to one digit. We covered this in the functionals session of the intermediate course, so we will apply those same principles and code here. To do that, we need to use group_by() on the metabolite column, use across() inside summarise() so we can give it the mean() and sd() functions, followed by mutate()-ing each numeric column (across(where(is.numeric))) to round() to 1 digit. Let’s write out the code!
doc/learning.qmd
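Following the steps just described, the chunk might look like this (a sketch; it mirrors the descriptive_stats() solution shown later in this section):

```{r basic-stats}
lipidomics |>
  group_by(metabolite) |>
  summarise(across(value, list(mean = mean, sd = sd))) |>
  mutate(across(where(is.numeric), ~ round(.x, digits = 1)))
```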
# A tibble: 12 × 3
metabolite value_mean value_sd
<chr> <dbl> <dbl>
1 CDCl3 (solvent) 180 67
2 Cholesterol 18.6 11.4
3 FA -CH2CH2COO- 33.6 7.8
4 Lipid -CH2- 537. 61.9
5 Lipid CH3- 1 98.3 73.8
6 Lipid CH3- 2 168. 29.2
7 MUFA+PUFA 32.9 16.1
8 PUFA 30 24.1
9 Phosphatidycholine 31.7 20.5
10 Phosphatidylethanolamine 10 7.6
11 Phospholipids 2.7 2.6
12 TMS (interntal standard) 123 130.
After that, style the file using the Palette (Ctrl-Shift-P, then type “style file”). Then commit the changes to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).
6.5 Exercise: Convert summary statistics code into a function
Time: ~20 minutes.
While inside the doc/learning.qmd
file, use the “function-oriented” workflow, as taught in the intermediate course, to take the code we wrote above and convert it into a function. Complete these tasks:
- Wrap the code with function() {...} and name the new function descriptive_stats. Here is some scaffolding in doc/learning.qmd to help you get started:

      descriptive_stats <- function(___) {
        ___
      }

- Replace lipidomics with data and put data as an argument inside the brackets of function().
- Add dplyr:: to the start of each dplyr function used inside your function (except for where(), which comes from the tidyselect package).
- Style the code using the Palette (Ctrl-Shift-P, then type “style file”) to make sure it is formatted correctly. You might need to manually force a styling if lines are too long.
- With the cursor inside the function, add some roxygen documentation with Ctrl-Shift-P followed by typing “roxygen comment”. Remove the lines that contain @examples and @export, then fill in the other details (like the @param lines and the title). In the @return section, write “A data.frame/tibble.”
- Cut and paste the function over into the R/functions.R file.
- Source the R/functions.R file with Ctrl-Shift-S or with the Palette (Ctrl-Shift-P, then type “source”), and then test the code by running descriptive_stats(lipidomics) in the Console. If it works, do the last task.
- Save both files, then open the Git interface and commit the changes you made to them with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).
In the intermediate course, we highly suggested using return()
at the end of the function. Technically we don’t need an explicit return()
, since the output of the last code that R runs within the function will be the output of the function. This is called an “implicit return” and we will be using this feature throughout the rest of this course.
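As a small illustration (the add_one() function here is just a throwaway example, not part of the project code), these two functions return the same thing; the second relies on the implicit return:

```r
# Explicit return, as suggested in the intermediate course.
add_one <- function(x) {
  return(x + 1)
}

# Implicit return: the value of the last expression run in the function body
# is what the function outputs.
add_one <- function(x) {
  x + 1
}
```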
Click for a potential solution. Only click if you are struggling or are out of time.
#' Calculate descriptive statistics of each metabolite.
#'
#' @param data The lipidomics dataset.
#'
#' @return A data.frame/tibble.
#'
descriptive_stats <- function(data) {
data |>
dplyr::group_by(metabolite) |>
dplyr::summarise(dplyr::across(value, list(mean = mean, sd = sd))) |>
dplyr::mutate(dplyr::across(tidyselect::where(is.numeric), ~round(.x, digits = 1)))
}
6.6 Adding a step in the pipeline
Now that we’ve created a function to calculate some basic statistics, we can add it as a step in the targets pipeline. Open up the _targets.R file and go to the end of the file, where the list() and tar_target() code are found. In the first tar_target(), replace the command so that it loads the lipidomics data. In the second, replace the command with the descriptive_stats() function. If we want to make it easier to remember what the target output is, we can add df_ to the start of its name to remind us that it is a data frame. It should look like the sketch below:
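Here is a sketch of how the end of _targets.R could look, based on the description above and the target names (lipidomics and df_stats_by_metabolite) used later in this session; the exact read_csv() call is an assumption:

_targets.R

```r
list(
  tar_target(
    name = lipidomics,
    command = readr::read_csv("data/lipidomics.csv")
  ),
  tar_target(
    name = df_stats_by_metabolite,
    command = descriptive_stats(lipidomics)
  )
)
```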
Let’s run targets to see what happens! You can either use the Palette (Ctrl-Shift-P, then type “targets run”) or run this code in the Console:
Console
targets::tar_make()
While this targets pipeline works, we would not be able to automatically re-run our pipeline if our underlying data changes. To track the actual data file, we need to create a pipeline target for the file itself. We can accomplish this by using the argument format = "file" inside a tar_target() that comes before the one loading the data with readr.
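A sketch of how that could look, with a new target (named file here purely for illustration) whose command returns the path to the CSV file:

_targets.R

```r
list(
  tar_target(
    name = file,
    command = "data/lipidomics.csv",
    format = "file"
  ),
  tar_target(
    name = lipidomics,
    command = readr::read_csv(file)
  ),
  ...
)
```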
Now, let’s try running targets again using the Palette (Ctrl-Shift-P, then type “targets run”).
It probably won’t run though. That’s because targets doesn’t know about the packages that you need for the pipeline. To tell it, we need to go to the tar_option_set() section of the _targets.R file and add to the packages = c("tibble") code the packages we use that aren’t explicitly called via :: (e.g. |>). For now, we only need to add "dplyr" to the packages argument.
We can now put this code in the packages
argument of tar_option_set()
in the _targets.R
file:
_targets.R
packages = c("tibble", "dplyr")
Try running targets again with either targets::tar_make() or the Palette (Ctrl-Shift-P, then type “targets run”). It should run through! We also see that a new folder has been created called _targets/. Inside this folder targets keeps all of the output from running the code. It comes with its own .gitignore file so that you don’t track all the files inside, since they aren’t necessary; only the _targets/meta/meta file needs to be included in Git.
We can visualize our individual pipeline targets that we track through tar_target()
now too, which can be useful as you add more and more targets. We will (likely) need to install an extra package (done automatically):
Console
targets::tar_visnetwork()
Or to see what pipeline targets are outdated:
Console
targets::tar_outdated()
Before continuing, let’s commit the changes (including the files in the _targets/
folder) to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).
6.7 Creating figure outputs
Not only can we create data frames with targets (like above), but also figures. Let’s write some code to create the plot we listed as our “output 2” in Figure 6.1. Since we’re using ggplot2 to write this code, let’s add it to our DESCRIPTION
file.
Console
use_package("ggplot2")
Next, we’ll switch back to doc/learning.qmd
and write the code for this plot of the distribution of each metabolite. We’ll use geom_histogram()
, nothing too fancy. And since the data is already in long format, we can easily use facet_wrap()
to create a plot for each metabolite. We use scales = "free"
because each metabolite doesn’t have the same range of values (some are small, others are quite large).
doc/learning.qmd
metabolite_distribution_plot <- ggplot(lipidomics, aes(x = value)) +
geom_histogram() +
facet_wrap(vars(metabolite), scales = "free")
metabolite_distribution_plot
We now have the basic code to convert over into functions.
6.8 Exercise: Convert the plot code to a function
Time: ~10 minutes.
For now, we will only take the code to make the distribution plot and convert it into a function. Just like you did with the descriptive_stats()
function in the exercise above, complete these tasks:
- Wrap the plot code inside doc/learning.qmd with function() {...} and name the new function plot_distributions. Use this scaffolding code in doc/learning.qmd to help guide you to write the code into a function:

      plot_distributions <- function(___) {
        ___
      }

- Replace lipidomics with data and put data as an argument inside the brackets of function().
- Add ggplot2:: to the start of each ggplot2 function used inside your function.
- Style using the Palette (Ctrl-Shift-P, then type “style file”) to make sure it is formatted correctly. You might need to manually force a styling if lines are too long.
- With the cursor inside the function, add some roxygen documentation with Ctrl-Shift-Alt-R or with the Palette (Ctrl-Shift-P, then type “roxygen comment”). Remove the lines that contain @examples and @export, then fill in the other details (like the @param lines and the title). In the @return section, write “A plot object.”
- Cut and paste the function over into the R/functions.R file.
- Source the R/functions.R file (Ctrl-Shift-S or with the Palette (Ctrl-Shift-P, then type “source”)) and then test the code by running plot_distributions(lipidomics) in the Console. If it works, do the last task.
- Save both files, then open the Git interface and commit the changes you made to them with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).
Click for the solution. Only click if you are struggling or are out of time.
## This should be in the R/functions.R file.
#' Plot for basic distribution of metabolite data.
#'
#' @param data The lipidomics dataset.
#'
#' @return A ggplot2 graph.
#'
plot_distributions <- function(data) {
data |>
ggplot2::ggplot(ggplot2::aes(x = value)) +
ggplot2::geom_histogram() +
ggplot2::facet_wrap(ggplot2::vars(metabolite), scales = "free")
}
6.9 Adding the plot function as pipeline targets
Now, let’s add the plot function to the _targets.R
file. Let’s write this tar_target()
item within the list()
inside _targets.R
. To make it easier to track things, add fig_
to the start of the name
given.
_targets.R
list(
...,
tar_target(
name = fig_metabolite_distribution,
command = plot_distributions(lipidomics)
)
)
Next, test that it works by running targets::tar_visnetwork()
using the Palette (Ctrl-Shift-P, then type “targets visual”) or running targets::tar_outdated() with the Palette (Ctrl-Shift-P, then type “targets outdated”). You should see that the new item is “outdated”. Then run targets::tar_make() using the Palette (Ctrl-Shift-P, then type “targets run”) to update the pipeline. If it all works, then commit the changes to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).
6.10 Incorporating Quarto targets
Last, but not least, we want to make the final output 3 from Figure 6.1: The Quarto document. Adding a Quarto document as a target inside _targets.R
is fairly straightforward. We need to install the helper package tarchetypes first, as well as the quarto R package (it helps connect with Quarto):
Console
use_package("tarchetypes")
use_package("quarto")
Then, inside _targets.R
, uncomment the line where library(tarchetypes)
is commented out. The function we need to use to build the Quarto file is tar_quarto()
(or tar_render()
for R Markdown files), which needs two things: the name, like tar_target() needs, and the file path to the Quarto file. Again, like the other tar_target() items, add it to the end of the list(). Even though we’ve been using doc/learning.qmd as a sandbox, for this course we will point the Quarto target at that file:
_targets.R
list(
...,
tar_quarto(
name = quarto_doc,
path = "doc/learning.qmd"
)
)
Now when we run targets::tar_make()
with the Palette (Ctrl-Shift-P, then type “targets run”), the Quarto file also gets re-built. But when we use targets::tar_visnetwork() using the Palette (Ctrl-Shift-P, then type “targets visual”), we don’t see the connections with the plot and descriptive statistics targets. That’s because we haven’t used them in a way targets can recognize. For that, we need to use the function targets::tar_read(). And because our Quarto file is located in the doc/ folder, we also have to tell Quarto where the targets “store” is (the folder of stored objects that targets::tar_read() looks for), which we do by using targets::tar_config_set().
Let’s open up the doc/learning.qmd
file, add a setup
code chunk below the YAML header, and then create a new header and code chunk that make use of targets::tar_read().
doc/learning.qmd
---
# YAML header
---
```{r setup}
targets::tar_config_set(store = here::here("_targets"))
library(tidyverse)
library(targets)
source(here::here("R/functions.R"))
lipidomics <- tar_read(lipidomics)
```
## Results
```{r}
tar_read(df_stats_by_metabolite)
```
```{r}
tar_read(fig_metabolite_distribution)
```
With targets::tar_read(), we can access all the stored target items using syntax like we would with dplyr, without quotes. For the df_stats_by_metabolite target, we can do some minor wrangling with mutate() and glue::glue(), and then pipe it to knitr::kable() to create a table in the output document. The glue package is really handy for formatting text based on columns. If you use {} inside a quoted string, you can insert columns from a data frame, like value_mean. So we can use it to format the final table text to be mean value (SD value):
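A sketch of a code chunk that would produce a table like the one below; the column renaming and the caption are assumptions based on the table headers shown:

doc/learning.qmd

```{r}
tar_read(df_stats_by_metabolite) |>
  mutate(MeanSD = glue::glue("{value_mean} ({value_sd})")) |>
  select(Metabolite = metabolite, `Mean (SD)` = MeanSD) |>
  knitr::kable(caption = "Descriptive statistics of the metabolites.")
```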
Metabolite | Mean (SD) |
---|---|
CDCl3 (solvent) | 180 (67) |
Cholesterol | 18.6 (11.4) |
FA -CH2CH2COO- | 33.6 (7.8) |
Lipid -CH2- | 536.6 (61.9) |
Lipid CH3- 1 | 98.3 (73.8) |
Lipid CH3- 2 | 168.2 (29.2) |
MUFA+PUFA | 32.9 (16.1) |
PUFA | 30 (24.1) |
Phosphatidycholine | 31.7 (20.5) |
Phosphatidylethanolamine | 10 (7.6) |
Phospholipids | 2.7 (2.6) |
TMS (interntal standard) | 123 (130.4) |
Re-run targets::tar_visnetwork()
using the Palette (Ctrl-Shift-P, then type “targets visual”) to see that it now detects the connections between the pipeline targets. Then, run targets::tar_make() with the Palette (Ctrl-Shift-P, then type “targets run”) again to see everything re-build! The last things to do are to re-style the file using the Palette (Ctrl-Shift-P, then type “style file”), then commit the changes to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”) before moving on. Then push your changes up to GitHub.
6.11 Fixing issues in the stored pipeline data
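If the data stored in the _targets/ store ever gets into a bad state, or you simply want a target to be re-built completely from scratch, one option is to delete its stored output. A minimal sketch, using the target name we created earlier in this session:

Console

```r
# Delete the stored result of a single target; it will be re-built the next
# time you run targets::tar_make().
targets::tar_delete(df_stats_by_metabolite)
```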
6.12 Summary
- Use a function-oriented workflow together with targets to build your data analysis pipeline and track your “pipeline targets”.
- List individual “pipeline targets” using tar_target() within the _targets.R file.
- Visualize target items in your pipeline with targets::tar_visnetwork() or list outdated items with targets::tar_outdated().
- Within R Markdown / Quarto files, use targets::tar_read() to access saved pipeline outputs. To include the Quarto in the pipeline, use tarchetypes and its tar_quarto() function.
- Delete stored pipeline output with tar_delete().