17 Setting up automatic analysis pipelines
When analysing data, you have probably experienced (many) times when you had to re-run some analyses but had forgotten the order in which the code should run or which parts you need to re-run to update the results. Things can get confusing quickly, even in relatively simple projects. This becomes even more challenging when you return to a project after a month or two and have forgotten the state of the analysis and the project as a whole. Add in collaborators, and things can get even more complex.
That’s where formal data analysis pipeline tools come in. By organising your analyses into distinct steps, with clear inputs and outputs, and adding these steps to a pipeline that tracks them, you can make things a lot easier for yourself and others. This session focuses on using tools that create and manage these pipelines effectively.
17.1 Learning objectives
- Describe the computational meaning of pipeline and how pipelines are often used in research.
- Explain why a well-designed pipeline can streamline collaboration, reduce time spent on an analysis, make the analysis steps explicit and easier to work with, and ultimately contribute to more fully reproducible research.
- Explain the difference between a “function-oriented” workflow and a “script-oriented” workflow, and why the function-based approach has multiple advantages from a time- and effort-efficiency point of view.
- Set up an analysis pipeline using targets that clearly defines each step of your analysis, from raw data to finished manuscript, and that makes updating your analysis by you or your collaborators as simple as running a single function.
17.2 💬 Discussion activity: How do you re-run analyses when something changes?
Time: ~10 minutes.
We’ve all been (or will be) in situations where something in our analysis needs to change: Maybe we forgot to remove a certain condition (like unrealistic BMI values), maybe our supervisor suggests something we hadn’t considered in our analysis, or maybe during peer review of our manuscript, a reviewer makes a suggestion that would improve the paper.
Whatever the situation, we inevitably need to re-run parts of or the full analysis. So what is your exact workflow when you need to re-run code and update your results? Assume it’s a change in the code somewhere early in the data processing stage.
- Take about 1 minute to think about the workflow you use. Try to think of the exact steps you need to take, what exactly you do, and how long that usually takes.
- For 6 minutes, share and discuss your thoughts with your neighbour. How do your experiences compare to each other?
- For the remaining time, we’ll briefly share with everyone what they’ve thought and discussed.
17.3 📖 Reading task: What is a data analysis “pipeline”?
Time: ~10 minutes.
After they finish reading this section, briefly walk through it. In particular, emphasize what we want to make at the end, even though that goal might change as the analysis project progresses.
A pipeline can be any process where the steps between a start and an end point are very clear, explicit, and concrete. These highly distinct steps can be manual, human involved, or completely automated by a robot or computer. For instance, in car factories, the pipeline from the input raw materials to the final vehicle is extremely well described and implemented. Similarly, during the pandemic, the pipeline for testing (at least in Denmark and several other countries) was highly structured and clear for both the workers doing the testing and the people having the test done: A person goes in, scans their ID card, has the test done, the worker inputs the results, and the results are sent immediately to the health agency as well as to the person based on their ID contact information (or via a secure app).
However, in research, especially around data collection and analysis, you may often hear or read about “pipelines”. But looking closer, these aren’t actual pipelines because the individual steps are not very clear and not well described, and they often require a fair amount of manual human attention and intervention. Often these “pipelines” are general workflows that people follow to complete a specific set of tasks.
In computational environments, a pipeline should be a set of data processing steps connected in a series with a specific order; the output of one step is the input to the next. This means that there actually should be minimal to no human intervention from raw input to finished product. And it means that the entire process should be automated and reproducible.
So why are these computational data pipelines rarely found in most of the research “pipelines” described in papers? Because:
- Not all researchers write code.
- Researchers who do write code rarely publish and share it.
- Code that is shared or published (either publicly or within the research group) is written in a way that makes it hard to build into a pipeline.
- And, research is largely non-reproducible (1–3).
A data analysis pipeline would, by definition, be a readable and reproducible data analysis. Unfortunately, researchers (on average and as a group) don’t have the technical expertise and knowledge to actually implement data analysis pipelines.
This isn’t to diminish the work of researchers, but is rather a basic observation about the systemic, social, and structural environment surrounding us. We as researchers are not necessarily trained in writing code, nor is there a strong culture and incentive structure around learning, sharing, reviewing, and improving code. This also means we are very rarely allowed to get (or use) funds to hire people who are trained and skilled in programmatic design, thinking, and coding. Otherwise, workshops like this wouldn’t need to exist 🤷♀️
Before we get to what an actual data analysis pipeline looks like in practice, we have to separate two things: exploratory data analysis and data analysis for the final paper. In exploratory data analysis, there will likely be a lot of manual, interactive steps involved that may or may not need to be explicitly stated and included in the analysis plan and pipeline.
But for the final paper and the included results, you generally have some basic first ideas of what you’ll need. And when working on projects, it’s good practice to follow the principle “start at the end and work backwards”: think about and design what your final product will be and what it (generally) will contain. In research products like papers, you usually have several tables and figures. Those are “sub-products” of the main “product”. So let’s list a few “sub-products” (which we’ll just call “products”) that you almost always have in a research paper:
- A table of some basic descriptive statistics of the study population, such as mean, standard deviation, or counts of basic discrete data (like treatment group). Usually this is a “table 1” in the paper.
- A figure showing the distribution of your main variables of interest. In this case, ours are the lipidomic variables. Like with the table above, this is usually “figure 1” or a similarly early figure in the paper.
- Lastly, the paper with the above table and figure included.
Let’s break this down visually in Figure 17.1:
Here, there are three distinct items (“products”) that you (might) want to create from your data. Each “flow” from data to the desired product has specific actions that need to happen. Now that you’ve seen how it is conceptually drawn out, there are some initial tasks you can work towards in your pipeline. In this session, we will set up the pipeline so that we can eventually build up the actions piece-by-piece to go from data to final product.
17.4 Using targets to manage a pipeline
There are a few packages that help build pipelines in R, but the most commonly used, well-designed, and maintained one is called targets. This package allows you to explicitly set “targets” (objects like the table or figure) you want to create as part of the pipeline. targets will then track all your targets for you: it tracks which target’s output is the input to another target, as well as which targets need to be updated when you make changes to your pipeline.
Ask participants which they think it is: a build dependency or a workflow dependency. Because it is directly used to run analyses and process the data, it would be a build dependency.
First, we need to install and add targets to our dependencies. Since targets is a build dependency, we’ll add it to the DESCRIPTION file with:
Console
use_package("targets")Now that it’s added as a dependency, let’s set up our project to start using targets! There’s a handy helper function that does this for us:
Console
targets::use_targets()
This command will create a _targets.R script file that we’ll use to define our analysis pipeline. Before we do that though, let’s commit the file to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).
If you are using an older version of the targets package, you might find that running targets::use_targets() also created a run.R and run.sh file. These files are used for other situations (like running on a Linux server) that we won’t cover in this workshop. You can safely delete them.
17.5 📖 Reading task: Inside the _targets.R file
Time: ~8 minutes.
For this reading task, you’ll take a look at what _targets.R contains. So, start by opening it on your computer.
Mostly, the _targets.R script contains comments and a basic setup to get you started. But notice the tar_target() function used at the end of the script. There are two main arguments for it: name and command. The way that targets works is similar to how you’d assign the output of a function to an object, so:
object_name <- function_in_command(input_arguments)
is the same as:
tar_target(
name = object_name,
command = function_in_command(input_arguments)
)
What this means is that targets follows a “function-oriented” workflow, not a “script-oriented” workflow. What’s the difference? In a script-oriented workflow, each R file/script is run in a specific order. As a result, you might end up with a “main” R file that has code that looks like:
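# The script names below are illustrative, not from this project; the point
# is simply that the scripts must be source()'d in a fixed order:
source("R/1-load-data.R")
source("R/2-process-data.R")
source("R/3-basic-statistics.R")
source("R/4-create-plot.R")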
Where it will run each of these R scripts in a specific order. Meanwhile, in a function-oriented workflow, it might look more like:
source("R/functions.R")
raw_data <- load_raw_data("file/path/data.csv")
processed_data <- process_data(raw_data)
basic_stats <- calculate_basic_statistics(processed_data)
simple_plot <- create_plot(processed_data)
With the function-oriented workflow, each function takes an input and contains all the code to create one result as its output. This could be, for instance, a figure in a paper.
If you’ve taken the intermediate R workshop, you’ll notice that this function-oriented workflow is the same workflow we covered there. This type of workflow has many advantages, which is why many powerful R packages are designed around it.
If we take the same code as above and convert it into the targets format, the end of _targets.R would look like this:
list(
# This part is special, as it tells targets that the
# character string is a file path.
tar_target(
name = data_file,
command = "file/path/data.csv",
format = "file"
),
tar_target(
name = raw_data,
command = load_raw_data(data_file)
),
tar_target(
name = processed_data,
command = process_data(raw_data)
),
tar_target(
name = basic_stats,
command = calculate_basic_statistics(processed_data)
),
tar_target(
name = simple_plot,
command = create_plot(processed_data)
)
)
So, each tar_target() is a step in the pipeline. The command is the function call, while the name is the name of the output of the function call.
In order for targets to effectively track, or “watch”, changes to a file, we have to tell it that a target is a file by using the format = "file" argument. This way, if the file changes, targets knows that it needs to re-run any targets that depend on it.
This relationship between objects and functions (because of the function-oriented workflow) can be represented exactly as a graph, with nodes and arrows. Each node is an R object and each arrow is the function that converts that object to another object. Since the targets pipeline is structured using a function-oriented workflow, the pipeline can also be visually represented as a graph, which we show in Figure 17.2.
And in fact, targets has some functionality to visually generate a graph of the targets and their connections, which we will cover later in this session.
17.6 Building up our first target
Walk through this section, especially the diagram of the pipeline. Emphasise that it helps to sketch and design this out first, before actually writing any code.
Now that we’ve conceptually learned about how targets works, let’s take a big step back and first graphically design our own pipeline. So we want to create the three items we mentioned before: a descriptive statistics table, a distribution plot of the lipidomic variables, and the final paper with these items included. We also have our data in a CSV file. So there are at least five targets we need to create in our pipeline. Let’s draw it out:
The very first target we want to create is to load in data from a file. So let’s add it as a step in the targets pipeline. Open up the _targets.R file and go to the end of the file, where the list() and tar_target() code is found. Delete all but one of the tar_target() inside the list().
To track the actual data file, we will need to create a pipeline target that defines the location of our file by using the argument format = "file".
_targets.R
list(
tar_target(
name = file,
command = "data/lipidomics.csv",
format = "file"
)
)
Let’s run targets to see what happens! You can either use the Palette (Ctrl-Shift-P, then type “targets run”) or run this code in the Console:
Console
targets::tar_make()
For the rest of the workshop we will use the Palette (Ctrl-Shift-P, then type “targets run”) (the “foreground” one) rather than the function in the Console.
After running it, you should see in one of RStudio’s panes something that looks like:
+ file dispatched
✔ file completed [4ms, 24.28 kB]
✔ ended pipeline [123ms, 1 completed, 0 skipped]
This tells us that the target called file has been run. Since we only have one target, that’s all that targets could run.
We also see that a new folder has been created called _targets/. Inside this folder, targets keeps all of the output from running the code in _targets.R. It comes with its own .gitignore file so that you don’t track all the files inside, as they don’t need to be tracked. Only the _targets/meta/meta file needs to be tracked in Git.
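As a rough sketch (the exact contents depend on your version of targets), the folder looks something like this:
_targets/
├── .gitignore    # created by targets so the folder’s contents aren’t tracked
├── meta/
│   └── meta      # the one file that should go into Git
└── objects/      # the stored outputs of each target (e.g., the lipidomics data)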
We can visualize our pipeline now too! This can be useful as you add more and more targets. We will (likely) need to install an extra package (done automatically):
Console
targets::tar_visnetwork()
We can also use the Palette (Ctrl-Shift-P, then type “targets visual”), which is what we will use for the rest of the workshop rather than the Console command.
Walkthrough and explain the visualisation. Take your time moving things around, zooming in and out, and clicking around.
If you want to check which targets are outdated, use:
Console
targets::tar_outdated()
Or use the Palette (Ctrl-Shift-P, then type “targets outdated”).
Now, let’s actually load in the data as a target. We’ll create a new tar_target() that reads in the data using readr::read_csv(), with the file target as input. In the _targets.R file, in the list() section at the bottom, add another tar_target() that looks like:
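_targets.R
  # Add this inside list(), after the existing file target:
  tar_target(
    name = lipidomics,
    command = readr::read_csv(file, show_col_types = FALSE)
  )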
We use show_col_types = FALSE to suppress the message explaining which columns readr is reading into R. Now, let’s run targets again using the Palette (Ctrl-Shift-P, then type “targets run”). It should now also show something like:
+ lipidomics dispatched
✔ lipidomics completed [109ms, 4.88 kB]
✔ ended pipeline [234ms, 1 completed, 1 skipped]
After it runs, check the visualisation of it with the Palette (Ctrl-Shift-P, then type “targets visual”). You can now see that there are two targets, with an arrow from file to lipidomics.
Walk through this again, this time showing the new target that was added and the arrow between the targets.
How can we access that new lipidomics data that targets created? For that, we use the function targets::tar_read(). targets stores the output of each target in its own internal storage system, which is kept in the _targets/ folder. So when we use targets::tar_read(), we access any stored target item by its bare (unquoted) name, like we would with dplyr.
So, in the Console, let’s see what our data looks like by running:
Console
targets::tar_read(lipidomics)
# A tibble: 504 × 6
code gender age class metabolite value
<chr> <chr> <dbl> <chr> <chr> <dbl>
1 ERI109 M 25 CT TMS (internal standard) 208.
2 ERI109 M 25 CT Cholesterol 19.8
3 ERI109 M 25 CT Lipid CH3- 1 44.1
4 ERI109 M 25 CT Lipid CH3- 2 147.
5 ERI109 M 25 CT Cholesterol 27.2
6 ERI109 M 25 CT Lipid -CH2- 587.
7 ERI109 M 25 CT FA -CH2CH2COO- 31.6
8 ERI109 M 25 CT PUFA 29.0
9 ERI109 M 25 CT Phosphatidylethanolamine 6.78
10 ERI109 M 25 CT Phosphatidycholine 41.7
# ℹ 494 more rows
We use this syntax whenever we want to output or print a target object, including for tables and figures.
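For example, once we have a table target like the hypothetical basic_stats from the sketch earlier, a code chunk in the manuscript could print it with:
# basic_stats is an illustrative target name from the earlier sketch:
targets::tar_read(basic_stats)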
We’re now ready to start building our first item/product! But first, since we finished writing some code, let’s style _targets.R using the Palette (Ctrl-Shift-P, then type “style file”). Before continuing, let’s commit the changes we’ve made (including the files in the _targets/ folder) by adding them to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”). Then push the changes to GitHub.
17.7 📖 Reading task: Fixing issues in the stored pipeline data
Time: ~10 minutes.
Sometimes you need to start from the beginning and clean everything up because there’s an issue that you can’t seem to fix. In this case, targets has a few functions to help out. Here are four that you can use to delete things (also described in the targets book); a small usage sketch follows the list:
- tar_invalidate(): This removes the metadata on the target in the pipeline, but doesn’t remove the stored object itself (which tar_delete() does). This tells targets that the target is out of date, since its metadata has been removed, even though the data object itself is still present. You can use this like you would select(), by naming the targets directly or using the tidyselect helpers. For example, tar_invalidate(everything()) will remove all metadata details for all targets in the pipeline, and tar_invalidate(starts_with("data_")) will remove all metadata details for targets that start with data_.
- tar_delete(): This deletes the stored objects (e.g. the lipidomics object) inside _targets/, but does not delete the record in the pipeline. So targets will see that the pipeline doesn’t need to be rebuilt. This is useful if you want to remove some data because it takes up a lot of space or, in the case of GDPR and privacy rules, because you don’t want to store any sensitive personal health data in your project. Use it like tar_invalidate(), with helpers like everything() or starts_with(). For example, tar_delete(everything()) will delete all stored objects in _targets/ and tar_delete(starts_with("data_")) will delete all stored objects that start with data_.
- tar_prune(): This function is useful for cleaning up leftover or unused objects in the _targets/ folder. You will probably not use this function too often.
- tar_destroy(): The most destructive, and probably more commonly used, function. This will delete the entire _targets/ folder for those times when you want to start over and re-run the entire pipeline again.
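As a rough sketch of how these could be called from the Console (the data_ naming pattern is only illustrative, not from our pipeline):
targets::tar_invalidate(starts_with("data_")) # forget the metadata of matching targets
targets::tar_delete(everything())             # delete all stored objects, keep the pipeline records
targets::tar_prune()                          # clean up leftover objects not used by the current pipeline
targets::tar_destroy()                        # wipe the entire _targets/ folder to start over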
17.8 Summary
- Use a function-oriented workflow together with targets to build your data analysis pipeline and track your pipeline “targets”.
- List individual “pipeline targets” using tar_target() within the _targets.R file and run them with targets::tar_make().
- Visualize target items in your pipeline with targets::tar_visnetwork() or list outdated items with targets::tar_outdated().
- Use targets::tar_read() to access saved pipeline outputs.
17.9 Survey
Please complete the survey for this session:
17.10 Code used in session
This lists some, but not all, of the code used in the section. Some code is incorporated into Markdown content, so is harder to automatically list here in a code chunk. The code below also includes the code from the exercises.
use_package("targets")
targets::use_targets()
list(
tar_target(
name = file,
command = "data/lipidomics.csv",
format = "file"
)
)
targets::tar_make()
targets::tar_visnetwork()
targets::tar_outdated()
list(
tar_target(
name = file,
command = "data/lipidomics.csv",
format = "file"
),
tar_target(
name = lipidomics,
command = readr::read_csv(file, show_col_types = FALSE)
)
)
targets::tar_read(lipidomics)