
6  Creating automatic analysis pipelines

Before beginning, get them to recall what they remember of the previous session, either with something like Mentimeter or verbally. Preferably something like Mentimeter because it allows everyone to participate, not just the ones who are more comfortable being vocal to the whole group.

Depending on how much time you’ve spent working on data analyses, you have probably experienced (many) times where you are working on a project and forget what code needs to be run first, what order other code needs to run in, and what pieces of code need to be re-run in order to update other results. Things get confusing quickly, even for fairly simple projects. This probably happens most often when you return to a project after a month or two and completely forget the state of the project and analysis.

This is where formal data analysis pipeline tools come in and play a role. By setting up your analysis in distinct steps with clear inputs and outputs, and using a system that tracks those inputs and outputs, you can make things a lot easier for yourself and others. This session is about applying tools that make and manage these pipelines.

6.1 Learning objectives

The overall objective for this session is to:

  1. Identify and apply an approach to create an analysis pipeline that makes your analysis steps, from raw data to finished manuscript, explicitly defined so that updating it by either you or collaborators is as easy as running a single function.

More specific objectives are to:

  1. Describe the computational meaning of pipeline and how pipelines are often done in research. Explain why a well-designed pipeline can streamline your collaboration, reduce time spent doing an analysis, make your analysis steps explicit and easier to work on, and ultimately contribute to more fully reproducible research.
  2. Explain the difference between a “function-oriented” workflow vs a “script-oriented” workflow and why the function-based approach has multiple advantages from a time- and effort-efficiency point of view.
  3. Use the functions within {targets} to apply the concepts of building pipelines in your analysis project.
  4. Continue applying the concepts and functions used from the previous session.

6.2 Exercise: How do you re-run analyses when something changes?

Time: ~14 minutes.

We’ve all been in situations where something in our analysis needs to change. Maybe we forgot to remove a certain condition (like unrealistic BMI). Or maybe our supervisor suggests something we hadn’t considered in the analysis. Or maybe during peer review of our manuscript, a reviewer makes a suggestion that would improve the understanding of the paper. Whatever the situation, we inevitably need to re-run our analysis. And depending on what the change was, we might need to run the full analysis all over again. So what is your exact workflow when you need to re-run code and update your results? Assume it’s a change somewhere early in the data processing stage.

  1. Take 2 min to think about the workflow you use. Try to think of the exact steps you need to take, what exactly you do, and how long that usually takes.
  2. For 8 min, in your group share and discuss what you’ve thought. How do your experiences compare to each other?
  3. For the remaining time, each group will briefly share with everyone what they thought about and discussed.

6.3 What is a data analysis “pipeline”?

After they finish reading this section, briefly walk through it. In particular, emphasize what we want to make at the end, even though that goal might change as the analysis progresses.

Reading task: ~10 minutes

A pipeline can be any process where the steps between a start and an end point are very clear, explicit, and concrete. These distinct steps can be manual and human-involved, or they can be completely automated by a robot or computer. For instance, in car factories, the pipeline from the input raw materials to the output vehicle is extremely well described and implemented. Or, during the pandemic, the pipeline for testing (at least in Denmark and several other countries) was highly structured and clear both for the workers doing the testing and for the people being tested: a person goes in, scans their ID card, has the test done, the worker inputs the results, and the results get sent immediately to the health agency as well as to the person based on their ID contact information (or via a secure app).

However, in research, especially around data collection and analysis, we often hear or read about “pipelines”. But looking closer, these aren’t actual pipelines because the individual steps are not very clear and not well described, often requiring a fair amount of manual human attention and intervention. Particularly within computational environments, a pipeline is when there is minimal to no human intervention from raw input to finished output. Why aren’t these data “pipelines” in research actual pipelines? Because:

  1. Anything with data ultimately must be on the computer,
  2. Anything automatically done on the computer must be done with code,
  3. Not all researchers write code,
  4. Researchers who do write code rarely publish and share it,
  5. Code that is shared or published (either publicly or within the research group) is not written in a way that allows a pipeline to exist,
  6. And, research is largely non-reproducible (13).

A data analysis pipeline would by definition be a reproducible, readable, and code-based data analysis. We researchers as a group have a long way to go before we can start realistically implementing data analysis pipelines.

This isn’t to diminish the work of researchers, but a basic observation on the systemic, social, and structural environment surrounding us. We as researchers are not trained in writing code, nor do we have a strong culture and incentive structure around learning, sharing, reviewing, and improving code. Nor are we often allowed to get (or use) funds to hire people who are trained and skilled in programmatic thinking and programming. Otherwise courses like this wouldn’t need to exist 🤷

So how would a data analysis pipeline look? Before we get to that though, we have to separate two things: exploratory data analysis and final paper data analysis. In exploratory data analysis, there will likely be a lot of manual, interactive steps involved that may or may not need to be explicitly stated and included in the analysis plan and pipeline. But for the final paper and what results would be included, we generally have some basic first ideas. Let’s list a few items that we would want to do before the primary statistical analysis:

  1. A table of some basic descriptive statistics of the study population, such as mean, standard deviation, or counts of basic discrete data (like treatment group).
  2. A figure visualizing the counts or other statistics for variables of interest that are discrete/categorical. For the data in this course, that would be variables like gender and class.
  3. A figure showing the distribution of your main variables of interest. In this case, ours are the lipidomic variables.
  4. The paper with the results included.
%%{init:{'theme':'forest', 'flowchart':{'nodeSpacing': 20, 'rankSpacing':30}}}%%
graph LR
    data[Data] --> fn_desc{{function}} 
    fn_desc --> tab_desc[Table:<br>descriptive<br>statistics]
    data --> fn_plot{{function}}
    fn_plot{{function}} --> plot_discr[Plot:<br>Discrete<br>variables]
    tab_desc --> paper[Paper]
    plot_discr --> paper
    data --> fn_plot_vars{{function}}
    fn_plot_vars{{function}} --> plot_distrib[Plot:<br>Continuous<br>variables]
    plot_distrib --> paper

linkStyle 0,1,2,3,4,5,6,7,8 stroke-width:1px;
Figure 6.1: Very simple flow chart of steps in an analysis pipeline.

Now that we conceptually have drawn out the sets of tasks to complete in our pipeline, we can start using R to build it.

6.4 Using targets to manage the pipeline

There are a few packages to help build pipelines in R, but the most commonly used, well-designed, and maintained one is called {targets}. With this package, you specify outputs you want to create and {targets} will track them for you. So it will know which output depends on which other one and which ones need to be updated.

Ask participants which they think it is: a build dependency or a workflow dependency? Because it is directly used to run analyses and process the data, it is a build dependency.

First, since we are using {renv}, we need to install {targets} in the project environment. And since {targets} is a build dependency, we add it to the DESCRIPTION file with:

use_package("targets")
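
If {targets} isn’t already installed in the project library, one way to install it there and record the version in the renv.lock file is this sketch (assuming a default {renv} setup):

# Install {targets} into the {renv} project library and record it in renv.lock.
renv::install("targets")
renv::snapshot()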

Now that it’s added to the project R library, let’s set up our project to start using it!

targets::use_targets()

This will add several files:

.
├── _targets.R
├── run.R
└── run.sh

The most important file is the _targets.R file. The other two files are used for other situations (like running on a Linux server) that we won’t cover in the course. Before we continue though, let’s commit these new files to the Git history.

Let them read it before going over it again to reinforce function-oriented workflows and how {targets} and tar_target() work.

Reading task: ~8 minutes

Next, open up the _targets.R file and we’ll take a look at what it contains.

# Created by use_targets().
# Follow the comments below to fill in this target script.
# Then follow the manual to check and run the pipeline:
#   https://books.ropensci.org/targets/walkthrough.html#inspect-the-pipeline

# Load packages required to define the pipeline:
library(targets)
# library(tarchetypes) # Load other packages as needed.

# Set target options:
tar_option_set(
  packages = c("tibble") # packages that your targets need to run
  # format = "qs", # Optionally set the default storage format. qs is fast.
  #
  # For distributed computing in tar_make(), supply a {crew} controller
  # as discussed at https://books.ropensci.org/targets/crew.html.
  # Choose a controller that suits your needs. For example, the following
  # sets a controller with 2 workers which will run as local R processes:
  #
  #   controller = crew::crew_controller_local(workers = 2)
  #
  # Alternatively, if you want workers to run on a high-performance computing
  # cluster, select a controller from the {crew.cluster} package. The following
  # example is a controller for Sun Grid Engine (SGE).
  # 
  #   controller = crew.cluster::crew_controller_sge(
  #     workers = 50,
  #     # Many clusters install R as an environment module, and you can load it
  #     # with the script_lines argument. To select a specific verison of R,
  #     # you may need to include a version string, e.g. "module load R/4.3.0".
  #     # Check with your system administrator if you are unsure.
  #     script_lines = "module load R"
  #   )
  #
  # Set other options as needed.
)

# tar_make_clustermq() is an older (pre-{crew}) way to do distributed computing
# in {targets}, and its configuration for your machine is below.
CLUSTERMQ

# tar_make_future() is an older (pre-{crew}) way to do distributed computing
# in {targets}, and its configuration for your machine is below.
FUTURE

# Run the R scripts in the R/ folder with your custom functions:
tar_source()
# source("other_functions.R") # Source other scripts as needed.

# Replace the target list below with your own:
list(
  tar_target(
    name = data,
    command = tibble(x = rnorm(100), y = rnorm(100))
    # format = "feather" # efficient storage for large data frames
  ),
  tar_target(
    name = model,
    command = coefficients(lm(y ~ x, data = data))
  )
)

Next, notice the tar_target() function used at the end of the script. It has two main arguments: name and command. The way that {targets} works is similar to how you’d write R code to assign the output of a function to an object.

object_name <- function_in_command(input_arguments)

Is the same as:

tar_target(
  name = object_name,
  command = function_in_command(input_arguments)
)

What this means is that {targets} follows a “function-oriented” workflow, not a “script-oriented” workflow. What’s the difference? In a script-oriented workflow, each R file is run in a specific order, so you might end up with an R file that has code like:

source("R/1-process-data.R")
source("R/2-basic-statistics.R")
source("R/3-create-plots.R")
source("R/4-linear-regression.R")

While in a function-oriented workflow, it might look more like:

source("R/functions.R")
raw_data <- load_raw_data("file/path/data.csv")
processed_data <- process_data(raw_data)
basic_stats <- calculate_basic_statistics(processed_data)
simple_plot <- create_plot(processed_data)
model_results <- run_linear_reg(processed_data)

With this workflow, each function takes an input (like a dataset) and contains all the code needed to create one output, like a figure in a paper. If you’ve taken the intermediate R course, you’ll notice that this function-oriented workflow is the one we covered there. There are many advantages to this type of workflow, and it is the reason many powerful R packages are designed around it.

If we take this same code and convert it into the {targets} format, the end of the _targets.R file would look like this:

list(
  tar_target(
    name = raw_data,
    command = load_raw_data("file/path/data.csv")
  ),
  tar_target(
    name = processed_data,
    command = process_data(raw_data)
  ),
  tar_target(
    name = basic_stats,
    command = calculate_basic_statistics(processed_data)
  ),
  tar_target(
    name = simple_plot,
    command = create_plot(processed_data)
  ),
  tar_target(
    name = model_results,
    command = run_linear_reg(processed_data)
  )
)

Let’s start writing code to create the four items we listed above: some descriptive statistics, a plot of some discrete data, a plot of the continuous lipid variables, and a report (R Markdown). Since we’ll use {tidyverse}, specifically {dplyr}, to calculate the summary statistics, we need to add it to our dependencies and install it in the project library. {tidyverse} is a special “meta”-package so we need to add it to the "depends" section of the DESCRIPTION file.

use_package("tidyverse", "depends")
use_package("dplyr")

Commit the changes made to the renv.lock and DESCRIPTION files to the Git history. Then open up the doc/lesson.qmd file and create a new header and code chunk at the bottom of the file.

## Basic statistics

```{r setup}
library(tidyverse)
source(here::here("R/functions.R"))
lipidomics <- read_csv(here::here("data/lipidomics.csv"))
```

```{r basic-stats}

```

We want to calculate the mean and SD for each metabolite and then, to make the output more readable, round the numbers to one digit. We covered this in the functionals session of the intermediate course, so we will apply the same principles and code here. To do that, we need to group_by() the metabolites, use across() inside summarise() so we can give it the mean() and sd() functions, and then mutate() each numeric column (across(where(is.numeric))) to round() to 1 digit. Let’s write out the code!

lipidomics %>%
  group_by(metabolite) %>%
  summarise(across(value, list(mean = mean, sd = sd))) %>%
  mutate(across(where(is.numeric), round, digits = 1))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `across(where(is.numeric), round, digits = 1)`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function
instead.

  # Previously
  across(a:b, mean, na.rm = TRUE)

  # Now
  across(a:b, \(x) mean(x, na.rm = TRUE))
# A tibble: 12 × 3
   metabolite               value_mean value_sd
   <chr>                         <dbl>    <dbl>
 1 CDCl3 (solvent)               180       67  
 2 Cholesterol                    18.6     11.4
 3 FA -CH2CH2COO-                 33.6      7.8
 4 Lipid -CH2-                   537.      61.9
 5 Lipid CH3- 1                   98.3     73.8
 6 Lipid CH3- 2                  168.      29.2
 7 MUFA+PUFA                      32.9     16.1
 8 PUFA                           30       24.1
 9 Phosphatidycholine             31.7     20.5
10 Phosphatidylethanolamine       10        7.6
11 Phospholipids                   2.7      2.6
12 TMS (interntal standard)      123      130. 
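
The warning comes from how extra arguments like digits are passed to across(): newer versions of {dplyr} prefer an anonymous function. A sketch of the same summary written in that style (optional; the rest of this session keeps the version above):

lipidomics %>%
  group_by(metabolite) %>%
  summarise(across(value, list(mean = mean, sd = sd))) %>%
  # Pass `digits` through an anonymous function instead of `...`.
  mutate(across(where(is.numeric), \(x) round(x, digits = 1)))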

After that, style the file using the Palette (Ctrl-Shift-P, then type “style file”). Then we will commit the changes to the Git history.

6.5 Exercise: Convert summary statistics code into a function

Time: ~25 minutes.

While inside the doc/lesson.qmd file, use the “function-oriented” workflow, as taught in the intermediate course, to take the code we wrote above and convert it into a function. Complete these tasks:

  1. Wrap the code with function() {...} and name the new function descriptive_stats.
  2. Replace lipidomics with data and put data as an argument inside the brackets of function().
  3. Add dplyr:: to the start of each {dplyr} function used inside your function (except for where(), which comes from the {tidyselect} package).
  4. Style the code using the Palette (Ctrl-Shift-P, then type “style file”) to make sure it is formatted correctly. You might need to manually force a styling if lines are too long.
  5. With the cursor inside the function, add some roxygen documentation with Ctrl-Shift-P followed by typing “roxygen comment”. Remove the lines that contain @examples and @export, then fill in the other details (like the @param items and the title). In the @return section, write “A data.frame/tibble.”
  6. Cut and paste the function over into the R/functions.R file.
  7. Source the R/functions.R file (Ctrl-Shift-S) and then test the code by running descriptive_stats(lipidomics) in the Console. If it works, do the last task.
  8. Save both files and then open the Git interface and commit the changes you made to them.
Note

In the intermediate course, we highly suggested using return() at the end of the function. Technically we don’t need an explicit return(), since the output of the last code that R runs within the function will be the output of the function. This is called an “implicit return” and we will be using this feature throughout the rest of this course.
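
As a small illustration (a made-up toy function, not part of the project code), these two versions return exactly the same thing:

# Explicit return.
add_one <- function(x) {
  return(x + 1)
}

# Implicit return: the last evaluated expression is the function's output.
add_one <- function(x) {
  x + 1
}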

Here is some scaffolding to help you get started:

descriptive_stats <- function(___) {
  ___
}
Click for a potential solution. Only click if you are struggling or are out of time.
#' Calculate descriptive statistics of each metabolite.
#'
#' @param data Lipidomics dataset.
#'
#' @return A data.frame/tibble.
#'
descriptive_stats <- function(data) {
  data %>%
    dplyr::group_by(metabolite) %>%
    dplyr::summarise(dplyr::across(value, list(mean = mean, sd = sd))) %>%
    dplyr::mutate(dplyr::across(tidyselect::where(is.numeric), round, digits = 1))
}

6.6 Adding a step in the pipeline

Now that we’ve created a function to calculate some basic statistics, we can add it as a step in the {targets} pipeline. Open up the _targets.R file and go to the end of the file, where the list() and tar_target() code are found. In the first tar_target(), replace the example target with one that loads the lipidomics data. In the second, replace it with one that calls the descriptive_stats() function. To make it easier to remember what the target output is, we can add df_ to the name to remind us that it is a data frame. It should look like:

list(
  tar_target(
    name = lipidomics,
    command = readr::read_csv(here::here("data/lipidomics.csv"))
  ),
  tar_target(
    name = df_stats_by_metabolite,
    command = descriptive_stats(lipidomics)
  )
)

We haven’t added {readr} as a dependency yet, so let’s add it to the DESCRIPTION file.

use_package("readr")

Let’s run {targets} to see what happens! You can either use the Command Palette (Ctrl-Shift-P, then type “run targets”) or run this code in the Console:

targets::tar_make()

Because of the way {targets} works, we first need to create a pipeline target that is only the path to the data file, using format = "file" so that {targets} tracks changes to the file itself, and then we can load the file with {readr} in the next target.

list(
  tar_target(
    name = file,
    command = "data/lipidomics.csv",
    format = "file"
  ),
  tar_target(
    name = lipidomics,
    command = readr::read_csv(file, show_col_types = FALSE)
  ),
  tar_target(
    name = df_stats_by_metabolite,
    command = descriptive_stats(lipidomics)
  )
)

Now, let’s try running {targets} again! You can either use the Command Palette (Ctrl-Shift-P, then type “run targets”) or run this code in the Console:

targets::tar_make()

It probably won’t run though. That’s because {targets} doesn’t know about the packages that you need for the pipeline. To fix this, we need to go to the tar_option_set() section of the _targets.R file and replace packages = c("tibble") with all the packages we use. Instead of typing each package out, we can generate the list with code, since we already track the packages in the DESCRIPTION file. To access the dependencies, we can use this function:

desc::desc_get_deps()$package

But this code also picks up the dependency on R itself. Usually the R dependency is listed as the first item. If it isn’t, we will run:

use_tidy_description()

This function tidies up the DESCRIPTION file so that R is now the first item. Using the same code as above from the {desc} package, we can drop R with [-1] and put the result in place of the packages argument in the _targets.R file:

packages = desc::desc_get_deps()$package[-1]
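
In context, the relevant lines near the top of _targets.R would then look something like this sketch (the template’s other commented-out options are omitted):

library(targets)

tar_option_set(
  # All packages listed in DESCRIPTION, minus the first entry (R itself).
  packages = desc::desc_get_deps()$package[-1]
)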

Try running {targets} again with either targets::tar_make() or the Command Palette (Ctrl-Shift-P, then type “run targets”). It should run through! We also see that a new folder has been created called _targets/. Since we don’t want to track this in Git, we should ignore it with:

use_git_ignore("_targets")

We can visualize our individual pipeline targets that we track through tar_target() now too, which can be useful as you add more and more targets. We will (likely) need to install an extra package (done automatically):

targets::tar_visnetwork()

Or to see what pipeline targets are outdated:

targets::tar_outdated()

Before continuing, let’s commit the changes to the Git history.

6.7 Exercise: Update function to also calculate median and IQR

Time: ~8 minutes.

Let’s make a change to our function and test out how the tar_outdated() and tar_visnetwork() work.

  1. Open up the R/functions.R file.
  2. Add the median and interquartile range (IQR) to the summarise() function by adding them to the end of list(mean = mean, sd = sd), after the second sd. Note that IQR should be written as iqr = IQR, since we want the output column names to be lowercase.
  3. Run tar_outdated() and tar_visnetwork() in the Console (or by using the Command Palette Ctrl-Shift-P, then “targets outdated” or “targets visualize”). What does it show?
  4. Style using the Palette (Ctrl-Shift-P, then type “style file”). You might need to force a reformat if the code is too long by highlighting the line and using Ctrl-Shift-P, then “reformat”.
  5. Run tar_make() in the Console (or Ctrl-Shift-P, then “targets run”, selecting the “background” option). Re-check for outdated targets and visualize the network again.
  6. Open up the Git interface and commit the changes to the Git history.
Click for a potential solution. Only click if you are struggling or are out of time.
#' Calculate descriptive statistics of each metabolite.
#'
#' @param data Lipidomics dataset.
#'
#' @return A data.frame/tibble.
#'
descriptive_stats <- function(data) {
  data %>%
    dplyr::group_by(metabolite) %>%
    dplyr::summarise(dplyr::across(value, list(
      mean = mean,
      sd = sd,
      median = median,
      iqr = IQR
    ))) %>%
    dplyr::mutate(dplyr::across(tidyselect::where(is.numeric), round, digits = 1))
}

6.8 Creating figure outputs

Not only can we create data frames with {targets} (like above), but we can also create figures. Let’s write the code to create the plots for items 2 and 3 from the list we made at the start of this session. Since we’re using {ggplot2} to write this code, let’s add it to our DESCRIPTION file and to our {renv} project library.

use_package("ggplot2")

Next, we’ll write the code to create item 2: The bar plot of the counts of gender and class. We’ll keep it simple for now. Because the data is structured in a long format, we need to trim it down to only the unique cases of code, gender, and class with distinct(). Then we can make the bar plot. We’ll use position = "dodge" for the bars to be side by side, rather than stacked.

gender_by_class_plot <- lipidomics %>%
  distinct(code, gender, class) %>%
  ggplot(aes(x = class, fill = gender)) +
  geom_bar(position = "dodge")
gender_by_class_plot

Next, let’s do item 3, the distribution of each metabolite. Here we’ll use geom_histogram(), nothing too fancy. And since the data is already in long format, we can easily use facet_wrap() to create a plot for each metabolite. We use scales = "free" because each metabolite doesn’t have the same range of values (some are small, others are quite large).

metabolite_distribution_plot <- ggplot(lipidomics, aes(x = value)) +
  geom_histogram() +
  facet_wrap(vars(metabolite), scales = "free")
metabolite_distribution_plot
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
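
That message is just {ggplot2} telling us it used the default of 30 bins. If you’d rather not see it, one small tweak (not needed for the rest of this session) is to set the bin count explicitly:

# Same histogram, but with the number of bins stated explicitly,
# which avoids the default-bins message.
metabolite_distribution_plot <- ggplot(lipidomics, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(vars(metabolite), scales = "free")
metabolite_distribution_plot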

We now have the basic code to convert over into functions.

6.9 Exercise: Convert the plot code to a function

Time: ~20 minutes.

For now, we will only take the code to make the bar plot and convert it into a function. Just like you did with the descriptive_stats() function in the exercise above, complete these tasks:

  1. Wrap the code with function() {...} and name the new function plot_count_stats.
  2. Replace lipidomics with data and put data as an argument inside the brackets of function().
  3. Add ggplot2:: and dplyr:: to the start of each {ggplot2} and {dplyr} function used inside your function.
  4. Style using the Palette (Ctrl-Shift-P, then type “style file”) to make sure it is formatted correctly. You might need to manually force a styling if lines are too long.
  5. With the cursor inside the function, add some roxygen documentation with Ctrl-Shift-P followed by typing “roxygen comment”. Remove the lines that contain @examples and @export, then fill in the other details (like the @param items and the title). In the @return section, write “A plot object.”
  6. Cut and paste the function over into the R/functions.R file.
  7. Source the R/functions.R file (Ctrl-Shift-S) and then test the code by running plot_count_stats(lipidomics) in the Console. If it works, do the last task.
  8. Save both files and then open the Git interface and commit the changes you made to them.
  9. Repeat these tasks for the code for the metabolite distribution plot. Call this new function plot_distributions.

Use this scaffolding code to help guide you to write the code into a function.

plot_count_stats <- function(___) {
  ___
}

plot_distributions <- function(___) {
  ___
}
Click for the solution. Only click if you are struggling or are out of time.
## This should be in the R/functions.R file.
#' Plot for basic count data.
#'
#' @param data The lipidomics dataset.
#'
#' @return A ggplot2 graph.
#'
plot_count_stats <- function(data) {
  data %>%
    dplyr::distinct(code, gender, class) %>%
    ggplot2::ggplot(ggplot2::aes(x = class, fill = gender)) +
    ggplot2::geom_bar(position = "dodge")
}

#' Plot for basic distribution of metabolite data.
#'
#' @param data The lipidomics dataset.
#'
#' @return A ggplot2 graph.
#'
plot_distributions <- function(data) {
  data %>% 
    ggplot2::ggplot(ggplot2::aes(x = value)) +
    ggplot2::geom_histogram() +
    ggplot2::facet_wrap(ggplot2::vars(metabolite), scales = "free")
}

6.10 Adding the plots as pipeline targets

Now, let’s add the two functions as tar_target() items within the list() inside the _targets.R file. To make it easier to track things, add fig_ to the start of each name.

list(
  ...,
  tar_target(
    name = fig_gender_by_class,
    command = plot_count_stats(lipidomics)
  ),
  tar_target(
    name = fig_metabolite_distribution,
    command = plot_distributions(lipidomics)
  )
)

Next, test that it works by running targets::tar_visnetwork() (or Ctrl-Shift-P, then type “targets visual”) or targets::tar_outdated(). You should see that the new items are “outdated”. Then run targets::tar_make() (Ctrl-Shift-P, then “targets run”) to update the pipeline. If it all works, then commit the changes to the Git history.
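
For reference, that check-and-rebuild sequence can also be run directly in the Console:

targets::tar_outdated()    # list which targets are out of date
targets::tar_make()        # re-build the outdated targets
targets::tar_visnetwork()  # confirm everything is connected and up to date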

6.11 Incorporating targets with R Markdown

Last, but not least, we want to make our final item: The R Markdown document. Adding an R Markdown document as a target inside _targets.R is fairly straightforward. We need to install the helper package {tarchetypes} first:

use_package("tarchetypes")

Then, inside _targets.R, uncomment the line where library(tarchetypes) is commented out. The function we need to use to build the R Markdown file is tar_render(), which needs two things: the name (like tar_target() needs) and the file path to the R Markdown file. Again, like the other tar_target() items, add it to the end of the list(). Since we’re using doc/lesson.qmd as a sandbox, we won’t include it as a pipeline target. Instead we will use the doc/report.Rmd file:

list(
  ...,
  tar_render(
    name = report_rmd, 
    path = "doc/report.Rmd"
  )
)

Now when we run targets::tar_make() (Ctrl-Shift-P, then type “targets run”), the R Markdown file also gets re-built. But when we use targets::tar_visnetwork(), we don’t see the connections with plots and descriptive statistics. That’s because we haven’t used them in a way {targets} can recognize. For that, we need to use the function targets::tar_read().

Let’s open up the doc/report.Rmd file, add a setup code chunk below the YAML header, and then create a new header and code chunk that make use of targets::tar_read():

```{r setup}
library(tidyverse)
library(targets)
lipidomics <- tar_read(lipidomics)
```

Then, below the setup chunk, add a “Results” header and code chunks that read in each pipeline target:

## Results

```{r}
tar_read(df_stats_by_metabolite)
```
```{r}
tar_read(fig_gender_by_class)
```
```{r}
tar_read(fig_metabolite_distribution)
```

With targets::tar_read(), we can access all the stored target items using their bare names (without quotes), much like we do with {dplyr}. For the df_stats_by_metabolite target, we can do some minor wrangling with mutate() and glue::glue(), and then pipe it to knitr::kable() to create a table in the output document. The {glue} package is really handy for formatting text based on columns: if you use {} inside a quoted string, you can insert values from columns of a data frame, like value_mean. So we can use it to format the final table text as mean value (SD value):

targets::tar_read(df_stats_by_metabolite) %>% 
  mutate(MeanSD = glue::glue("{value_mean} ({value_sd})")) %>%
  select(Metabolite = metabolite, `Mean SD` = MeanSD) %>%
  knitr::kable(caption = "Descriptive statistics of the metabolites.")
Descriptive statistics of the metabolites.

| Metabolite               | Mean SD      |
|--------------------------|--------------|
| CDCl3 (solvent)          | 180 (67)     |
| Cholesterol              | 18.6 (11.4)  |
| FA -CH2CH2COO-           | 33.6 (7.8)   |
| Lipid -CH2-              | 536.6 (61.9) |
| Lipid CH3- 1             | 98.3 (73.8)  |
| Lipid CH3- 2             | 168.2 (29.2) |
| MUFA+PUFA                | 32.9 (16.1)  |
| PUFA                     | 30 (24.1)    |
| Phosphatidycholine       | 31.7 (20.5)  |
| Phosphatidylethanolamine | 10 (7.6)     |
| Phospholipids            | 2.7 (2.6)    |
| TMS (interntal standard) | 123 (130.4)  |

Re-run targets::tar_visnetwork() (Ctrl-Shift-P, then type “targets visual”) to see that it now detects the connections between the pipeline targets. Now, run targets::tar_make() (Ctrl-Shift-P, then type “targets run”) again to see everything re-build! The last things to do are to re-style using the Palette (Ctrl-Shift-P, then type “style file”), commit the changes to the Git history, and then push your changes up to GitHub.

6.13 Fixing issues in the stored pipeline data

Reading task: ~10 minutes

Like with {renv}, sometimes you need to start from the beginning and clean everything up because there’s an issue that you can’t seem to fix. In this case, {targets} has a few functions to help out. Here are four that you can use to delete stored pipeline data (also described in the {targets} book):

tar_invalidate()

This removes the metadata of the target in the pipeline, but doesn’t remove the stored object itself (which tar_delete() does). This tells {targets} that the target is out of date, since its record has been removed, even though the data object itself is still there. You can use this like you would select(), by naming the targets directly or by using the {tidyselect} helpers (e.g. everything(), starts_with()).
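
For example, using target names from this session’s pipeline, the selection could look like this sketch:

# Invalidate a single target by name.
targets::tar_invalidate(df_stats_by_metabolite)

# Invalidate every target whose name starts with "fig_".
targets::tar_invalidate(starts_with("fig_"))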

tar_delete()

This deletes the stored objects (e.g. the lipidomics or df_stats_by_metabolite) inside _targets/, but does not delete the record in the pipeline. So {targets} will see that the pipeline doesn’t need to be rebuilt. This is useful if you want to remove some data because it takes up a lot of space, or, in the case of GDPR and privacy rules, you don’t want to store any sensitive personal health data in your project. Use it like tar_invalidate(), with functions like everything() or starts_with().
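
A sketch, again using targets from this session:

# Delete the stored figure objects inside _targets/, while keeping the
# pipeline record so {targets} doesn't consider them outdated.
targets::tar_delete(starts_with("fig_"))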

tar_prune()

This function is useful to help clean up left over or unused objects in the _targets/ folder. You will probably not use this function too often.

tar_destroy()

The most destructive, and probably the most commonly used, of these functions. It deletes the entire _targets/ folder for those times when you want to start over and re-run the entire pipeline from scratch.
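
Using it is a single call (by default it should ask for confirmation when run interactively):

# Delete the whole _targets/ folder so the next tar_make() starts from scratch.
targets::tar_destroy()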

6.14 Summary

  • Use a function-oriented workflow together with {targets} to build your data analysis pipeline and track your “pipeline targets”.
  • List individual “pipeline targets” using tar_target() within the _targets.R file.
  • Visualize target items in your pipeline with targets::tar_visnetwork() or list outdated items with targets::tar_outdated().
  • Within R Markdown files, use targets::tar_read() to access saved pipeline outputs. To include the R Markdown in the pipeline, use {tarchetypes} and the function tar_render().
  • Delete stored pipeline output with tar_delete().