21 Extend your analysis to run many models

🧑‍🏫 Teacher note

Before beginning, get them to recall what they remember of the previous session, either with something like Mentimeter, verbally, or using hands / stickies / origami hats. Preferably something that will allow everyone to participate, not just the ones who are more comfortable being vocal to the whole group.

Depending on what they write, you may need to briefly go over the previous session.

Rarely do we run only one single statistical model to answer one single question, especially in our data-overflowing environments. An initial instinct when faced with this task might be to copy-and-paste the same code used for one analysis to another, then slightly modify the code each time. Or, if you have heard of loops or used them in other programming languages, you might think to create a loop. Thankfully R uses something more powerful and expressive than either of those approaches, and that is functional programming. Using functional programming concepts, we can use little code to express complex actions and run large numbers of statistical analyses. This session will be about using functional programming in the context of statistical analysis.

21.1 Learning objectives

Recall principles of functional programming.
Design analyses to make use of the principles of functional programming and the split-apply-combine technique.
Apply these principles to write code that does many things with little code, in particular by using the purrr package.

21.2 📖 Reading task: Apply logistic regression to each metabolite with functional programming

🧑‍🏫 Instructor note

Reinforce the concept of functional programming by briefly going over:

Figure 8.2 that visualizes functionals like map().
Highlight the use of the anonymous function \(x) syntax to pass arguments to functions within functionals.
The split-apply-combine figure, emphasizing in particular how data is split into smaller pieces but contained as a list.

Time: ~15 minutes.

Functional programming underlies many core features of running statistical methods on data. Because it is such an important component for this session, you’ll briefly review the concepts of functional programming by reading some sections from the intermediate workshop:

Read only section 8.3 in the Function Programming session.
Read only the section 10.2 about split-apply-combine.

Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩

21.3 💬 Discussion activity: Design a flow for running many models

Time: ~12 minutes.

Using what you just read about functional programming and split-apply-combine, take some time to brainstorm how you might design the rest of the analysis. As with any code, there are many ways of writing code to achieve the same goal, so no single answer is “correct”. The code we will eventually write later in this session is only one of other ways of doing this. Even still, try not to look ahead! 😉

For 4 minutes, sketch out and design on your own a flow to analyse all the metabolites and the models without using much more code.
With your neighbour, take 4 minutes to share what you both brainstormed about and try to improve on what you each came up with.
Then, as a group, we’ll take 4 minutes to discuss and share some ideas.

21.4 Using functional programming to run multiple models

There are many ways that you can run a model on each metabolite based on the lipidomics dataset. However, the “split-apply-combine” approach is a very good way of doing this. Go into your docs/learning.qmd file and go to the bottom of the file. Create a new header and code chunk that looks like:

docs/learning.qmd

## Running multiple models

```{r}

```

As we want to run two models for each metabolite and we’ve already been working with the cholesterol metabolite, let’s run those two models for cholesterol before we apply it to all metabolites. We’re going to eventually use the functionals from the purrr package. So let’s first add purrr as a build dependency:

Console

use_package("purrr")

Now, let’s go back to the docs/learning.qmd file and start building up the code. First, let’s make a subset of the data that’s ready to be fit with the model, like we did before. We’ll call it just_cholesterol again:

docs/learning.qmd

just_cholesterol <- lipidomics |>
  filter(metabolite == "Cholesterol") |>
  preprocess()

Recall from the diagrams above that, as much as possible, we want to work with list()s when dealing with multiple items like models. Also recall from the Chapter 19 session that we can treat the model formulas as objects too, that we can modify and work with. So let’s start with making a list of the two models: class ~ value and class ~ value + gender + age.

docs/learning.qmd

list(
  class ~ value,
  class ~ value + gender + age
)

[[1]]
class ~ value

[[2]]
class ~ value + gender + age

Now, let’s pipe this to the functional map() so that we can apply fit_model() to each model formula in the list(), given our data. The diagram below visualises what happens when you use map() and fit_model() on a list of models:

🧑‍🏫 Teacher note

Show this diagram and explain how map applies the function to each item inside of the list, creating a new list with the results.

Figure 21.1: Using map() to apply fit_model() to each model formula in a list.

Because the fit_model() function we created earlier takes two arguments, data and model, we need to tell map() how to use those arguments properly. Normally, map() will give the items from the list to the first argument of the function, which is data in our case. However, we want the items from the list to go to the model argument instead.

So, we have to use the “anonymous function” \(x) syntax to correctly put the model items. In the anonymous function, we’ll use model as an argument, since that is what is coming from list() and then use just_cholesterol as the data argument in fit_model().

docs/learning.qmd

list(
  class ~ value,
  class ~ value + gender + age
) |>
  map(\(model) fit_model(just_cholesterol, model = model))

[[1]]
# A tibble: 2 × 7
  metabolite  model         term    estimate std.error statistic p.value
  <chr>       <chr>         <chr>      <dbl>     <dbl>     <dbl>   <dbl>
1 Cholesterol class ~ value (Inter…     1.02     0.364    0.0425  0.966 
2 Cholesterol class ~ value value       2.65     0.450    2.17    0.0300

[[2]]
# A tibble: 4 × 7
  metabolite  model           term  estimate std.error statistic p.value
  <chr>       <chr>           <chr>    <dbl>     <dbl>     <dbl>   <dbl>
1 Cholesterol class ~ value … (Int…    1.11     1.29      0.0817  0.935 
2 Cholesterol class ~ value … value    2.97     0.458     2.38    0.0175
3 Cholesterol class ~ value … gend…    0.493    0.779    -0.907   0.365 
4 Cholesterol class ~ value … age      1.01     0.0377    0.183   0.855

Look at that! Two items in our output list, one for each model. Since it is nicer to work with data frames rather than lists, we can combine it together using list_rbind():

docs/learning.qmd

list(
  class ~ value,
  class ~ value + gender + age
) |>
  map(\(model) fit_model(just_cholesterol, model = model)) |>
  list_rbind()

# A tibble: 6 × 7
  metabolite  model           term  estimate std.error statistic p.value
  <chr>       <chr>           <chr>    <dbl>     <dbl>     <dbl>   <dbl>
1 Cholesterol class ~ value   (Int…    1.02     0.364     0.0425  0.966 
2 Cholesterol class ~ value   value    2.65     0.450     2.17    0.0300
3 Cholesterol class ~ value … (Int…    1.11     1.29      0.0817  0.935 
4 Cholesterol class ~ value … value    2.97     0.458     2.38    0.0175
5 Cholesterol class ~ value … gend…    0.493    0.779    -0.907   0.365 
6 Cholesterol class ~ value … age      1.01     0.0377    0.183   0.855

Amazing, we now have a data frame with the results from both models for cholesterol! Now, let’s convert this code into a function so that we can easily apply it to all metabolites. We’ll call this function fit_all_models(), which takes one argument, data, which will be the lipidomics data frame.

docs/learning.qmd

#' Fit all models to a given data frame.
#'
#' @param data The data frame to fit the models to.
#'
#' @returns A data frame with results from all models.
#'
fit_all_models <- function(data) {
  list(
    class ~ value,
    class ~ value + gender + age
  ) |>
    purrr::map(\(model) fit_model(data, model = model)) |>
    purrr::list_rbind()
}

docs/learning.qmd

lipidomics |>
  filter(metabolite == "Cholesterol") |>
  preprocess() |>
  fit_all_models()

# A tibble: 6 × 7
  metabolite  model           term  estimate std.error statistic p.value
  <chr>       <chr>           <chr>    <dbl>     <dbl>     <dbl>   <dbl>
1 Cholesterol class ~ value   (Int…    1.02     0.364     0.0425  0.966 
2 Cholesterol class ~ value   value    2.65     0.450     2.17    0.0300
3 Cholesterol class ~ value … (Int…    1.11     1.29      0.0817  0.935 
4 Cholesterol class ~ value … value    2.97     0.458     2.38    0.0175
5 Cholesterol class ~ value … gend…    0.493    0.779    -0.907   0.365 
6 Cholesterol class ~ value … age      1.01     0.0377    0.183   0.855

Woohoo! We have a function that fits all models to a given data frame. We won’t go over what each column and row means just yet, we’ll do that in the next session.

Alright, let’s style this file by using the Palette (Ctrl-Shift-P, then type “style file”) before it’s time for you to convert this into a function. Commit this to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”) and then push to GitHub.

21.5 🧑‍💻 Exercise: Piece it all together to run all models at once!

Time: ~15 minutes.

We’ve been building all our functions to have one argument, to input a data frame, and output a data frame. Because of that, we can now chain them together using map()! But first, we need to split the original lipidomics data frame into a list of data frames, one for each metabolite. There’s an incredibly useful function for that called group_split(). What it does can be visualized like this:

Figure 21.2: Using group_split(metabolite) to create a list of data frames, each contains one metabolite.

Caution

Splitting your data by one or more variables into a list of data frames is a powerful way of applying functions to each data frame all at once. However, be cautious when using this approach with very large datasets, as it can lead to high memory usage.

A simple way to determine if your data is too big to do this is to check if your data fits into your own computer’s or a server’s memory on its own. If it can’t, there are other approaches to use to do similar things. More likely, if your data is so large, you have many other problems and challenges to deal with than what we cover in this workshop.

Using group_split() in our case looks like this (we’re only showing the first two items on the website):

lipidomics |>
  group_split(metabolite)

<list_of<
  tbl_df<
    code      : character
    gender    : character
    age       : double
    class     : character
    metabolite: character
    value     : double
  >
>[2]>
[[1]]
# A tibble: 36 × 6
   code   gender   age class metabolite      value
   <chr>  <chr>  <dbl> <chr> <chr>           <dbl>
 1 ERI109 M         25 CT    CDCl3 (solvent) 166. 
 2 ERI111 M         39 CT    CDCl3 (solvent) 171. 
 3 ERI140 M         25 T1D   CDCl3 (solvent)  84.6
 4 ERI142 M         38 T1D   CDCl3 (solvent)  70.8
 5 ERI143 M         43 T1D   CDCl3 (solvent) 204. 
 6 ERI144 M         35 T1D   CDCl3 (solvent) 108. 
 7 ERI145 M         36 T1D   CDCl3 (solvent) 133. 
 8 ERI146 M         38 T1D   CDCl3 (solvent) 110. 
 9 ERI147 W         37 T1D   CDCl3 (solvent) 161. 
10 ERI149 W         35 T1D   CDCl3 (solvent) 167. 
# ℹ 26 more rows

[[2]]
# A tibble: 36 × 6
   code   gender   age class metabolite  value
   <chr>  <chr>  <dbl> <chr> <chr>       <dbl>
 1 ERI109 M         25 CT    Cholesterol 18.6 
 2 ERI111 M         39 CT    Cholesterol 20.8 
 3 ERI140 M         25 T1D   Cholesterol 13.9 
 4 ERI142 M         38 T1D   Cholesterol 15.4 
 5 ERI143 M         43 T1D   Cholesterol  8.93
 6 ERI144 M         35 T1D   Cholesterol 18.7 
 7 ERI145 M         36 T1D   Cholesterol 16.8 
 8 ERI146 M         38 T1D   Cholesterol 17.9 
 9 ERI147 W         37 T1D   Cholesterol 15.9 
10 ERI149 W         35 T1D   Cholesterol 15.2 
# ℹ 26 more rows

Complete these tasks so that the code will analyze all metabolites and all models:

Go to the bottom of you docs/learning.qmd file and create a new header and code chunk for this exercise.
In the code chunk, write or copy and paste the code above that uses group_split(metabolite) into that code chunk. Run the code to see how it looks.
Now, use the steps we’ve been building towards, where each data frame gets processed with preprocess() and then fit_all_models(). But instead, each function is used within map() and finally combined into one data frame with list_rbind(). Build up this code step-by-step, running it each time to see what it does and how it looks.
Once you have the code working, style the file with the Palette (Ctrl-Shift-P, then type “style file”). Commit your changes with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”) and push to GitHub.

Click for the solution. Only click if you are struggling or are out of time.

lipidomics |>
  group_split(metabolite) |>
  map(preprocess) |>
  map(fit_all_models) |>
  list_rbind()

Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩

21.6 🧑‍💻 Exercise: Update the `create_model_results()` function and run the pipeline!

Time: ~15 minutes.

The final piece of the puzzle: include this code into the pipeline so we can get the results of all the models! First, open up the R/functions.R file and find the create_model_results() function. It currently only runs the model for cholesterol, as we use filter() in the function. Update this function so that it uses the code from the exercise above to run all the models for all metabolites. The code currently looks like:

R/functions.R

create_model_results <- function(data) {
  data |>
    dplyr::filter(metabolite == "Cholesterol") |>
    preprocess() |>
    fit_model(class ~ value)
}

Revise the function so that it uses group_split(metabolite), map(), and list_rbind() to run all the models for all metabolites.
Style the file with the Palette (Ctrl-Shift-P, then type “style file”).
Once you’ve updated the function, check the targets visualization with the Palette (Ctrl-Shift-P, then type “targets visual”) to see how the pipeline has been changed. Check out the outdated targets with the Palette (Ctrl-Shift-P, then type “targets outdated”).
Then, run the pipeline with the Palette (Ctrl-Shift-P, then type “targets run”) to get the updated model results.
Open the docs/learning.qmd file and go to the bottom of the file. Create a new header there called ## Appendix along with a code chunk that outputs the model_results target with tar_read().
Check the targets pipeline with the Palette (Ctrl-Shift-P, then type “targets visual”) again to see what has changed. Then run the pipeline again with the Palette (Ctrl-Shift-P, then type “targets run”).

Click for the solution. Only click if you are struggling or are out of time.

R/functions.R

create_model_results <- function(data) {
  data |>
    dplyr::group_split(metabolite) |>
    purrr::map(preprocess) |>
    purrr::map(fit_all_models) |>
    purrr::list_rbind()
}

Click for the solution. Only click if you are struggling or are out of time.

docs/learning.qmd

tar_read(model_results)

Sticky/hat up!

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒 🎩

21.7 Summary

Consistently create small functions that do a specific conceptual action, ideally with only one argument, and chain them together into larger conceptual actions, which can then more easily be incorporated into a targets pipeline. Small, multiple functions combined together are easier to manage than fewer, bigger ones.
Use functional programming with map(), as part of the function-oriented workflow, to run multiple models efficiently and with minimal code.

21.8 Code used in session

This lists some, but not all, of the code used in the section. Some code is incorporated into Markdown content, so is harder to automatically list here in a code chunk. The code below also includes the code from the exercises.

use_package("purrr")
just_cholesterol <- lipidomics |>
  filter(metabolite == "Cholesterol") |>
  preprocess()
list(
  class ~ value,
  class ~ value + gender + age
)
list(
  class ~ value,
  class ~ value + gender + age
) |>
  map(\(model) fit_model(just_cholesterol, model = model))
list(
  class ~ value,
  class ~ value + gender + age
) |>
  map(\(model) fit_model(just_cholesterol, model = model)) |>
  list_rbind()
lipidomics |>
  group_split(metabolite)
lipidomics |>
  group_split(metabolite) |>
  map(preprocess) |>
  map(fit_all_models) |>
  list_rbind()

21.1 Learning objectives

21.2 📖 Reading task: Apply logistic regression to each metabolite with functional programming

21.3 💬 Discussion activity: Design a flow for running many models

21.4 Using functional programming to run multiple models

21.5 🧑‍💻 Exercise: Piece it all together to run all models at once!

21.6 🧑‍💻 Exercise: Update the create_model_results() function and run the pipeline!

21.7 Summary

21.8 Code used in session

21.6 🧑‍💻 Exercise: Update the `create_model_results()` function and run the pipeline!