Want to help out or contribute?

If you find any typos, errors, or places where the text may be improved, please let us know by providing feedback either in the feedback survey (given during class) or by using GitHub.

On GitHub open an issue or submit a pull request by clicking the " Edit this page" link at the side of this page.

8  Efficiently running many analyses at once

Before beginning, get them to recall what they remember of the previous session, either with something like Mentimeter or verbally. Preferably something like Mentimeter because it allows everyone to participate, not just the ones who are more comfortable being vocal to the whole group.

Depending on what they write, might need to briefly go over the previous session.

Rarely do we run only one single statistical model to answer one single question, especially in our data-overflowing environments. An initial instinct when faced with this task might be to copy-and-paste, then slightly modify the code each time. Or, if you have heard of loops or used them in other programming languages, you might think to create a loop. Thankfully R uses something more powerful and expressive than either of those approaches, and that is functional programming. Using functional programming concepts, we can use little code to express complex actions and run large numbers of statistical analyses.

Connected to the concept of functional programming, is the idea of resampling a dataset multiple times and running the statistical analysis on each resampled set to calculate a more accurate measure of uncertainty. We often use generic calculations of uncertainty like the standard error or the confidence interval. Those are useful measures especially with very large datasets, however, they have limitations of their own. By making use of resampling, we can identify how uncertain or unreliable a statistical result might be for our specific dataset. This session will be about using functional programming in the context of statistical analysis and learning about other methods of determining uncertainty.

8.1 Learning objectives

The overall objective for this session is to:

  1. Describe the basic framework underlying most statistical analyses and use R to generate statistical results using this framework.

More specific objectives are to:

  1. Recall principles of functional programming and apply them to running statistical analyses by using the purrr package.
  2. Describe what a resampling technique is, the types available, and why it can help estimate the variability of model results. Apply functions from rsample and tune to use these techniques.
  3. Continue applying the concepts and functions used from the previous sessions.

Specific “anti”-objectives:

  • Same as the “anti”-objectives of Chapter 7.

8.2 Exercise: How would we use functional programming to run multiple models?

Time: ~20 minutes.

Functional programming underlies many core features of running statistical methods on data. This exercise is meant for you to review this concept and try to think of it in the context of statistical modeling.

  • For 10 minutes, go to the sections on Function Programming in the Intermediate R course as well as the split-apply-combine and review the concepts.

  • For 8 minutes, discuss with your neighbour how we can use functional programming to apply the statistical model to each metabolite. Try to think how the code would look. You don’t need to write real R code, but if writing pseudocode helps, go right ahead! Also, don’t look ahead 😉

  • For the remaining time, we will discuss our thoughts in the whole group.

After they’ve finished, either write pseudocode in RStudio or draw this out on a whiteboard if it is available. There will probably be several different approaches, many of which could also be implemented just fine. Ultimately we will replace create_recipe_spec(metabolite_...) with create_recipe_spec(starts_with("metabolite_")).

8.3 Apply logistic regression to each metabolite

You may have thought of many different ways to run the model on each metabolite based on the lipidomics_wide dataset. However, these types of “split-apply-combine” tasks are (usually) best done using data in the long form. So we’ll start with the original lipidomics dataset. Create a header and code chunk at the end of the doc/lesson.Rmd file:

## Running multiple models

```{r}

```

The first thing we want to do is convert the metabolite names into snake case:

lipidomics %>%
  column_values_to_snakecase(metabolite)
# A tibble: 504 × 6
   code   gender   age class metabolite                value
   <chr>  <chr>  <dbl> <chr> <chr>                     <dbl>
 1 ERI109 M         25 CT    tms_interntal_standard   208.  
 2 ERI109 M         25 CT    cholesterol               19.8 
 3 ERI109 M         25 CT    lipid_ch_3_1              44.1 
 4 ERI109 M         25 CT    lipid_ch_3_2             147.  
 5 ERI109 M         25 CT    cholesterol               27.2 
 6 ERI109 M         25 CT    lipid_ch_2               587.  
 7 ERI109 M         25 CT    fa_ch_2_ch_2_coo          31.6 
 8 ERI109 M         25 CT    pufa                      29.0 
 9 ERI109 M         25 CT    phosphatidylethanolamine   6.78
10 ERI109 M         25 CT    phosphatidycholine        41.7 
# … with 494 more rows

The next step is to split the data up. We could use group_by(), but in order to make the most use of purrr functions like map(), we will use group_split() to convert the data frame into a set of lists1. Let’s first add purrr as a dependency:

  • 1 There is probably a more computationally efficient way of coding this instead of making a list, but as the saying goes “premature optimization is the root of all evil”. For our purposes, this is a very good approach, but for very large datasets and hundreds of potential models to run, this method would need to be optimized some more.

  • use_package("purrr")

    Then we run group_split() on the metabolite column, which will output a lot of data frames as a list.

    lipidomics %>%
      column_values_to_snakecase(metabolite) %>%
      group_split(metabolite)
    <list_of<
      tbl_df<
        code      : character
        gender    : character
        age       : double
        class     : character
        metabolite: character
        value     : double
      >
    >[12]>
    [[1]]
    # A tibble: 36 × 6
       code   gender   age class metabolite      value
       <chr>  <chr>  <dbl> <chr> <chr>           <dbl>
     1 ERI109 M         25 CT    cd_cl_3_solvent  166.
     2 ERI111 M         39 CT    cd_cl_3_solvent  171.
     3 ERI163 W         58 CT    cd_cl_3_solvent  262.
     4 ERI375 M         24 CT    cd_cl_3_solvent  172.
     5 ERI376 M         26 CT    cd_cl_3_solvent  300.
     6 ERI391 M         31 CT    cd_cl_3_solvent  241.
     7 ERI392 M         24 CT    cd_cl_3_solvent  172.
     8 ERI79  W         26 CT    cd_cl_3_solvent  148.
     9 ERI81  M         52 CT    cd_cl_3_solvent  168.
    10 ERI83  M         25 CT    cd_cl_3_solvent  253.
    # … with 26 more rows
    
    [[2]]
    # A tibble: 108 × 6
       code   gender   age class metabolite  value
       <chr>  <chr>  <dbl> <chr> <chr>       <dbl>
     1 ERI109 M         25 CT    cholesterol 19.8 
     2 ERI109 M         25 CT    cholesterol 27.2 
     3 ERI109 M         25 CT    cholesterol  8.88
     4 ERI111 M         39 CT    cholesterol 22.8 
     5 ERI111 M         39 CT    cholesterol 30.2 
     6 ERI111 M         39 CT    cholesterol  9.28
     7 ERI163 W         58 CT    cholesterol 14.9 
     8 ERI163 W         58 CT    cholesterol 24.0 
     9 ERI163 W         58 CT    cholesterol  7.66
    10 ERI375 M         24 CT    cholesterol 19.2 
    # … with 98 more rows
    
    [[3]]
    # A tibble: 36 × 6
       code   gender   age class metabolite       value
       <chr>  <chr>  <dbl> <chr> <chr>            <dbl>
     1 ERI109 M         25 CT    fa_ch_2_ch_2_coo  31.6
     2 ERI111 M         39 CT    fa_ch_2_ch_2_coo  28.9
     3 ERI163 W         58 CT    fa_ch_2_ch_2_coo  36.6
     4 ERI375 M         24 CT    fa_ch_2_ch_2_coo  39.4
     5 ERI376 M         26 CT    fa_ch_2_ch_2_coo  52.1
     6 ERI391 M         31 CT    fa_ch_2_ch_2_coo  42.8
     7 ERI392 M         24 CT    fa_ch_2_ch_2_coo  39.9
     8 ERI79  W         26 CT    fa_ch_2_ch_2_coo  32.7
     9 ERI81  M         52 CT    fa_ch_2_ch_2_coo  28.4
    10 ERI83  M         25 CT    fa_ch_2_ch_2_coo  26.5
    # … with 26 more rows
    
    [[4]]
    # A tibble: 36 × 6
       code   gender   age class metabolite value
       <chr>  <chr>  <dbl> <chr> <chr>      <dbl>
     1 ERI109 M         25 CT    lipid_ch_2  587.
     2 ERI111 M         39 CT    lipid_ch_2  585.
     3 ERI163 W         58 CT    lipid_ch_2  558.
     4 ERI375 M         24 CT    lipid_ch_2  606.
     5 ERI376 M         26 CT    lipid_ch_2  554.
     6 ERI391 M         31 CT    lipid_ch_2  597.
     7 ERI392 M         24 CT    lipid_ch_2  607.
     8 ERI79  W         26 CT    lipid_ch_2  546.
     9 ERI81  M         52 CT    lipid_ch_2  593.
    10 ERI83  M         25 CT    lipid_ch_2  606.
    # … with 26 more rows
    
    [[5]]
    # A tibble: 36 × 6
       code   gender   age class metabolite   value
       <chr>  <chr>  <dbl> <chr> <chr>        <dbl>
     1 ERI109 M         25 CT    lipid_ch_3_1  44.1
     2 ERI111 M         39 CT    lipid_ch_3_1  28.1
     3 ERI163 W         58 CT    lipid_ch_3_1  75.1
     4 ERI375 M         24 CT    lipid_ch_3_1  22.0
     5 ERI376 M         26 CT    lipid_ch_3_1  29.5
     6 ERI391 M         31 CT    lipid_ch_3_1  38.0
     7 ERI392 M         24 CT    lipid_ch_3_1  34.8
     8 ERI79  W         26 CT    lipid_ch_3_1 109. 
     9 ERI81  M         52 CT    lipid_ch_3_1  49.6
    10 ERI83  M         25 CT    lipid_ch_3_1  29.9
    # … with 26 more rows
    
    [[6]]
    # A tibble: 36 × 6
       code   gender   age class metabolite   value
       <chr>  <chr>  <dbl> <chr> <chr>        <dbl>
     1 ERI109 M         25 CT    lipid_ch_3_2  147.
     2 ERI111 M         39 CT    lipid_ch_3_2  153.
     3 ERI163 W         58 CT    lipid_ch_3_2  144.
     4 ERI375 M         24 CT    lipid_ch_3_2  220.
     5 ERI376 M         26 CT    lipid_ch_3_2  282.
     6 ERI391 M         31 CT    lipid_ch_3_2  220.
     7 ERI392 M         24 CT    lipid_ch_3_2  215.
     8 ERI79  W         26 CT    lipid_ch_3_2  153.
     9 ERI81  M         52 CT    lipid_ch_3_2  150.
    10 ERI83  M         25 CT    lipid_ch_3_2  153.
    # … with 26 more rows
    
    [[7]]
    # A tibble: 36 × 6
       code   gender   age class metabolite  value
       <chr>  <chr>  <dbl> <chr> <chr>       <dbl>
     1 ERI109 M         25 CT    mufa_pufa  50.6  
     2 ERI111 M         39 CT    mufa_pufa  53.2  
     3 ERI163 W         58 CT    mufa_pufa  60.7  
     4 ERI375 M         24 CT    mufa_pufa   0.532
     5 ERI376 M         26 CT    mufa_pufa   1.15 
     6 ERI391 M         31 CT    mufa_pufa   0.602
     7 ERI392 M         24 CT    mufa_pufa   0.422
     8 ERI79  W         26 CT    mufa_pufa  36.3  
     9 ERI81  M         52 CT    mufa_pufa  40.1  
    10 ERI83  M         25 CT    mufa_pufa  39.3  
    # … with 26 more rows
    
    [[8]]
    # A tibble: 36 × 6
       code   gender   age class metabolite         value
       <chr>  <chr>  <dbl> <chr> <chr>              <dbl>
     1 ERI109 M         25 CT    phosphatidycholine  41.7
     2 ERI111 M         39 CT    phosphatidycholine  52.9
     3 ERI163 W         58 CT    phosphatidycholine  35.3
     4 ERI375 M         24 CT    phosphatidycholine  66.9
     5 ERI376 M         26 CT    phosphatidycholine  32.7
     6 ERI391 M         31 CT    phosphatidycholine  62.9
     7 ERI392 M         24 CT    phosphatidycholine  64.3
     8 ERI79  W         26 CT    phosphatidycholine  41.0
     9 ERI81  M         52 CT    phosphatidycholine  56.1
    10 ERI83  M         25 CT    phosphatidycholine  57.8
    # … with 26 more rows
    
    [[9]]
    # A tibble: 36 × 6
       code   gender   age class metabolite               value
       <chr>  <chr>  <dbl> <chr> <chr>                    <dbl>
     1 ERI109 M         25 CT    phosphatidylethanolamine  6.78
     2 ERI111 M         39 CT    phosphatidylethanolamine  3.66
     3 ERI163 W         58 CT    phosphatidylethanolamine  3.59
     4 ERI375 M         24 CT    phosphatidylethanolamine  3.59
     5 ERI376 M         26 CT    phosphatidylethanolamine  2.33
     6 ERI391 M         31 CT    phosphatidylethanolamine  1.46
     7 ERI392 M         24 CT    phosphatidylethanolamine  2.00
     8 ERI79  W         26 CT    phosphatidylethanolamine  4.93
     9 ERI81  M         52 CT    phosphatidylethanolamine  5.20
    10 ERI83  M         25 CT    phosphatidylethanolamine  5.01
    # … with 26 more rows
    
    [[10]]
    # A tibble: 36 × 6
       code   gender   age class metabolite    value
       <chr>  <chr>  <dbl> <chr> <chr>         <dbl>
     1 ERI109 M         25 CT    phospholipids  5.58
     2 ERI111 M         39 CT    phospholipids  6.16
     3 ERI163 W         58 CT    phospholipids  5.19
     4 ERI375 M         24 CT    phospholipids  4.20
     5 ERI376 M         26 CT    phospholipids  3.27
     6 ERI391 M         31 CT    phospholipids  4.71
     7 ERI392 M         24 CT    phospholipids  4.14
     8 ERI79  W         26 CT    phospholipids  5.70
     9 ERI81  M         52 CT    phospholipids  5.46
    10 ERI83  M         25 CT    phospholipids  4.89
    # … with 26 more rows
    
    [[11]]
    # A tibble: 36 × 6
       code   gender   age class metabolite value
       <chr>  <chr>  <dbl> <chr> <chr>      <dbl>
     1 ERI109 M         25 CT    pufa       29.0 
     2 ERI111 M         39 CT    pufa       27.4 
     3 ERI163 W         58 CT    pufa       35.5 
     4 ERI375 M         24 CT    pufa        6.92
     5 ERI376 M         26 CT    pufa        3.22
     6 ERI391 M         31 CT    pufa        3.43
     7 ERI392 M         24 CT    pufa        3.52
     8 ERI79  W         26 CT    pufa       18.7 
     9 ERI81  M         52 CT    pufa       20.7 
    10 ERI83  M         25 CT    pufa       18.2 
    # … with 26 more rows
    
    [[12]]
    # A tibble: 36 × 6
       code   gender   age class metabolite             value
       <chr>  <chr>  <dbl> <chr> <chr>                  <dbl>
     1 ERI109 M         25 CT    tms_interntal_standard 208. 
     2 ERI111 M         39 CT    tms_interntal_standard 219. 
     3 ERI163 W         58 CT    tms_interntal_standard  57.1
     4 ERI375 M         24 CT    tms_interntal_standard  19.2
     5 ERI376 M         26 CT    tms_interntal_standard  35.4
     6 ERI391 M         31 CT    tms_interntal_standard  30.4
     7 ERI392 M         24 CT    tms_interntal_standard  21.7
     8 ERI79  W         26 CT    tms_interntal_standard 185. 
     9 ERI81  M         52 CT    tms_interntal_standard 207. 
    10 ERI83  M         25 CT    tms_interntal_standard 322. 
    # … with 26 more rows

    Remember that logistic regression models need each row to be a single person, so we’ll use the functional map() to apply our custom function metabolites_to_wider() on each of the split list items:

    lipidomics %>%
      column_values_to_snakecase(metabolite) %>%
      group_split(metabolite) %>%
      map(metabolites_to_wider)
    [[1]]
    # A tibble: 36 × 5
       code   gender   age class metabolite_cd_cl_3_solvent
       <chr>  <chr>  <dbl> <chr>                      <dbl>
     1 ERI109 M         25 CT                          166.
     2 ERI111 M         39 CT                          171.
     3 ERI163 W         58 CT                          262.
     4 ERI375 M         24 CT                          172.
     5 ERI376 M         26 CT                          300.
     6 ERI391 M         31 CT                          241.
     7 ERI392 M         24 CT                          172.
     8 ERI79  W         26 CT                          148.
     9 ERI81  M         52 CT                          168.
    10 ERI83  M         25 CT                          253.
    # … with 26 more rows
    
    [[2]]
    # A tibble: 36 × 5
       code   gender   age class metabolite_cholesterol
       <chr>  <chr>  <dbl> <chr>                  <dbl>
     1 ERI109 M         25 CT                     18.6 
     2 ERI111 M         39 CT                     20.8 
     3 ERI163 W         58 CT                     15.5 
     4 ERI375 M         24 CT                     10.2 
     5 ERI376 M         26 CT                     13.5 
     6 ERI391 M         31 CT                      9.53
     7 ERI392 M         24 CT                      9.87
     8 ERI79  W         26 CT                     17.6 
     9 ERI81  M         52 CT                     17.0 
    10 ERI83  M         25 CT                     19.7 
    # … with 26 more rows
    
    [[3]]
    # A tibble: 36 × 5
       code   gender   age class metabolite_fa_ch_2_ch_2_coo
       <chr>  <chr>  <dbl> <chr>                       <dbl>
     1 ERI109 M         25 CT                           31.6
     2 ERI111 M         39 CT                           28.9
     3 ERI163 W         58 CT                           36.6
     4 ERI375 M         24 CT                           39.4
     5 ERI376 M         26 CT                           52.1
     6 ERI391 M         31 CT                           42.8
     7 ERI392 M         24 CT                           39.9
     8 ERI79  W         26 CT                           32.7
     9 ERI81  M         52 CT                           28.4
    10 ERI83  M         25 CT                           26.5
    # … with 26 more rows
    
    [[4]]
    # A tibble: 36 × 5
       code   gender   age class metabolite_lipid_ch_2
       <chr>  <chr>  <dbl> <chr>                 <dbl>
     1 ERI109 M         25 CT                     587.
     2 ERI111 M         39 CT                     585.
     3 ERI163 W         58 CT                     558.
     4 ERI375 M         24 CT                     606.
     5 ERI376 M         26 CT                     554.
     6 ERI391 M         31 CT                     597.
     7 ERI392 M         24 CT                     607.
     8 ERI79  W         26 CT                     546.
     9 ERI81  M         52 CT                     593.
    10 ERI83  M         25 CT                     606.
    # … with 26 more rows
    
    [[5]]
    # A tibble: 36 × 5
       code   gender   age class metabolite_lipid_ch_3_1
       <chr>  <chr>  <dbl> <chr>                   <dbl>
     1 ERI109 M         25 CT                       44.1
     2 ERI111 M         39 CT                       28.1
     3 ERI163 W         58 CT                       75.1
     4 ERI375 M         24 CT                       22.0
     5 ERI376 M         26 CT                       29.5
     6 ERI391 M         31 CT                       38.0
     7 ERI392 M         24 CT                       34.8
     8 ERI79  W         26 CT                      109. 
     9 ERI81  M         52 CT                       49.6
    10 ERI83  M         25 CT                       29.9
    # … with 26 more rows
    
    [[6]]
    # A tibble: 36 × 5
       code   gender   age class metabolite_lipid_ch_3_2
       <chr>  <chr>  <dbl> <chr>                   <dbl>
     1 ERI109 M         25 CT                       147.
     2 ERI111 M         39 CT                       153.
     3 ERI163 W         58 CT                       144.
     4 ERI375 M         24 CT                       220.
     5 ERI376 M         26 CT                       282.
     6 ERI391 M         31 CT                       220.
     7 ERI392 M         24 CT                       215.
     8 ERI79  W         26 CT                       153.
     9 ERI81  M         52 CT                       150.
    10 ERI83  M         25 CT                       153.
    # … with 26 more rows
    
    [[7]]
    # A tibble: 36 × 5
       code   gender   age class metabolite_mufa_pufa
       <chr>  <chr>  <dbl> <chr>                <dbl>
     1 ERI109 M         25 CT                  50.6  
     2 ERI111 M         39 CT                  53.2  
     3 ERI163 W         58 CT                  60.7  
     4 ERI375 M         24 CT                   0.532
     5 ERI376 M         26 CT                   1.15 
     6 ERI391 M         31 CT                   0.602
     7 ERI392 M         24 CT                   0.422
     8 ERI79  W         26 CT                  36.3  
     9 ERI81  M         52 CT                  40.1  
    10 ERI83  M         25 CT                  39.3  
    # … with 26 more rows
    
    [[8]]
    # A tibble: 36 × 5
       code   gender   age class metabolite_phosphatidycholine
       <chr>  <chr>  <dbl> <chr>                         <dbl>
     1 ERI109 M         25 CT                             41.7
     2 ERI111 M         39 CT                             52.9
     3 ERI163 W         58 CT                             35.3
     4 ERI375 M         24 CT                             66.9
     5 ERI376 M         26 CT                             32.7
     6 ERI391 M         31 CT                             62.9
     7 ERI392 M         24 CT                             64.3
     8 ERI79  W         26 CT                             41.0
     9 ERI81  M         52 CT                             56.1
    10 ERI83  M         25 CT                             57.8
    # … with 26 more rows
    
    [[9]]
    # A tibble: 36 × 5
       code   gender   age class metabolite_phosphatidylethanolamine
       <chr>  <chr>  <dbl> <chr>                               <dbl>
     1 ERI109 M         25 CT                                   6.78
     2 ERI111 M         39 CT                                   3.66
     3 ERI163 W         58 CT                                   3.59
     4 ERI375 M         24 CT                                   3.59
     5 ERI376 M         26 CT                                   2.33
     6 ERI391 M         31 CT                                   1.46
     7 ERI392 M         24 CT                                   2.00
     8 ERI79  W         26 CT                                   4.93
     9 ERI81  M         52 CT                                   5.20
    10 ERI83  M         25 CT                                   5.01
    # … with 26 more rows
    
    [[10]]
    # A tibble: 36 × 5
       code   gender   age class metabolite_phospholipids
       <chr>  <chr>  <dbl> <chr>                    <dbl>
     1 ERI109 M         25 CT                        5.58
     2 ERI111 M         39 CT                        6.16
     3 ERI163 W         58 CT                        5.19
     4 ERI375 M         24 CT                        4.20
     5 ERI376 M         26 CT                        3.27
     6 ERI391 M         31 CT                        4.71
     7 ERI392 M         24 CT                        4.14
     8 ERI79  W         26 CT                        5.70
     9 ERI81  M         52 CT                        5.46
    10 ERI83  M         25 CT                        4.89
    # … with 26 more rows
    
    [[11]]
    # A tibble: 36 × 5
       code   gender   age class metabolite_pufa
       <chr>  <chr>  <dbl> <chr>           <dbl>
     1 ERI109 M         25 CT              29.0 
     2 ERI111 M         39 CT              27.4 
     3 ERI163 W         58 CT              35.5 
     4 ERI375 M         24 CT               6.92
     5 ERI376 M         26 CT               3.22
     6 ERI391 M         31 CT               3.43
     7 ERI392 M         24 CT               3.52
     8 ERI79  W         26 CT              18.7 
     9 ERI81  M         52 CT              20.7 
    10 ERI83  M         25 CT              18.2 
    # … with 26 more rows
    
    [[12]]
    # A tibble: 36 × 5
       code   gender   age class metabolite_tms_interntal_standard
       <chr>  <chr>  <dbl> <chr>                             <dbl>
     1 ERI109 M         25 CT                                208. 
     2 ERI111 M         39 CT                                219. 
     3 ERI163 W         58 CT                                 57.1
     4 ERI375 M         24 CT                                 19.2
     5 ERI376 M         26 CT                                 35.4
     6 ERI391 M         31 CT                                 30.4
     7 ERI392 M         24 CT                                 21.7
     8 ERI79  W         26 CT                                185. 
     9 ERI81  M         52 CT                                207. 
    10 ERI83  M         25 CT                                322. 
    # … with 26 more rows

    Alright, we now a list of data frames where each data frame has only one of the metabolites. These bits of code represent the conceptual action of “splitting the data into a list by metabolites”. Since we’re following a function-oriented workflow, let’s create a function for this. Convert it into a function, add the Roxygen comments (have the cursor inside the function, type Ctrl-Shift-P, then type “roxygen”), run styler (Ctrl-Shift-P, then type “style file”), move into the R/functions.R file, and then source() the file.

    #' Convert the long form dataset into a list of wide form data frames.
    #'
    #' @param data The lipidomics dataset.
    #'
    #' @return A list of data frames.
    #'
    split_by_metabolite <- function(data) {
      data %>%
        column_values_to_snakecase(metabolite) %>%
        dplyr::group_split(metabolite) %>%
        purrr::map(metabolites_to_wider)
    }

    In the doc/lesson.Rmd, use the new function in the code:

    lipidomics %>%
      split_by_metabolite()
    [[1]]
    # A tibble: 36 × 5
       code   gender   age class metabolite_cd_cl_3_solvent
       <chr>  <chr>  <dbl> <chr>                      <dbl>
     1 ERI109 M         25 CT                          166.
     2 ERI111 M         39 CT                          171.
     3 ERI163 W         58 CT                          262.
     4 ERI375 M         24 CT                          172.
     5 ERI376 M         26 CT                          300.
     6 ERI391 M         31 CT                          241.
     7 ERI392 M         24 CT                          172.
     8 ERI79  W         26 CT                          148.
     9 ERI81  M         52 CT                          168.
    10 ERI83  M         25 CT                          253.
    # … with 26 more rows
    
    [[2]]
    # A tibble: 36 × 5
       code   gender   age class metabolite_cholesterol
       <chr>  <chr>  <dbl> <chr>                  <dbl>
     1 ERI109 M         25 CT                     18.6 
     2 ERI111 M         39 CT                     20.8 
     3 ERI163 W         58 CT                     15.5 
     4 ERI375 M         24 CT                     10.2 
     5 ERI376 M         26 CT                     13.5 
     6 ERI391 M         31 CT                      9.53
     7 ERI392 M         24 CT                      9.87
     8 ERI79  W         26 CT                     17.6 
     9 ERI81  M         52 CT                     17.0 
    10 ERI83  M         25 CT                     19.7 
    # … with 26 more rows
    
    [[3]]
    # A tibble: 36 × 5
       code   gender   age class metabolite_fa_ch_2_ch_2_coo
       <chr>  <chr>  <dbl> <chr>                       <dbl>
     1 ERI109 M         25 CT                           31.6
     2 ERI111 M         39 CT                           28.9
     3 ERI163 W         58 CT                           36.6
     4 ERI375 M         24 CT                           39.4
     5 ERI376 M         26 CT                           52.1
     6 ERI391 M         31 CT                           42.8
     7 ERI392 M         24 CT                           39.9
     8 ERI79  W         26 CT                           32.7
     9 ERI81  M         52 CT                           28.4
    10 ERI83  M         25 CT                           26.5
    # … with 26 more rows
    
    [[4]]
    # A tibble: 36 × 5
       code   gender   age class metabolite_lipid_ch_2
       <chr>  <chr>  <dbl> <chr>                 <dbl>
     1 ERI109 M         25 CT                     587.
     2 ERI111 M         39 CT                     585.
     3 ERI163 W         58 CT                     558.
     4 ERI375 M         24 CT                     606.
     5 ERI376 M         26 CT                     554.
     6 ERI391 M         31 CT                     597.
     7 ERI392 M         24 CT                     607.
     8 ERI79  W         26 CT                     546.
     9 ERI81  M         52 CT                     593.
    10 ERI83  M         25 CT                     606.
    # … with 26 more rows
    
    [[5]]
    # A tibble: 36 × 5
       code   gender   age class metabolite_lipid_ch_3_1
       <chr>  <chr>  <dbl> <chr>                   <dbl>
     1 ERI109 M         25 CT                       44.1
     2 ERI111 M         39 CT                       28.1
     3 ERI163 W         58 CT                       75.1
     4 ERI375 M         24 CT                       22.0
     5 ERI376 M         26 CT                       29.5
     6 ERI391 M         31 CT                       38.0
     7 ERI392 M         24 CT                       34.8
     8 ERI79  W         26 CT                      109. 
     9 ERI81  M         52 CT                       49.6
    10 ERI83  M         25 CT                       29.9
    # … with 26 more rows
    
    [[6]]
    # A tibble: 36 × 5
       code   gender   age class metabolite_lipid_ch_3_2
       <chr>  <chr>  <dbl> <chr>                   <dbl>
     1 ERI109 M         25 CT                       147.
     2 ERI111 M         39 CT                       153.
     3 ERI163 W         58 CT                       144.
     4 ERI375 M         24 CT                       220.
     5 ERI376 M         26 CT                       282.
     6 ERI391 M         31 CT                       220.
     7 ERI392 M         24 CT                       215.
     8 ERI79  W         26 CT                       153.
     9 ERI81  M         52 CT                       150.
    10 ERI83  M         25 CT                       153.
    # … with 26 more rows
    
    [[7]]
    # A tibble: 36 × 5
       code   gender   age class metabolite_mufa_pufa
       <chr>  <chr>  <dbl> <chr>                <dbl>
     1 ERI109 M         25 CT                  50.6  
     2 ERI111 M         39 CT                  53.2  
     3 ERI163 W         58 CT                  60.7  
     4 ERI375 M         24 CT                   0.532
     5 ERI376 M         26 CT                   1.15 
     6 ERI391 M         31 CT                   0.602
     7 ERI392 M         24 CT                   0.422
     8 ERI79  W         26 CT                  36.3  
     9 ERI81  M         52 CT                  40.1  
    10 ERI83  M         25 CT                  39.3  
    # … with 26 more rows
    
    [[8]]
    # A tibble: 36 × 5
       code   gender   age class metabolite_phosphatidycholine
       <chr>  <chr>  <dbl> <chr>                         <dbl>
     1 ERI109 M         25 CT                             41.7
     2 ERI111 M         39 CT                             52.9
     3 ERI163 W         58 CT                             35.3
     4 ERI375 M         24 CT                             66.9
     5 ERI376 M         26 CT                             32.7
     6 ERI391 M         31 CT                             62.9
     7 ERI392 M         24 CT                             64.3
     8 ERI79  W         26 CT                             41.0
     9 ERI81  M         52 CT                             56.1
    10 ERI83  M         25 CT                             57.8
    # … with 26 more rows
    
    [[9]]
    # A tibble: 36 × 5
       code   gender   age class metabolite_phosphatidylethanolamine
       <chr>  <chr>  <dbl> <chr>                               <dbl>
     1 ERI109 M         25 CT                                   6.78
     2 ERI111 M         39 CT                                   3.66
     3 ERI163 W         58 CT                                   3.59
     4 ERI375 M         24 CT                                   3.59
     5 ERI376 M         26 CT                                   2.33
     6 ERI391 M         31 CT                                   1.46
     7 ERI392 M         24 CT                                   2.00
     8 ERI79  W         26 CT                                   4.93
     9 ERI81  M         52 CT                                   5.20
    10 ERI83  M         25 CT                                   5.01
    # … with 26 more rows
    
    [[10]]
    # A tibble: 36 × 5
       code   gender   age class metabolite_phospholipids
       <chr>  <chr>  <dbl> <chr>                    <dbl>
     1 ERI109 M         25 CT                        5.58
     2 ERI111 M         39 CT                        6.16
     3 ERI163 W         58 CT                        5.19
     4 ERI375 M         24 CT                        4.20
     5 ERI376 M         26 CT                        3.27
     6 ERI391 M         31 CT                        4.71
     7 ERI392 M         24 CT                        4.14
     8 ERI79  W         26 CT                        5.70
     9 ERI81  M         52 CT                        5.46
    10 ERI83  M         25 CT                        4.89
    # … with 26 more rows
    
    [[11]]
    # A tibble: 36 × 5
       code   gender   age class metabolite_pufa
       <chr>  <chr>  <dbl> <chr>           <dbl>
     1 ERI109 M         25 CT              29.0 
     2 ERI111 M         39 CT              27.4 
     3 ERI163 W         58 CT              35.5 
     4 ERI375 M         24 CT               6.92
     5 ERI376 M         26 CT               3.22
     6 ERI391 M         31 CT               3.43
     7 ERI392 M         24 CT               3.52
     8 ERI79  W         26 CT              18.7 
     9 ERI81  M         52 CT              20.7 
    10 ERI83  M         25 CT              18.2 
    # … with 26 more rows
    
    [[12]]
    # A tibble: 36 × 5
       code   gender   age class metabolite_tms_interntal_standard
       <chr>  <chr>  <dbl> <chr>                             <dbl>
     1 ERI109 M         25 CT                                208. 
     2 ERI111 M         39 CT                                219. 
     3 ERI163 W         58 CT                                 57.1
     4 ERI375 M         24 CT                                 19.2
     5 ERI376 M         26 CT                                 35.4
     6 ERI391 M         31 CT                                 30.4
     7 ERI392 M         24 CT                                 21.7
     8 ERI79  W         26 CT                                185. 
     9 ERI81  M         52 CT                                207. 
    10 ERI83  M         25 CT                                322. 
    # … with 26 more rows

    Like we did with the metabolite_to_wider(), we need to pipe the output into another map() function that has a custom function running the models. We don’t have this function yet, so we need to create it. Let’s convert the modeling code we used in the exercise above into a function, replacing lipidomics with data and using starts_with("metabolite_") within the create_recipe_spec(). Add the Roxygen comments (have the cursor inside the function, type Ctrl-Shift-P, then type “roxygen”), run styler (Ctrl-Shift-P, then type “style file”), move into the R/functions.R file, and then source() the file.

    #' Generate the results of a model
    #'
    #' @param data The lipidomics dataset.
    #'
    #' @return A data frame.
    #'
    generate_model_results <- function(data) {
      create_model_workflow(
        parsnip::logistic_reg() %>%
          parsnip::set_engine("glm"),
        data %>%
          create_recipe_spec(tidyselect::starts_with("metabolite_"))
      ) %>%
        parsnip::fit(data) %>%
        tidy_model_output()
    }

    Then we add it to the end of the pipe, but using map_dfr() to convert to a data frame:

    lipidomics %>%
      split_by_metabolite() %>%
      map_dfr(generate_model_results)
    Warning: glm.fit: algorithm did not converge
    Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
    
    Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
    # A tibble: 48 × 5
       term                       estimate std.error statistic p.value
       <chr>                         <dbl>     <dbl>     <dbl>   <dbl>
     1 (Intercept)                  0.855     1.57     -0.0995 0.921  
     2 genderW                      3.18      0.943     1.23   0.220  
     3 age                          0.981     0.0478   -0.412  0.680  
     4 metabolite_cd_cl_3_solvent   0.0870    0.865    -2.82   0.00475
     5 (Intercept)                  1.11      1.29      0.0817 0.935  
     6 genderW                      0.493     0.779    -0.907  0.365  
     7 age                          1.01      0.0377    0.183  0.855  
     8 metabolite_cholesterol       2.97      0.458     2.38   0.0175 
     9 (Intercept)                  0.944     1.19     -0.0481 0.962  
    10 genderW                      1.38      0.746     0.428  0.668  
    # … with 38 more rows

    Since we are only interested in the model results for the metabolites, let’s keep only the term rows that are metabolites using filter() and str_detect().

    model_estimates <- lipidomics %>%
      split_by_metabolite() %>%
      map_dfr(generate_model_results) %>%
      filter(str_detect(term, "metabolite_"))
    Warning: glm.fit: algorithm did not converge
    Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
    
    Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
    model_estimates
    # A tibble: 12 × 5
       term                            estimate std.e…¹ statis…² p.value
       <chr>                              <dbl>   <dbl>    <dbl>   <dbl>
     1 metabolite_cd_cl_3_solvent     8.70e-  2 8.65e-1 -2.82e+0 0.00475
     2 metabolite_cholesterol         2.97e+  0 4.58e-1  2.38e+0 0.0175 
     3 metabolite_fa_ch_2_ch_2_coo    1.52e+  0 3.87e-1  1.09e+0 0.276  
     4 metabolite_lipid_ch_2          2.59e-  3 3.14e+0 -1.90e+0 0.0578 
     5 metabolite_lipid_ch_3_1        4.45e+  1 1.41e+0  2.70e+0 0.00697
     6 metabolite_lipid_ch_3_2        8.85e-  1 3.61e-1 -3.39e-1 0.734  
     7 metabolite_mufa_pufa           4.56e-  1 4.49e-1 -1.75e+0 0.0798 
     8 metabolite_phosphatidycholine  1.28e-120 1.17e+5 -2.37e-3 0.998  
     9 metabolite_phosphatidylethano… 2.69e+  1 1.32e+0  2.49e+0 0.0129 
    10 metabolite_phospholipids       2.39e- 19 6.90e+4 -6.22e-4 1.00   
    11 metabolite_pufa                3.27e+  0 5.60e-1  2.11e+0 0.0345 
    12 metabolite_tms_interntal_stan… 5.62e-  2 9.90e-1 -2.91e+0 0.00363
    # … with abbreviated variable names ¹​std.error, ²​statistic

    Wow! We’re basically at our first targets output! Before continuing, there is one aesthetic thing we can add: The original variable names, rather than the snake case version. Since the original variable still exists in our lipidomics dataset, we can join it to the model_estimates object with right_join(), along with a few other minor changes. First, we’ll select() only the metabolite and then create a duplicate column of metabolite called term (to match the model_estimates) using mutate().

    lipidomics %>%
      select(metabolite) %>%
      mutate(term = metabolite)
    # A tibble: 504 × 2
       metabolite               term                    
       <chr>                    <chr>                   
     1 TMS (interntal standard) TMS (interntal standard)
     2 Cholesterol              Cholesterol             
     3 Lipid CH3- 1             Lipid CH3- 1            
     4 Lipid CH3- 2             Lipid CH3- 2            
     5 Cholesterol              Cholesterol             
     6 Lipid -CH2-              Lipid -CH2-             
     7 FA -CH2CH2COO-           FA -CH2CH2COO-          
     8 PUFA                     PUFA                    
     9 Phosphatidylethanolamine Phosphatidylethanolamine
    10 Phosphatidycholine       Phosphatidycholine      
    # … with 494 more rows

    Right after that we will use our custom column_values_to_snakecase() function on the term column.

    lipidomics %>%
      select(metabolite) %>%
      mutate(term = metabolite) %>%
      column_values_to_snakecase(term)
    # A tibble: 504 × 2
       metabolite               term                    
       <chr>                    <chr>                   
     1 TMS (interntal standard) tms_interntal_standard  
     2 Cholesterol              cholesterol             
     3 Lipid CH3- 1             lipid_ch_3_1            
     4 Lipid CH3- 2             lipid_ch_3_2            
     5 Cholesterol              cholesterol             
     6 Lipid -CH2-              lipid_ch_2              
     7 FA -CH2CH2COO-           fa_ch_2_ch_2_coo        
     8 PUFA                     pufa                    
     9 Phosphatidylethanolamine phosphatidylethanolamine
    10 Phosphatidycholine       phosphatidycholine      
    # … with 494 more rows

    We can see that we are missing the metabolite_ text before each snake case’d name, so we can add that with mutate() and str_c():

    lipidomics %>%
      select(metabolite) %>%
      mutate(term = metabolite) %>%
      column_values_to_snakecase(term) %>%
      mutate(term = str_c("metabolite_", term))
    # A tibble: 504 × 2
       metabolite               term                               
       <chr>                    <chr>                              
     1 TMS (interntal standard) metabolite_tms_interntal_standard  
     2 Cholesterol              metabolite_cholesterol             
     3 Lipid CH3- 1             metabolite_lipid_ch_3_1            
     4 Lipid CH3- 2             metabolite_lipid_ch_3_2            
     5 Cholesterol              metabolite_cholesterol             
     6 Lipid -CH2-              metabolite_lipid_ch_2              
     7 FA -CH2CH2COO-           metabolite_fa_ch_2_ch_2_coo        
     8 PUFA                     metabolite_pufa                    
     9 Phosphatidylethanolamine metabolite_phosphatidylethanolamine
    10 Phosphatidycholine       metabolite_phosphatidycholine      
    # … with 494 more rows

    There are 504 rows, but we only need the unique values of term and metabolite, which we can get with distinct().

    lipidomics %>%
      mutate(term = metabolite) %>%
      column_values_to_snakecase(term) %>%
      mutate(term = str_c("metabolite_", term)) %>%
      distinct(term, metabolite)
    # A tibble: 12 × 2
       metabolite               term                               
       <chr>                    <chr>                              
     1 TMS (interntal standard) metabolite_tms_interntal_standard  
     2 Cholesterol              metabolite_cholesterol             
     3 Lipid CH3- 1             metabolite_lipid_ch_3_1            
     4 Lipid CH3- 2             metabolite_lipid_ch_3_2            
     5 Lipid -CH2-              metabolite_lipid_ch_2              
     6 FA -CH2CH2COO-           metabolite_fa_ch_2_ch_2_coo        
     7 PUFA                     metabolite_pufa                    
     8 Phosphatidylethanolamine metabolite_phosphatidylethanolamine
     9 Phosphatidycholine       metabolite_phosphatidycholine      
    10 Phospholipids            metabolite_phospholipids           
    11 MUFA+PUFA                metabolite_mufa_pufa               
    12 CDCl3 (solvent)          metabolite_cd_cl_3_solvent         

    The last step is to right_join() with the model_estimates:

    lipidomics %>%
      mutate(term = metabolite) %>%
      column_values_to_snakecase(term) %>%
      mutate(term = str_c("metabolite_", term)) %>%
      distinct(term, metabolite) %>%
      right_join(model_estimates, by = "term")
    # A tibble: 12 × 6
       metabolite               term   estimate std.e…¹ statis…² p.value
       <chr>                    <chr>     <dbl>   <dbl>    <dbl>   <dbl>
     1 TMS (interntal standard) meta… 5.62e-  2 9.90e-1 -2.91e+0 0.00363
     2 Cholesterol              meta… 2.97e+  0 4.58e-1  2.38e+0 0.0175 
     3 Lipid CH3- 1             meta… 4.45e+  1 1.41e+0  2.70e+0 0.00697
     4 Lipid CH3- 2             meta… 8.85e-  1 3.61e-1 -3.39e-1 0.734  
     5 Lipid -CH2-              meta… 2.59e-  3 3.14e+0 -1.90e+0 0.0578 
     6 FA -CH2CH2COO-           meta… 1.52e+  0 3.87e-1  1.09e+0 0.276  
     7 PUFA                     meta… 3.27e+  0 5.60e-1  2.11e+0 0.0345 
     8 Phosphatidylethanolamine meta… 2.69e+  1 1.32e+0  2.49e+0 0.0129 
     9 Phosphatidycholine       meta… 1.28e-120 1.17e+5 -2.37e-3 0.998  
    10 Phospholipids            meta… 2.39e- 19 6.90e+4 -6.22e-4 1.00   
    11 MUFA+PUFA                meta… 4.56e-  1 4.49e-1 -1.75e+0 0.0798 
    12 CDCl3 (solvent)          meta… 8.70e-  2 8.65e-1 -2.82e+0 0.00475
    # … with abbreviated variable names ¹​std.error, ²​statistic

    Awesome 😄 Now can you guess what we are going to do next? That’s right, making a function of both the model creation code and this code to add the original variable names. Then we can add our first targets output!

    8.4 Exercise: Creating functions for model results and adding as a target in the pipeline

    Time: ~25 minutes.

    Convert the code that calculates the model estimates as well as the code that adds the original metabolite names into functions. Start with the code for the metabolite names, using the scaffold below as a starting point.

    1. Name the new function add_original_metabolite_names.
    2. Within the function(), add two arguments, where the first is called model_results and the second is called data.
    3. Paste the code we created into the function, replacing lipidomics with data and model_estimates with model_results.
    4. Add dplyr:: and stringr:: before their respective functions.
    5. Add the Roxygen comments (have the cursor inside the function, type Ctrl-Shift-P, then type “roxygen”).
    6. Run styler (Ctrl-Shift-P, then type “style file”) to the file to fix up the code.
    7. Cut and paste the function over into the R/functions.R file.
    8. Commit the changes you’ve made so far.
    ___ <- function(___, ___) {
      ___ %>%
    }
    Click for the solution. Only click if you are struggling or are out of time.
    #' Add the original metabolite names (not as snakecase) to the model results.
    #'
    #' @param model_results The data frame with the model results.
    #' @param data The original lipidomics dataset.
    #'
    #' @return A data frame.
    #'
    add_original_metabolite_names <- function(model_results, data) {
      data %>%
        dplyr::mutate(term = metabolite) %>%
        column_values_to_snakecase(term) %>%
        dplyr::mutate(term = stringr::str_c("metabolite_", term)) %>%
        dplyr::distinct(term, metabolite) %>%
        dplyr::right_join(model_results, by = "term")
    }

    Do the same thing with the code that creates the model results, using the scaffold below as a starting point.

    calculate_estimates <- function(data) {
      ___ %>%
        # All the other code to create the results
        ___ %>% 
        add_original_metabolite_names(data) 
    }
    Click for the solution. Only click if you are struggling or are out of time.
    #' Calculate the estimates for the model for each metabolite.
    #'
    #' @param data The lipidomics dataset.
    #'
    #' @return A data frame.
    #'
    calculate_estimates <- function(data) {
      data %>%
        column_values_to_snakecase(metabolite) %>%
        dplyr::group_split(metabolite) %>%
        purrr::map(metabolites_to_wider) %>%
        purrr::map_dfr(generate_model_results) %>%
        dplyr::filter(stringr::str_detect(term, "metabolite_")) %>%
        add_original_metabolite_names(data)
    }

    Lastly, add the model results output to end of the _targets.R file, using the below scaffold as a guide.

    1. Use df_model_estimates for the name.
    2. Use the calculate_estimates() function in command, with lipidomics as the argument.
    3. Run styler (Ctrl-Shift-P, then type “style file”) and than run targets::tar_visnetwork() (Ctrl-Shift-P, then type “targets visual”) to see if the new target gets detected. If it does, than run targets::tar_make() (Ctrl-Shift-P, then type “targets run”).
    4. Commit the changes to the Git history.
    list(
      ...,
      list(
        name = ___,
        command = ___(___)
      )
    )

    8.5 Visualizing the model estimates

    We’ve got one target done for the modeling stage, three more to go! There are multiple ways of visualizing the results from models. A common approach is to use a “dot-and-whisker” plot like you might see in a meta-analysis. Often the “whisker” part is the measure of uncertainty like the confidence interval, and the “dot” is the estimate. For the confidence interval, we haven’t calculated them at this point because the typical approach doesn’t exactly work for our data (tested before the course). The next section will be covering another way of determining uncertainty. For this plot though, we will use the standard error of the estimate.

    Inside the doc/report.Rmd, let’s create a new header and code chunk inside the ## Results section. We’ll want to use tar_read(df_model_estimates) so that targets is aware that the R Markdown file is dependent on this target.

    ### Figure of model estimates
    
    ```{r}
    model_estimates <- tar_read(df_model_estimates)
    ```

    Then we’ll start using ggplot2 to visualize the results. For dot-whisker plots, the “geom” we would use is called geom_pointrange(). It requires four values:

    • x: This will be the “dot”, representing the estimate column.
    • y: This is the categorical variable that the “dot” is associated with, in this case, it is the metabolite column.
    • xmin: This is the lower end of the “whisker”. Since the std.error is a value representing uncertainty of the estimate on either side of it (+ or -), we will need to subtract std.error from the estimate.
    • xmax: This is the upper end of the “whisker”. Like xmin above, but adding std.error instead.
    plot_estimates <- model_estimates %>% 
      ggplot(aes(
        x = estimate, 
        y = metabolite,
        xmin = estimate - std.error,
        xmax = estimate + std.error
      )) +
      geom_pointrange()
    plot_estimates

    Woah, there is definitely something wrong here. The values of the estimate should realistically be somewhere between 0 (can’t have a negative odds) and 2 (in biology and health research, odds ratios are rarely above 2), definitely unlikely to be more than 5. We will eventually need to troubleshoot this issue, but for now, let’s restrict the x axis to be between 0 and 5.

    plot_estimates +
      coord_fixed(xlim = c(0, 5))

    There are so many things we could start investigating based on these results in order to fix them up. But for now, this will do.

    8.6 Exercise: Add plot function as a target in the pipeline

    Time: ~15 minutes.

    Hopefully you’ve gotten comfortable with the function-oriented workflow, because we’ll need to convert this plot code into a function and add it as a target in the pipeline. Use the scaffold below as a guide.

    1. Replace model_estimates with results.
    2. Move the function into the R/functions.R file, add Roxygen comments (have the cursor inside the function, type Ctrl-Shift-P, then type “roxygen”), and run styler (Ctrl-Shift-P, then type “style file”).
    plot_estimates <- function(results) {
      ___ %>% 
        # Plot code here:
        ___
    }
    Click for the solution. Only click if you are struggling or are out of time.
    #' Plot the estimates and standard errors of the model results.
    #'
    #' @param results The model estimate results.
    #'
    #' @return A ggplot2 figure.
    #'
    plot_estimates <- function(results) {
      results %>%
        ggplot2::ggplot(ggplot2::aes(
          x = estimate, y = metabolite,
          xmin = estimate - std.error,
          xmax = estimate + std.error
        )) +
        ggplot2::geom_pointrange() +
        ggplot2::coord_fixed(xlim = c(0, 5))
    }

    Then, after doing that, add the new function as a target in the pipeline, name the new name as fig_model_estimates.

    list(
      ...,
      tar_target(
        name = ___,
        command = plot_estimates(___)
      )
    )

    And replace all the plot code in the doc/report.Rmd file with the tar_read():

    ```{r}
    tar_read(fig_model_estimates)
    ```

    Run targets::tar_make() (Ctrl-Shift-P, then type “targets run”) to update the pipeline. Then commit the changes to the Git history.

    8.7 Determine variability in model estimates with resampling

    Depending on the type of research questions, there are several ways to assess variability (or uncertainty) in the model results. We could use the calculated standard error of the estimate or calculate the confidence interval from the standard error (using tidy(conf.int = TRUE)). The disadvantage of this approach is that it isn’t very accurate for the data. Plus, when we have such small sample sizes, some issues can limit the use of these typical measures of uncertainty. And we’ve already noticed that there is something strange with the estimates for some of the metabolites.

    So instead, we can use something that is a bit more targeted to the data called “resampling”. There are many resampling techniques, and they all have slightly different uses. The one we will use is called the “bootstrap”2. What bootstrapping does is take the data and randomly resamples it as many times as there are rows. This means you can potentially resample the same data (in our case, person) more than once (called “with replacement”). So by pure chance, you could theoretically have a “resampled set” from lipidomics where all 36 rows are only duplicate data on one person!

  • 2 Another common technique is called “v-fold cross-validation”, which provides a way of assessing how well the model as a whole performs at fitting the data, rather than bootstrap which determines how varied the estimate can be.

  • What’s the advantage of this? It is a way of directly calculating the standard error from the data itself (rather than from a formula), so it gives a more accurate view of how uncertain the model estimate is for our data. Usually, creating between 50 to 100 “resampled sets” is sufficient to calculate a value for the variation. Because running models with bootstrapped sets can take a long time to process, we will only resample 10 times, or less if your computer is slow.

    We’re using the {rsamples} package to handle “resampling”. So let’s add it to our dependency list:

    use_package("rsamples")

    We will eventually run bootstraps on all the metabolites, so we will need to use our split_by_metabolite() function first. For now, we will only use the first item in that list (accessed with [[1]]) to show that the code works without running on all the metabolites every time. Create another code chunk at the bottom of doc/lesson.Rmd to add this code:

    lipidomics_list <- lipidomics %>%
      split_by_metabolite()
    Reading task: ~10 minutes

    You don’t need to run the code in this reading section.

    The bootstraps() function is how we create resampled sets. Since this is done randomly, we should use set.seed() in order for the analysis to be reproducible. Nothing is truly random in computers, and instead is actually “pseudorandom”. In order for our analysis to be reproducible, we use set.seed() to force a specific “pseudorandom” value.

    set.seed(1324)
    bootstraps(lipidomics_list[[1]], times = 10)
    # A tibble: 10 × 2
       splits          id         
       <list>          <chr>      
     1 <split [36/13]> Bootstrap01
     2 <split [36/14]> Bootstrap02
     3 <split [36/12]> Bootstrap03
     4 <split [36/10]> Bootstrap04
     5 <split [36/14]> Bootstrap05
     6 <split [36/15]> Bootstrap06
     7 <split [36/13]> Bootstrap07
     8 <split [36/13]> Bootstrap08
     9 <split [36/12]> Bootstrap09
    10 <split [36/14]> Bootstrap10

    This output is called a “nested tibble”. A nested tibble is a tibble/data frame where one or more of the columns are actually a list object. In our case, each bootstrapped set (marked by the id) has instructions on how the resampled data will look. We can see what it looks like by accessing the splits column and looking at the first item with [[1]]:

    bootstraps(lipidomics_list[[1]], times = 10)$splits[[1]]
    <Analysis/Assess/Total>
    <36/15/36>

    The contents of this resampled set are split into “analysis” sets and “assessment” sets. You don’t need to worry about what these mean or how to use them, since a function we will later use handles it for us. But to give you an idea of what bootstrapping is doing here, we can access one of the sets with either the analysis() or assessment() functions. We’ll arrange() by code to show how we can have duplicate persons when resampling:

    bootstraps(lipidomics_list[[1]], times = 10)$splits[[1]] %>%
      analysis() %>%
      arrange(code)
    # A tibble: 36 × 5
       code   gender   age class metabolite_cd_cl_3_solvent
       <chr>  <chr>  <dbl> <chr>                      <dbl>
     1 ERI140 M         25 T1D                         84.6
     2 ERI140 M         25 T1D                         84.6
     3 ERI142 M         38 T1D                         70.8
     4 ERI143 M         43 T1D                        204. 
     5 ERI144 M         35 T1D                        108. 
     6 ERI144 M         35 T1D                        108. 
     7 ERI144 M         35 T1D                        108. 
     8 ERI145 M         36 T1D                        133. 
     9 ERI145 M         36 T1D                        133. 
    10 ERI145 M         36 T1D                        133. 
    # … with 26 more rows

    See how some code IDs are the same? Those are the same person that has been selected randomly into this resampled set.

    Like we did with the previous modeling, we need to create a workflow object. We’ll use the first metabolite (lipidomics_list[[1]]) for now, but will revise the code to eventually run the bootstrapping on all metabolites.

    workflow_for_bootstrap <- create_model_workflow(
      logistic_reg() %>%
        set_engine("glm"),
      lipidomics_list[[1]] %>%
        create_recipe_spec(starts_with("metabolite_"))
    )

    Previously, we used fit() on the workflow and on the data. Instead, we will use fit_resamples() to run the model on the bootstrapped data. Instead of the data argument in fit(), it is the resamples argument where we provide the bootstraps() sets. We could run the code with only the workflow object and the resampled data, but there’s an extra argument in fit_resamples() that controls some actions taken during the fitting by using the control_resamples() function. For instance, we can save the predictions with save_pred = TRUE and we can process the output with our tidy_model_output() function in the extract argument. So let’s do that. First, both fit_resamples() and control_resamples() come from the tune package, so let’s add it to the dependencies first.

    use_package("tune")

    Now, we can write the code for fit_resamples() on the bootstraps() of the first item in the lipidomics_list and setting the control options with control_resamples().

    bootstrapped_results <- fit_resamples(
      workflow_for_bootstrap,
      resamples = bootstraps(lipidomics_list[[1]], times = 10),
      control = control_resamples(
        extract = tidy_model_output,
        save_pred = TRUE
      )
    )
    bootstrapped_results
    # A tibble: 10 × 6
       splits          id          .metrics .notes   .extracts .predic…¹
       <list>          <chr>       <list>   <list>   <list>    <list>   
     1 <split [36/13]> Bootstrap01 <tibble> <tibble> <tibble>  <tibble> 
     2 <split [36/12]> Bootstrap02 <tibble> <tibble> <tibble>  <tibble> 
     3 <split [36/13]> Bootstrap03 <tibble> <tibble> <tibble>  <tibble> 
     4 <split [36/15]> Bootstrap04 <tibble> <tibble> <tibble>  <tibble> 
     5 <split [36/17]> Bootstrap05 <tibble> <tibble> <tibble>  <tibble> 
     6 <split [36/13]> Bootstrap06 <tibble> <tibble> <tibble>  <tibble> 
     7 <split [36/12]> Bootstrap07 <tibble> <tibble> <tibble>  <tibble> 
     8 <split [36/13]> Bootstrap08 <tibble> <tibble> <tibble>  <tibble> 
     9 <split [36/17]> Bootstrap09 <tibble> <tibble> <tibble>  <tibble> 
    10 <split [36/13]> Bootstrap10 <tibble> <tibble> <tibble>  <tibble> 
    # … with abbreviated variable name ¹​.predictions

    You’ll see that it gives another nested tibble, but with more columns included. Before we start selecting the results that we want, let’s convert the code above into a function, using the function-oriented workflow we’ve used throughout the course.

    #' Generate the model variation results using bootstrap on a single metabolite.
    #'
    #' @param data The lipidomics data.
    #'
    #' @return A nested tibble.
    #'
    generate_model_variation <- function(data) {
      create_model_workflow(
        parsnip::logistic_reg() %>%
          parsnip::set_engine("glm"),
        data %>%
          create_recipe_spec(tidyselect::starts_with("metabolite_"))
      ) %>%
        tune::fit_resamples(
          resamples = rsample::bootstraps(data, times = 10),
          control = tune::control_resamples(
            extract = tidy_model_output,
            save_pred = TRUE
          )
        )
    }

    Re-writing the code to use the function, it becomes:

    bootstrapped_results <- lipidomics_list[[1]] %>%
      generate_model_variation()
    ! Bootstrap08: preprocessor 1/1, model 1/1: glm.fit: fitted probabilities numerically 0 or 1 occurred
    bootstrapped_results
    # A tibble: 10 × 6
       splits          id          .metrics .notes   .extracts .predic…¹
       <list>          <chr>       <list>   <list>   <list>    <list>   
     1 <split [36/15]> Bootstrap01 <tibble> <tibble> <tibble>  <tibble> 
     2 <split [36/13]> Bootstrap02 <tibble> <tibble> <tibble>  <tibble> 
     3 <split [36/15]> Bootstrap03 <tibble> <tibble> <tibble>  <tibble> 
     4 <split [36/13]> Bootstrap04 <tibble> <tibble> <tibble>  <tibble> 
     5 <split [36/15]> Bootstrap05 <tibble> <tibble> <tibble>  <tibble> 
     6 <split [36/15]> Bootstrap06 <tibble> <tibble> <tibble>  <tibble> 
     7 <split [36/10]> Bootstrap07 <tibble> <tibble> <tibble>  <tibble> 
     8 <split [36/13]> Bootstrap08 <tibble> <tibble> <tibble>  <tibble> 
     9 <split [36/14]> Bootstrap09 <tibble> <tibble> <tibble>  <tibble> 
    10 <split [36/13]> Bootstrap10 <tibble> <tibble> <tibble>  <tibble> 
    # … with abbreviated variable name ¹​.predictions

    Let’s explain this output a bit. The fit_resamples() function outputs a nested tibble, where each row is a resampled set. The columns that begin with . (.metrics or .extracts) are extracted details from each model fit to the resampled set. We’ll ignore all but the .extracts column, since that is the column that we set to extract the tidy_model_output(). Let’s select() only the id and the .extracts column and use unnest() to convert the nested tibble to a regular tibble based on the column given.

    bootstrapped_results %>%
      select(id, .extracts) %>%
      unnest(cols = .extracts)
    # A tibble: 10 × 3
       id          .extracts        .config             
       <chr>       <list>           <chr>               
     1 Bootstrap01 <tibble [4 × 5]> Preprocessor1_Model1
     2 Bootstrap02 <tibble [4 × 5]> Preprocessor1_Model1
     3 Bootstrap03 <tibble [4 × 5]> Preprocessor1_Model1
     4 Bootstrap04 <tibble [4 × 5]> Preprocessor1_Model1
     5 Bootstrap05 <tibble [4 × 5]> Preprocessor1_Model1
     6 Bootstrap06 <tibble [4 × 5]> Preprocessor1_Model1
     7 Bootstrap07 <tibble [4 × 5]> Preprocessor1_Model1
     8 Bootstrap08 <tibble [4 × 5]> Preprocessor1_Model1
     9 Bootstrap09 <tibble [4 × 5]> Preprocessor1_Model1
    10 Bootstrap10 <tibble [4 × 5]> Preprocessor1_Model1

    Alright, this is actually another nested tibble (we can see based on the new column .extracts where each row is called a <tibble>). So let’s again unnest() this new .extracts column.

    bootstrapped_results %>%
      select(id, .extracts) %>%
      unnest(cols = .extracts) %>%
      unnest(cols = .extracts)
    # A tibble: 40 × 7
       id          term          estim…¹ std.e…² stati…³ p.value .config
       <chr>       <chr>           <dbl>   <dbl>   <dbl>   <dbl> <chr>  
     1 Bootstrap01 (Intercept)   12.5     1.84     1.37   0.170  Prepro…
     2 Bootstrap01 genderW       15.6     1.10     2.51   0.0122 Prepro…
     3 Bootstrap01 age            0.886   0.0613  -1.97   0.0484 Prepro…
     4 Bootstrap01 metabolite_c…  0.115   0.855   -2.53   0.0113 Prepro…
     5 Bootstrap02 (Intercept)    9.75    1.80     1.27   0.205  Prepro…
     6 Bootstrap02 genderW        3.75    1.16     1.14   0.254  Prepro…
     7 Bootstrap02 age            0.902   0.0656  -1.58   0.115  Prepro…
     8 Bootstrap02 metabolite_c…  0.0243  1.51    -2.46   0.0138 Prepro…
     9 Bootstrap03 (Intercept)    0.0552  1.76    -1.65   0.0995 Prepro…
    10 Bootstrap03 genderW        0.809   0.833   -0.254  0.800  Prepro…
    # … with 30 more rows, and abbreviated variable names ¹​estimate,
    #   ²​std.error, ³​statistic

    Now this is something we are familiar with looking at! It shows the term, the estimate, as well as the bootstrap id. Like before, we only want the metabolite estimate, so we can use filter() and str_detect() like last time, as well as add the original variable names with add_original_metabolite_names().

    bootstrapped_results %>%
      select(id, .extracts) %>%
      unnest(cols = .extracts) %>%
      unnest(cols = .extracts) %>%
      filter(str_detect(term, "metabolite_")) %>%
      add_original_metabolite_names(lipidomics)
    # A tibble: 10 × 8
       metabolite    term  id    estim…¹ std.e…² stati…³ p.value .config
       <chr>         <chr> <chr>   <dbl>   <dbl>   <dbl>   <dbl> <chr>  
     1 CDCl3 (solve… meta… Boot… 1.15e-1   0.855   -2.53 0.0113  Prepro…
     2 CDCl3 (solve… meta… Boot… 2.43e-2   1.51    -2.46 0.0138  Prepro…
     3 CDCl3 (solve… meta… Boot… 2.70e-1   0.776   -1.68 0.0922  Prepro…
     4 CDCl3 (solve… meta… Boot… 5.97e-3   2.42    -2.12 0.0344  Prepro…
     5 CDCl3 (solve… meta… Boot… 1.60e-1   0.779   -2.36 0.0185  Prepro…
     6 CDCl3 (solve… meta… Boot… 7.05e-2   0.988   -2.69 0.00723 Prepro…
     7 CDCl3 (solve… meta… Boot… 1.13e-1   0.847   -2.57 0.0101  Prepro…
     8 CDCl3 (solve… meta… Boot… 7.98e-8  11.6     -1.41 0.159   Prepro…
     9 CDCl3 (solve… meta… Boot… 5.89e-2   1.12    -2.53 0.0115  Prepro…
    10 CDCl3 (solve… meta… Boot… 2.00e-1   0.536   -3.00 0.00268 Prepro…
    # … with abbreviated variable names ¹​estimate, ²​std.error,
    #   ³​statistic

    Using the same workflow as before, let’s convert this into a function:

    #' Tidy up the bootstrap output.
    #'
    #' @param bootstrap_results The bootstrap object with model results.
    #'
    #' @return A data frame.
    #'
    tidy_bootstrap_output <- function(bootstrap_results) {
      bootstrap_results %>%
        dplyr::select(id, .extracts) %>%
        # Need to unnest twice since first `.extracts` is a nest of another two
        # columns of `.extracts` and `.config`.
        tidyr::unnest(cols = .extracts) %>%
        tidyr::unnest(cols = .extracts) %>%
        dplyr::filter(stringr::str_detect(term, "metabolite_")) %>%
        add_original_metabolite_names(lipidomics)
    }

    Then we can start from the beginning again, right from lipidomics, to split_by_metabolite(), to map()’ing with generate_model_variation(), and finally to map_dfr() with tidy_bootstrap_output(). Keep in mind, this will take a while to run.

    metabolites_with_bootstrap_results <- lipidomics %>%
      split_by_metabolite() %>%
      map(generate_model_variation) %>%
      map_dfr(tidy_bootstrap_output)
    metabolites_with_bootstrap_results
    # A tibble: 120 × 8
       metabolite    term  id    estim…¹ std.e…² stati…³ p.value .config
       <chr>         <chr> <chr>   <dbl>   <dbl>   <dbl>   <dbl> <chr>  
     1 CDCl3 (solve… meta… Boot… 0.112     0.870   -2.52 0.0118  Prepro…
     2 CDCl3 (solve… meta… Boot… 0.0208    1.35    -2.88 0.00401 Prepro…
     3 CDCl3 (solve… meta… Boot… 0.0713    0.896   -2.95 0.00321 Prepro…
     4 CDCl3 (solve… meta… Boot… 0.00833   2.36    -2.03 0.0421  Prepro…
     5 CDCl3 (solve… meta… Boot… 0.0475    1.11    -2.74 0.00623 Prepro…
     6 CDCl3 (solve… meta… Boot… 0.0269    1.44    -2.52 0.0118  Prepro…
     7 CDCl3 (solve… meta… Boot… 0.0974    0.804   -2.90 0.00377 Prepro…
     8 CDCl3 (solve… meta… Boot… 0.0153    1.77    -2.36 0.0183  Prepro…
     9 CDCl3 (solve… meta… Boot… 0.0192    1.51    -2.62 0.00873 Prepro…
    10 CDCl3 (solve… meta… Boot… 0.267     0.609   -2.17 0.0300  Prepro…
    # … with 110 more rows, and abbreviated variable names ¹​estimate,
    #   ²​std.error, ³​statistic

    8.8 Exercise: Convert to function and add as a target in the pipeline

    Time: ~15 minutes.

    Continue the workflow we’ve applied throughout the course:

    1. Move the code into a function structure (use the scaffold below as a guide).
    2. Include one argument in the function() called data.
    3. Replace lipidomics in the code with data.
    4. Add the Roxygen comments (have the cursor inside the function, type Ctrl-Shift-P, then type “roxygen”).
    5. Cut and paste the function over into the R/functions.R file.
    6. Run styler (Ctrl-Shift-P, then type “style file”) and lintr::lint_dir() in the R Console (try to fix any issues).
    7. Commit the changes to the Git history.

    Use this code as a guide for the function.

    calculate_variation <- function(___) {
      ___ %>% 
        # Code from above.
        ___
    }
    Click for the solution. Only click if you are struggling or are out of time.
    #' Calculate the uncertainty in results.
    #'
    #' @param data The lipidomics data.
    #'
    #' @return A data frame (or file path)
    #'
    calculate_variation <- function(data) {
      data %>%
        split_by_metabolite() %>%
        purrr::map(generate_model_variation) %>%
        purrr::map_dfr(tidy_bootstrap_output)
    }

    Next, add the function to _targets.R.

    1. Create another tar_target() item in the list() at the bottom of the file.
    2. Use df_model_variation as the name and calculate_variation() as the command with lipidomics as argument.
    3. Run targets::tar_visnetwork() (Ctrl-Shift-P, then type “targets visual”) to see what targets are outdated and then run targets::tar_make() (Ctrl-Shift-P, then type “targets run”).
    4. Commit the changes to the Git history.

    Use this code as a scaffold:

    list(
      ...,
      tar_target(
        name = ___,
        command = ___
      )
    )
    Click for the solution. Only click if you are struggling or are out of time.
    list(
      # ...,
      tar_target(
        name = df_model_variation,
        command = calculate_variation(lipidomics)
      )
    )

    8.9 Visualizing the variability of results

    Take your time explaining why we might use a figure like this, and what you’re trying to show. Since there isn’t much code involved, there is time to explain.

    Like we did with the estimates, let’s visualize the results. Visualizing the estimates was pretty easy, visualizing the variation is even easier. We want to show the range of estimate values across all the bootstrapped models, by metabolite variable. There’s a neat geom in ggplot2 called geom_dotplot() that is similar to a histogram, but instead shows individual data points instead of bars. And since we want to show the variation by metabolite, we can use facet_wrap(). We will use scales = "free" because the range of values for estimate are different for each metabolite.

    metabolites_with_bootstrap_results %>%
      ggplot(aes(x = estimate)) +
      geom_dotplot() +
      facet_wrap(vars(metabolite), scales = "free")
    Bin width defaults to 1/30 of the range of the data. Pick better
    value with `binwidth`.

    This nicely shows the ranges of values in the estimate, really highlighting how uncertain the results are for answering our original research questions. This figure could also be improved quite a bit from a visual and aesthetic point of view, but at least from a content point of view, it shows what we want. For now, we’ll stick with this and finish our last pipeline target before putting everything into the doc/report.Rmd file.

    Let’s use our function workflow with this code:

    1. Create the code as a function, add a data argument, and replace the input data object name with data.
    2. Move the code into the R/functions.R file.
    3. Add the Roxygen comments (have the cursor inside the function, type Ctrl-Shift-P, then type “roxygen”) and fill it in.
    4. Run styler (Ctrl-Shift-P, then type “style file”).
    #' Plot the uncertainty in the estimates of the models.
    #'
    #' @param model_results The model results with the variation.
    #'
    #' @return A ggplot2 image.
    #'
    plot_variation <- function(model_results) {
      model_results %>%
        ggplot2::ggplot(ggplot2::aes(x = estimate)) +
        ggplot2::geom_dotplot() +
        ggplot2::facet_wrap(ggplot2::vars(metabolite), scales = "free")
    }

    Then we’ll add it as a pipeline target to the _targets.R file.

    list(
      ...,
      tar_target(
        name = fig_model_variation,
        command = plot_variation(df_model_variation)
      )
    )

    Run targets::tar_visnetwork() (Ctrl-Shift-P, then type “targets visual”) and then run targets::tar_make() (Ctrl-Shift-P, then type “targets run”). Once everything has been built, commit everything to the Git history.

    8.10 Combine all the output into the R Markdown file

    Now its’ time to add the model results and plots to the doc/report.Rmd file. Open it up and create another code chunk at the bottom of the file. Like we did with the other outputs (like the figures), we’ll use tar_read() to reference the image path.

    ```{r}
    tar_read(fig_model_estimates)
    ```
    
    ```{r}
    tar_read(fig_model_variations)
    ```

    Run targets::tar_visnetwork() (Ctrl-Shift-P, then type “targets visual”), then targets::tar_make() (Ctrl-Shift-P, then type “targets run”). We now have the report rendered to an HTML file! If you open it up in a browser, we can see the figures added to it. In the next session we will get more into making the document output nicer looking and creating it as a website.

    :)

    8.11 Summary

    • Use functional programming with map(), as part of the function-oriented workflow, to run multiple models efficiently and with minimal code.
    • Consistently create small functions that do a specific conceptual action and chain them together into larger conceptual actions, which can then more easily be incorporated into a targets pipeline. Small, multiple functions combined together are easier to manage than fewer, bigger ones.
    • Use dot-whisker plots like geom_pointrange() to visualize the estimates and their standard error.
    • Use resampling techniques like bootstraps(), combined with fit_resamples(), to calculate measures of variation specific to the data. Combine with functionals like map() to run large numbers of models, easily and with minimal code.
    • Visualize variation in data with alternatives to histograms like geom_dotplot().