#> # A tibble: 504 × 6
#> code gender age class metabolite value
#> <chr> <chr> <dbl> <chr> <chr> <dbl>
#> 1 ERI109 M 25 CT tms_interntal_standard 208.
#> 2 ERI109 M 25 CT cholesterol 19.8
#> 3 ERI109 M 25 CT lipid_ch_3_1 44.1
#> 4 ERI109 M 25 CT lipid_ch_3_2 147.
#> 5 ERI109 M 25 CT cholesterol 27.2
#> 6 ERI109 M 25 CT lipid_ch_2 587.
#> 7 ERI109 M 25 CT fa_ch_2_ch_2_coo 31.6
#> 8 ERI109 M 25 CT pufa 29.0
#> 9 ERI109 M 25 CT phosphatidylethanolamine 6.78
#> 10 ERI109 M 25 CT phosphatidycholine 41.7
#> # ℹ 494 more rows
8 Efficiently running many analyses at once
Rarely do we run only one single statistical model to answer one single question, especially in our data-overflowing environments. An initial instinct when faced with this task might be to copy-and-paste, then slightly modify the code each time. Or, if you have heard of loops or used them in other programming languages, you might think to create a loop. Thankfully R uses something more powerful and expressive than either of those approaches, and that is functional programming. Using functional programming concepts, we can use little code to express complex actions and run large numbers of statistical analyses. This session will be about using functional programming in the context of statistical analysis.
8.1 Learning objectives
The overall objective for this session is to:
- Describe the basic framework underlying most statistical analyses and use R to generate statistical results using this framework.
More specific objectives are to:
- Recall principles of functional programming and apply them to running statistical analyses by using the purrr package.
- Continue applying the concepts and functions used from the previous sessions.
Specific “anti”-objectives:
- Same as the “anti”-objectives of Chapter 7.
8.2 Apply logistic regression to each metabolite with functional programming
There are many ways that you can run a model on each metabolite based on the lipidomics_wide dataset. However, these types of “split-apply-combine” tasks are (usually) best done using data in the long form. So we’ll start with the original lipidomics dataset. Create a header and code chunk at the end of the doc/learning.qmd file:
The first thing we want to do is convert the metabolite names into snake case:
The next step is to split the data up. We could use group_by(), but in order to make the most use of purrr functions like map(), we will use group_split() to convert the data frame into a list of data frames [1]. Let’s first add purrr as a dependency:
[1] There is probably a more computationally efficient way of coding this instead of making a list, but as the saying goes “premature optimization is the root of all evil”. For our purposes, this is a very good approach, but for very large datasets and hundreds of potential models to run, this method would need to be optimized some more.
Then we run group_split() on the metabolite column, which will output a lot of data frames as a list. The website only shows the first three.
#> <list_of<
#> tbl_df<
#> code : character
#> gender : character
#> age : double
#> class : character
#> metabolite: character
#> value : double
#> >
#> >[3]>
#> [[1]]
#> # A tibble: 36 × 6
#> code gender age class metabolite value
#> <chr> <chr> <dbl> <chr> <chr> <dbl>
#> 1 ERI109 M 25 CT cd_cl_3_solvent 166.
#> 2 ERI111 M 39 CT cd_cl_3_solvent 171.
#> 3 ERI163 W 58 CT cd_cl_3_solvent 262.
#> 4 ERI375 M 24 CT cd_cl_3_solvent 172.
#> 5 ERI376 M 26 CT cd_cl_3_solvent 300.
#> 6 ERI391 M 31 CT cd_cl_3_solvent 241.
#> 7 ERI392 M 24 CT cd_cl_3_solvent 172.
#> 8 ERI79 W 26 CT cd_cl_3_solvent 148.
#> 9 ERI81 M 52 CT cd_cl_3_solvent 168.
#> 10 ERI83 M 25 CT cd_cl_3_solvent 253.
#> # ℹ 26 more rows
#>
#> [[2]]
#> # A tibble: 108 × 6
#> code gender age class metabolite value
#> <chr> <chr> <dbl> <chr> <chr> <dbl>
#> 1 ERI109 M 25 CT cholesterol 19.8
#> 2 ERI109 M 25 CT cholesterol 27.2
#> 3 ERI109 M 25 CT cholesterol 8.88
#> 4 ERI111 M 39 CT cholesterol 22.8
#> 5 ERI111 M 39 CT cholesterol 30.2
#> 6 ERI111 M 39 CT cholesterol 9.28
#> 7 ERI163 W 58 CT cholesterol 14.9
#> 8 ERI163 W 58 CT cholesterol 24.0
#> 9 ERI163 W 58 CT cholesterol 7.66
#> 10 ERI375 M 24 CT cholesterol 19.2
#> # ℹ 98 more rows
#>
#> [[3]]
#> # A tibble: 36 × 6
#> code gender age class metabolite value
#> <chr> <chr> <dbl> <chr> <chr> <dbl>
#> 1 ERI109 M 25 CT fa_ch_2_ch_2_coo 31.6
#> 2 ERI111 M 39 CT fa_ch_2_ch_2_coo 28.9
#> 3 ERI163 W 58 CT fa_ch_2_ch_2_coo 36.6
#> 4 ERI375 M 24 CT fa_ch_2_ch_2_coo 39.4
#> 5 ERI376 M 26 CT fa_ch_2_ch_2_coo 52.1
#> 6 ERI391 M 31 CT fa_ch_2_ch_2_coo 42.8
#> 7 ERI392 M 24 CT fa_ch_2_ch_2_coo 39.9
#> 8 ERI79 W 26 CT fa_ch_2_ch_2_coo 32.7
#> 9 ERI81 M 52 CT fa_ch_2_ch_2_coo 28.4
#> 10 ERI83 M 25 CT fa_ch_2_ch_2_coo 26.5
#> # ℹ 26 more rows
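To see what group_split() does on a smaller scale, here is a minimal sketch. The `toy_long` data frame and its values are made up for illustration and are not the course's lipidomics dataset.

```r
library(dplyr)

# Toy long-form data (hypothetical values, not the course dataset).
toy_long <- tibble(
  code = c("A1", "A1", "B2", "B2"),
  class = c("CT", "CT", "T1D", "T1D"),
  metabolite = c("cholesterol", "pufa", "cholesterol", "pufa"),
  value = c(19.8, 29.0, 22.8, 31.5)
)

# One data frame per unique value of `metabolite`, returned as a list.
toy_split <- toy_long %>%
  group_split(metabolite)

length(toy_split)
toy_split[[1]]
```

Each list item keeps all of the original columns, which is why a later step is still needed to reshape each item to one row per person.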
Remember that logistic regression models need each row to be a single person, so we’ll use the functional map() to apply our custom function metabolites_to_wider() on each of the split list items (only showing the first three):
#> [[1]]
#> # A tibble: 36 × 5
#> code gender age class metabolite_cd_cl_3_solvent
#> <chr> <chr> <dbl> <chr> <dbl>
#> 1 ERI109 M 25 CT 166.
#> 2 ERI111 M 39 CT 171.
#> 3 ERI163 W 58 CT 262.
#> 4 ERI375 M 24 CT 172.
#> 5 ERI376 M 26 CT 300.
#> 6 ERI391 M 31 CT 241.
#> 7 ERI392 M 24 CT 172.
#> 8 ERI79 W 26 CT 148.
#> 9 ERI81 M 52 CT 168.
#> 10 ERI83 M 25 CT 253.
#> # ℹ 26 more rows
#>
#> [[2]]
#> # A tibble: 36 × 5
#> code gender age class metabolite_cholesterol
#> <chr> <chr> <dbl> <chr> <dbl>
#> 1 ERI109 M 25 CT 18.6
#> 2 ERI111 M 39 CT 20.8
#> 3 ERI163 W 58 CT 15.5
#> 4 ERI375 M 24 CT 10.2
#> 5 ERI376 M 26 CT 13.5
#> 6 ERI391 M 31 CT 9.53
#> 7 ERI392 M 24 CT 9.87
#> 8 ERI79 W 26 CT 17.6
#> 9 ERI81 M 52 CT 17.0
#> 10 ERI83 M 25 CT 19.7
#> # ℹ 26 more rows
#>
#> [[3]]
#> # A tibble: 36 × 5
#> code gender age class metabolite_fa_ch_2_ch_2_coo
#> <chr> <chr> <dbl> <chr> <dbl>
#> 1 ERI109 M 25 CT 31.6
#> 2 ERI111 M 39 CT 28.9
#> 3 ERI163 W 58 CT 36.6
#> 4 ERI375 M 24 CT 39.4
#> 5 ERI376 M 26 CT 52.1
#> 6 ERI391 M 31 CT 42.8
#> 7 ERI392 M 24 CT 39.9
#> 8 ERI79 W 26 CT 32.7
#> 9 ERI81 M 52 CT 28.4
#> 10 ERI83 M 25 CT 26.5
#> # ℹ 26 more rows
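The split-then-widen pattern can be sketched on toy data as below. Note that `toy_to_wider()` is a hypothetical stand-in written for this example. It assumes that repeated measurements of one metabolite are averaged per person, which may differ from what the course's metabolites_to_wider() does.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Toy long-form data with repeated cholesterol measurements per person.
toy_long <- tibble(
  code = c("A1", "A1", "A1", "B2", "B2", "B2"),
  metabolite = c(
    "cholesterol", "cholesterol", "pufa",
    "cholesterol", "cholesterol", "pufa"
  ),
  value = c(10, 20, 29, 30, 50, 31)
)

# Hypothetical stand-in for metabolites_to_wider(): one row per person,
# averaging repeated measurements of the same metabolite (an assumption).
toy_to_wider <- function(data) {
  data %>%
    pivot_wider(
      names_from = metabolite,
      values_from = value,
      values_fn = mean,
      names_prefix = "metabolite_"
    )
}

# Split by metabolite, then apply the same function to every list item.
toy_wide_list <- toy_long %>%
  group_split(metabolite) %>%
  map(toy_to_wider)

toy_wide_list[[1]]
```

The key idea is that map() takes the function itself (without parentheses) and calls it once per list item, so no loop or copy-and-paste is needed.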
Alright, we now have a list of data frames where each data frame has only one of the metabolites. These bits of code represent the conceptual action of “splitting the data into a list by metabolites”. Since we’re following a function-oriented workflow, let’s create a function for this. Convert it into a function, add the Roxygen documentation using Ctrl-Shift-Alt-R or with the Palette (Ctrl-Shift-P, then type “roxygen comment”), style using the Palette (Ctrl-Shift-P, then type “style file”), move it into the R/functions.R file, and then source() the file with Ctrl-Shift-S or with the Palette (Ctrl-Shift-P, then type “source”).
#' Convert the long form dataset into a list of wide form data frames.
#'
#' @param data The lipidomics dataset.
#'
#' @return A list of data frames.
#'
split_by_metabolite <- function(data) {
  data %>%
    column_values_to_snake_case(metabolite) %>%
    dplyr::group_split(metabolite) %>%
    purrr::map(metabolites_to_wider)
}
In the doc/learning.qmd, use the new function in the code:
#> [[1]]
#> # A tibble: 36 × 5
#> code gender age class metabolite_cd_cl_3_solvent
#> <chr> <chr> <dbl> <chr> <dbl>
#> 1 ERI109 M 25 CT 166.
#> 2 ERI111 M 39 CT 171.
#> 3 ERI163 W 58 CT 262.
#> 4 ERI375 M 24 CT 172.
#> 5 ERI376 M 26 CT 300.
#> 6 ERI391 M 31 CT 241.
#> 7 ERI392 M 24 CT 172.
#> 8 ERI79 W 26 CT 148.
#> 9 ERI81 M 52 CT 168.
#> 10 ERI83 M 25 CT 253.
#> # ℹ 26 more rows
#>
#> [[2]]
#> # A tibble: 36 × 5
#> code gender age class metabolite_cholesterol
#> <chr> <chr> <dbl> <chr> <dbl>
#> 1 ERI109 M 25 CT 18.6
#> 2 ERI111 M 39 CT 20.8
#> 3 ERI163 W 58 CT 15.5
#> 4 ERI375 M 24 CT 10.2
#> 5 ERI376 M 26 CT 13.5
#> 6 ERI391 M 31 CT 9.53
#> 7 ERI392 M 24 CT 9.87
#> 8 ERI79 W 26 CT 17.6
#> 9 ERI81 M 52 CT 17.0
#> 10 ERI83 M 25 CT 19.7
#> # ℹ 26 more rows
#>
#> [[3]]
#> # A tibble: 36 × 5
#> code gender age class metabolite_fa_ch_2_ch_2_coo
#> <chr> <chr> <dbl> <chr> <dbl>
#> 1 ERI109 M 25 CT 31.6
#> 2 ERI111 M 39 CT 28.9
#> 3 ERI163 W 58 CT 36.6
#> 4 ERI375 M 24 CT 39.4
#> 5 ERI376 M 26 CT 52.1
#> 6 ERI391 M 31 CT 42.8
#> 7 ERI392 M 24 CT 39.9
#> 8 ERI79 W 26 CT 32.7
#> 9 ERI81 M 52 CT 28.4
#> 10 ERI83 M 25 CT 26.5
#> # ℹ 26 more rows
Like we did with metabolites_to_wider(), we need to pipe the output into another map() function that has a custom function running the models. We don’t have this function yet, so we need to create it. Let’s convert the modeling code we used at the end of Section 7.8 into a function, replacing lipidomics with data and using starts_with("metabolite_") within create_recipe_spec(). Add the Roxygen documentation using Ctrl-Shift-Alt-R or with the Palette (Ctrl-Shift-P, then type “roxygen comment”), use the Palette (Ctrl-Shift-P, then type “style file”) to style, move it into the R/functions.R file, and then source() the file with Ctrl-Shift-S or with the Palette (Ctrl-Shift-P, then type “source”).
#' Generate the results of a model
#'
#' @param data The lipidomics dataset.
#'
#' @return A data frame.
#'
generate_model_results <- function(data) {
  create_model_workflow(
    parsnip::logistic_reg() %>%
      parsnip::set_engine("glm"),
    data %>%
      create_recipe_spec(tidyselect::starts_with("metabolite_"))
  ) %>%
    parsnip::fit(data) %>%
    tidy_model_output()
}
Then we add it to the end of the pipe, but using map() and list_rbind() to convert to a data frame:
#> # A tibble: 48 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 0.855 1.57 -0.0995 0.921
#> 2 genderW 3.18 0.943 1.23 0.220
#> 3 age 0.981 0.0478 -0.412 0.680
#> 4 metabolite_cd_cl_3_solvent 0.0870 0.865 -2.82 0.00475
#> 5 (Intercept) 1.11 1.29 0.0817 0.935
#> 6 genderW 0.493 0.779 -0.907 0.365
#> 7 age 1.01 0.0377 0.183 0.855
#> 8 metabolite_cholesterol 2.97 0.458 2.38 0.0175
#> 9 (Intercept) 0.944 1.19 -0.0481 0.962
#> 10 genderW 1.38 0.746 0.428 0.668
#> # ℹ 38 more rows
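The map() then list_rbind() pattern is sketched below in a self-contained form. To keep it runnable without the course's helper functions, it swaps in a plain linear model on the built-in mtcars data rather than logistic regression on the lipidomics data. The helper `fit_and_tidy()` is hypothetical and only exists for this illustration.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: fit a model on one data frame and return the
# estimates as a data frame with one row per model term.
fit_and_tidy <- function(data) {
  model <- lm(mpg ~ wt, data = data)
  tibble(
    term = names(coef(model)),
    estimate = unname(coef(model))
  )
}

# Split into a named list of data frames, fit one model per list item,
# then stack the per-model results into a single data frame.
results <- split(mtcars, mtcars$cyl) %>%
  map(fit_and_tidy) %>%
  list_rbind(names_to = "cyl")

results
```

Because every call to the helper returns a data frame with the same columns, list_rbind() can stack them row-wise. The `names_to` argument is optional and stores the list names in a column so each row can be traced back to its model.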
Since we are only interested in the model results for the metabolites, let’s keep only the term rows that are metabolites using filter() and str_detect().
model_estimates <- lipidomics %>%
  split_by_metabolite() %>%
  map(generate_model_results) %>%
  list_rbind() %>%
  filter(str_detect(term, "metabolite_"))
model_estimates
#> # A tibble: 12 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 metabolite_cd_cl_3_solvent 8.70e- 2 0.865 -2.82 0.00475
#> 2 metabolite_cholesterol 2.97e+ 0 0.458 2.38 0.0175
#> 3 metabolite_fa_ch_2_ch_2_coo 1.52e+ 0 0.387 1.09 0.276
#> 4 metabolite_lipid_ch_2 2.59e- 3 3.14 -1.90 0.0578
#> 5 metabolite_lipid_ch_3_1 4.45e+ 1 1.41 2.70 0.00697
#> 6 metabolite_lipid_ch_3_2 8.85e- 1 0.361 -0.339 0.734
#> 7 metabolite_mufa_pufa 4.56e- 1 0.449 -1.75 0.0798
#> 8 metabolite_phosphatidycholine 1.28e-120 116628. -0.00237 0.998
#> 9 metabolite_phosphatidylethanolamine 2.69e+ 1 1.32 2.49 0.0129
#> 10 metabolite_phospholipids 2.39e- 19 68964. -0.000622 1.00
#> 11 metabolite_pufa 3.27e+ 0 0.560 2.11 0.0345
#> 12 metabolite_tms_interntal_standard 5.62e- 2 0.990 -2.91 0.00363
Wow! We’re basically at our first targets output! Before continuing, there is one aesthetic thing we can add: The original variable names, rather than the snake case version. Since the original variable still exists in our lipidomics dataset, we can join it to the model_estimates object with right_join(), along with a few other minor changes. First, we’ll select() only the metabolite and then create a duplicate column of metabolite called term (to match the model_estimates) using mutate().
#> # A tibble: 504 × 2
#> metabolite term
#> <chr> <chr>
#> 1 TMS (interntal standard) TMS (interntal standard)
#> 2 Cholesterol Cholesterol
#> 3 Lipid CH3- 1 Lipid CH3- 1
#> 4 Lipid CH3- 2 Lipid CH3- 2
#> 5 Cholesterol Cholesterol
#> 6 Lipid -CH2- Lipid -CH2-
#> 7 FA -CH2CH2COO- FA -CH2CH2COO-
#> 8 PUFA PUFA
#> 9 Phosphatidylethanolamine Phosphatidylethanolamine
#> 10 Phosphatidycholine Phosphatidycholine
#> # ℹ 494 more rows
Right after that we will use our custom column_values_to_snake_case() function on the term column.
lipidomics %>%
  select(metabolite) %>%
  mutate(term = metabolite) %>%
  column_values_to_snake_case(term)
#> # A tibble: 504 × 2
#> metabolite term
#> <chr> <chr>
#> 1 TMS (interntal standard) tms_interntal_standard
#> 2 Cholesterol cholesterol
#> 3 Lipid CH3- 1 lipid_ch_3_1
#> 4 Lipid CH3- 2 lipid_ch_3_2
#> 5 Cholesterol cholesterol
#> 6 Lipid -CH2- lipid_ch_2
#> 7 FA -CH2CH2COO- fa_ch_2_ch_2_coo
#> 8 PUFA pufa
#> 9 Phosphatidylethanolamine phosphatidylethanolamine
#> 10 Phosphatidycholine phosphatidycholine
#> # ℹ 494 more rows
We can see that we are missing the metabolite_ text before each snake case’d name, so we can add that with mutate() and str_c():
lipidomics %>%
  select(metabolite) %>%
  mutate(term = metabolite) %>%
  column_values_to_snake_case(term) %>%
  mutate(term = str_c("metabolite_", term))
#> # A tibble: 504 × 2
#> metabolite term
#> <chr> <chr>
#> 1 TMS (interntal standard) metabolite_tms_interntal_standard
#> 2 Cholesterol metabolite_cholesterol
#> 3 Lipid CH3- 1 metabolite_lipid_ch_3_1
#> 4 Lipid CH3- 2 metabolite_lipid_ch_3_2
#> 5 Cholesterol metabolite_cholesterol
#> 6 Lipid -CH2- metabolite_lipid_ch_2
#> 7 FA -CH2CH2COO- metabolite_fa_ch_2_ch_2_coo
#> 8 PUFA metabolite_pufa
#> 9 Phosphatidylethanolamine metabolite_phosphatidylethanolamine
#> 10 Phosphatidycholine metabolite_phosphatidycholine
#> # ℹ 494 more rows
There are 504 rows, but we only need the unique values of term and metabolite, which we can get with distinct(). Because we will use distinct(), we don’t need to use select() as well, since distinct() keeps only the variables we give it, here metabolite and term.
lipidomics %>%
  mutate(term = metabolite) %>%
  column_values_to_snake_case(term) %>%
  mutate(term = str_c("metabolite_", term)) %>%
  distinct(term, metabolite)
#> # A tibble: 12 × 2
#> term metabolite
#> <chr> <chr>
#> 1 metabolite_tms_interntal_standard TMS (interntal standard)
#> 2 metabolite_cholesterol Cholesterol
#> 3 metabolite_lipid_ch_3_1 Lipid CH3- 1
#> 4 metabolite_lipid_ch_3_2 Lipid CH3- 2
#> 5 metabolite_lipid_ch_2 Lipid -CH2-
#> 6 metabolite_fa_ch_2_ch_2_coo FA -CH2CH2COO-
#> 7 metabolite_pufa PUFA
#> 8 metabolite_phosphatidylethanolamine Phosphatidylethanolamine
#> 9 metabolite_phosphatidycholine Phosphatidycholine
#> 10 metabolite_phospholipids Phospholipids
#> 11 metabolite_mufa_pufa MUFA+PUFA
#> 12 metabolite_cd_cl_3_solvent CDCl3 (solvent)
The last step is to right_join() with the model_estimates:
lipidomics %>%
  mutate(term = metabolite) %>%
  column_values_to_snake_case(term) %>%
  mutate(term = str_c("metabolite_", term)) %>%
  distinct(term, metabolite) %>%
  right_join(model_estimates, by = "term")
#> # A tibble: 12 × 6
#> term metabolite estimate std.error statistic p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 metabolite_tms_interntal_st… TMS (inte… 5.62e- 2 9.90e-1 -2.91 0.00363
#> 2 metabolite_cholesterol Cholester… 2.97e+ 0 4.58e-1 2.38 0.0175
#> 3 metabolite_lipid_ch_3_1 Lipid CH3… 4.45e+ 1 1.41e+0 2.70 0.00697
#> 4 metabolite_lipid_ch_3_2 Lipid CH3… 8.85e- 1 3.61e-1 -0.339 0.734
#> 5 metabolite_lipid_ch_2 Lipid -CH… 2.59e- 3 3.14e+0 -1.90 0.0578
#> 6 metabolite_fa_ch_2_ch_2_coo FA -CH2CH… 1.52e+ 0 3.87e-1 1.09 0.276
#> 7 metabolite_pufa PUFA 3.27e+ 0 5.60e-1 2.11 0.0345
#> 8 metabolite_phosphatidyletha… Phosphati… 2.69e+ 1 1.32e+0 2.49 0.0129
#> 9 metabolite_phosphatidycholi… Phosphati… 1.28e-120 1.17e+5 -0.00237 0.998
#> 10 metabolite_phospholipids Phospholi… 2.39e- 19 6.90e+4 -0.000622 1.00
#> 11 metabolite_mufa_pufa MUFA+PUFA 4.56e- 1 4.49e-1 -1.75 0.0798
#> 12 metabolite_cd_cl_3_solvent CDCl3 (so… 8.70e- 2 8.65e-1 -2.82 0.00475
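As a small aside on why right_join() fits here, the sketch below uses two made-up tables (`name_lookup` and `toy_estimates`, both hypothetical). A right join keeps every row of the table given as the second argument, so only terms that actually have model estimates end up in the result.

```r
library(dplyr)

# Hypothetical lookup of snake case terms and their original names.
name_lookup <- tibble(
  term = c(
    "metabolite_cholesterol",
    "metabolite_pufa",
    "metabolite_lipid_ch_2"
  ),
  metabolite = c("Cholesterol", "PUFA", "Lipid -CH2-")
)

# Hypothetical model results for only two of the three terms.
toy_estimates <- tibble(
  term = c("metabolite_pufa", "metabolite_cholesterol"),
  estimate = c(3.27, 2.97)
)

# Keeps all rows of `toy_estimates` and adds the matching original names.
joined <- name_lookup %>%
  right_join(toy_estimates, by = "term")

joined
```

The lookup row without a matching estimate is dropped, while the original name is attached to every row of the results.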
Awesome 😄 Now can you guess what we are going to do next? That’s right, making a function of both the model creation code and this code to add the original variable names. Then we can add our first targets output!
8.3 Exercise: Creating functions for model results and adding as a target in the pipeline
Time: ~25 minutes.
There are two parts to this exercise and two solution chunks.
Part 1: Convert the code that calculates the model estimates as well as the code that adds the original metabolite names into functions. Start with the code for the metabolite names, using the scaffold below as a starting point.
- Name the new function add_original_metabolite_names.
- Within the function(), add two arguments, where the first is called model_results and the second is called data.
- Paste the code we created into the function, replacing lipidomics with data and model_estimates with model_results.
- Add dplyr:: and stringr:: before their respective functions.
- Add the Roxygen documentation using Ctrl-Shift-Alt-R or with the Palette (Ctrl-Shift-P, then type “roxygen comment”).
- Use the Palette (Ctrl-Shift-P, then type “style file”) to style the file to fix up the code.
- Cut and paste the function over into the R/functions.R file.
- Commit the changes you’ve made so far with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).
Click for the solution. Only click if you are struggling or are out of time.
#' Add the original metabolite names (not as snakecase) to the model results.
#'
#' @param model_results The data frame with the model results.
#' @param data The original, unprocessed lipidomics dataset.
#'
#' @return A data frame.
#'
add_original_metabolite_names <- function(model_results, data) {
  data %>%
    dplyr::mutate(term = metabolite) %>%
    column_values_to_snake_case(term) %>%
    dplyr::mutate(term = stringr::str_c("metabolite_", term)) %>%
    dplyr::distinct(term, metabolite) %>%
    dplyr::right_join(model_results, by = "term")
}
Part 2: Do the same thing with the code that creates the model results, using the scaffold below as a starting point.
calculate_estimates <- function(data) {
  ___ %>%
    # All the other code to create the results
    ___ %>%
    add_original_metabolite_names(data)
}
Click for the solution. Only click if you are struggling or are out of time.
#' Calculate the estimates for the model for each metabolite.
#'
#' @param data The lipidomics dataset.
#'
#' @return A data frame.
#'
calculate_estimates <- function(data) {
  data %>%
    split_by_metabolite() %>%
    purrr::map(generate_model_results) %>%
    purrr::list_rbind() %>%
    dplyr::filter(stringr::str_detect(term, "metabolite_")) %>%
    add_original_metabolite_names(data)
}
Lastly, add the model results output to the end of the _targets.R file, using the below scaffold as a guide.
- Use df_model_estimates for the name.
- Use the calculate_estimates() function in command, with lipidomics as the argument.
- Use the Palette (Ctrl-Shift-P, then type “style file”) to style and then run targets::tar_visnetwork() using the Palette (Ctrl-Shift-P, then type “targets visual”) to see if the new target gets detected. If it does, then run targets::tar_make() with the Palette (Ctrl-Shift-P, then type “targets run”).
- Commit the changes to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).
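For orientation, a target defined with the name and command from the list above might look roughly like the sketch below. This is only an assumed shape: the variable `target_list` is hypothetical, and in a real _targets.R file the new tar_target() would be added to the existing list of targets, where lipidomics is an upstream target and calculate_estimates() has been sourced from R/functions.R.

```r
library(targets)

# Hypothetical fragment: in the real `_targets.R`, this tar_target() would be
# appended to the existing list of targets rather than stored in a variable.
target_list <- list(
  tar_target(
    name = df_model_estimates,
    command = calculate_estimates(lipidomics)
  )
)
```

Defining the target does not run the command. The code inside `command` is only evaluated when the pipeline is run with targets::tar_make().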
8.4 Visualizing the model estimates
We’ve got one target done for the modeling stage, three more to go! There are multiple ways of visualizing the results from models. A common approach is to use a “dot-and-whisker” plot like you might see in a meta-analysis. Often the “whisker” part is the measure of uncertainty like the confidence interval, and the “dot” is the estimate. We haven’t calculated confidence intervals at this point because the typical approach doesn’t exactly work for our data (tested before the course). For this plot, though, we will use the standard error of the estimate.
Inside the doc/learning.qmd, let’s create a new header and code chunk inside the ## Results section. We’ll want to use tar_read(df_model_estimates) so that targets is aware that the Quarto file is dependent on this target.
Then we’ll start using ggplot2 to visualize the results. For dot-whisker plots, the “geom” we would use is called geom_pointrange(). It requires four values:
- x: This will be the “dot”, representing the estimate column.
- y: This is the categorical variable that the “dot” is associated with, in this case, it is the metabolite column.
- xmin: This is the lower end of the “whisker”. Since the std.error is a value representing uncertainty of the estimate on either side of it (+ or -), we will need to subtract std.error from the estimate.
- xmax: This is the upper end of the “whisker”. Like xmin above, but adding std.error instead.
plot_estimates <- model_estimates %>%
  ggplot(aes(
    x = estimate,
    y = metabolite,
    xmin = estimate - std.error,
    xmax = estimate + std.error
  )) +
  geom_pointrange()
plot_estimates
Woah, there is definitely something wrong here. The values of the estimate should realistically be somewhere between 0 (odds can’t be negative) and 2 (in biology and health research, odds ratios are rarely above 2), and are very unlikely to be more than 5. We will eventually need to troubleshoot this issue, but for now, let’s restrict the x axis to be between 0 and 5.
There are so many things we could start investigating based on these results in order to fix them up. But for now, this will do.
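One way to restrict the x axis is with coord_fixed(xlim = c(0, 5)), which is the approach used in the exercise solution in the next section. The sketch below applies it to a small made-up data frame (`toy_estimates`, hypothetical, with values loosely based on the table above) so it can be run on its own.

```r
library(ggplot2)

# Hypothetical results table with one estimate far outside the 0 to 5 range.
toy_estimates <- data.frame(
  metabolite = c("Cholesterol", "PUFA", "Lipid CH3- 1"),
  estimate = c(2.97, 3.27, 44.5),
  std.error = c(0.458, 0.560, 1.41)
)

toy_plot <- ggplot(
  toy_estimates,
  aes(
    x = estimate,
    y = metabolite,
    xmin = estimate - std.error,
    xmax = estimate + std.error
  )
) +
  geom_pointrange() +
  # Zoom the x axis to 0-5 without removing rows from the data.
  coord_fixed(xlim = c(0, 5))

toy_plot
```

Setting the limits through the coordinate system only zooms the view, so estimates outside the range are simply not visible rather than being dropped from the data.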
8.5 Exercise: Add plot function as a target in the pipeline
Time: ~15 minutes.
Hopefully you’ve gotten comfortable with the function-oriented workflow, because we’ll need to convert this plot code into a function and add it as a target in the pipeline. Use the scaffold below as a guide.
- Replace model_estimates with results.
- Move the function into the R/functions.R file, add the Roxygen documentation using Ctrl-Shift-Alt-R or with the Palette (Ctrl-Shift-P, then type “roxygen comment”), and use the Palette (Ctrl-Shift-P, then type “style file”) to style.
Click for the solution. Only click if you are struggling or are out of time.
#' Plot the estimates and standard errors of the model results.
#'
#' @param results The model estimate results.
#'
#' @return A ggplot2 figure.
#'
plot_estimates <- function(results) {
  results %>%
    ggplot2::ggplot(ggplot2::aes(
      x = estimate, y = metabolite,
      xmin = estimate - std.error,
      xmax = estimate + std.error
    )) +
    ggplot2::geom_pointrange() +
    ggplot2::coord_fixed(xlim = c(0, 5))
}
Then, after doing that, add the new function as a target in the pipeline, using fig_model_estimates as the name. Inside the plot_estimates() function, use the model estimate target we created previously (df_model_estimates).
8.6 Combine all the output into the Quarto file
Now it’s time to add the model results and plots to the doc/learning.qmd file. Open it up and create another code chunk at the bottom of the file. Like we did with the other outputs (like the figures), we’ll use tar_read() to reference the image path.
Run targets::tar_visnetwork() using the Palette (Ctrl-Shift-P, then type “targets visual”), then targets::tar_make() with the Palette (Ctrl-Shift-P, then type “targets run”). We now have the report rendered to an HTML file! If you open it up in a browser, we can see the figures added to it. Before ending, commit the changes to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).
8.7 Summary
- Use functional programming with map(), as part of the function-oriented workflow, to run multiple models efficiently and with minimal code.
- Consistently create small functions that do a specific conceptual action and chain them together into larger conceptual actions, which can then more easily be incorporated into a targets pipeline. Small, multiple functions combined together are easier to manage than fewer, bigger ones.
- Use dot-whisker plots, made with geom_pointrange(), to visualize the estimates and their standard error.