7  A general approach to doing statistical analyses

When working with data in research, it is almost impossible to avoid needing to do some form of statistical analysis. Running statistical analyses is usually methodical and well-defined, though it often involves trial and error. It involves several steps: choosing the right analysis for your data, transforming your data into the form your chosen statistical analysis needs, and lastly, extracting the results of the analysis so you can present them. Sometimes it may feel overwhelming and complicated, which it definitely can be, but it doesn't have to be. By taking a structured approach, you can make the process easier and feel more in control.

In R, statistical methods are often developed by researchers with little software training, which can make learning new methods challenging. However, having a framework for running analyses can simplify the process.

For this session, we are going to learn how to generally set up and run statistical models, so that we can use them within targets.

7.1 Learning objectives

This session’s overall learning outcome is to:

  1. Describe the basic framework underlying most statistical analyses and use R to generate statistical results using this framework.

Specific objectives are to:

  1. Describe the general β€œworkflow”: State the research question, construct a model, prepare the data, fit the model, and extract the results.
  2. Categorize the model definition step as a distinct, theory-driven step, separate from the data, and use parsnip functions to help with defining your model.
  3. Identify some data transformation techniques and evaluate which are good options given the data. Use functions in the recipes package to apply these transformations to your data.
  4. Use the broom package to extract the model results to later present them in graphical format (with ggplot2).
  5. Continue applying the concepts and functions used from the previous sessions.

Specific β€œanti”-objectives on things that we will not learn:

  • How to choose and apply the appropriate statistical model or tests.
  • Statistical theory.
  • How to determine relevant data transformations for statistical tests.
  • How to interpret the statistical results.

This means that what we show is for demonstration purposes only. Our implementation and choice of model and analyses could be wrong if an expert were to review them.

Warning

We will be making a lot of functions throughout this session and the next. This is just a fair warning!

7.2 πŸ’¬ Discussion activity: What does a β€œmodel” mean?

Time: ~6 minutes.

In science, and especially in statistics, we talk a lot about “models”. But what does “model” actually mean? What different types of definitions can you think of? Is “model” understood differently in statistics compared to other areas?

  1. Take 1 minute to think about your understanding of a model.
  2. Then, over the next 3 minutes, discuss with your neighbour about the meaning of β€œmodel” and see if you can come to a shared understanding.
  3. Finally, over the next 2 minutes, we will share all together what a model is in the context of data analysis.

7.3 A brief introduction to statistical modelling and workflow

Let them read it over, then go over it briefly, focusing on what a model is, that we should create research questions while thinking in the framework of models, and the general workflow for doing statistical analyses.

πŸ“– Reading task: ~15 minutes

Research, especially quantitative research, is about making inferences about the world by quantifying uncertainty from some data. And we quantify uncertainty by using statistics and statistical models.

A statistical model is a simple way to describe the real world using mathematics. In research, we often aim to understand the relationships between multiple variables. For example, if we change the predictor (independent variable or \(x\)), how does that affect the outcome (dependent variable or \(y\))?

Some simple examples are:

  • β€œHow does this new cancer treatment affect survival rates?”
  • β€œHow does this new user interface affect user watch time?”

or more complex:

  • “How do sunlight and water affect plant growth?”
  • β€œWhat is the relationship between a metabolite and type 1 diabetes compared to controls, adjusting for age and gender?”

These relationships can involve single or multiple predictors (such as sunlight and water, or metabolite, gender, and age). The outcomes can be either continuous (such as plant growth) or categorical (such as whether a person has type 1 diabetes or not).

The relationship between predictors and the outcome is known as a regression. If there is a relationship between \(x\) and \(y\), the corresponding mathematical object is a function, \(f()\):

\[y = f(x)\]

Since no relationship is perfect, we add some random error, \(error\), to describe the difference between the function and the actual measured variables:

\[y = f(x) + error\]

The simplest form of regression is linear regression, where the formula is a linear combination:

\[y = intercept + x + error\]

Here, the \(intercept\) is the value of \(y\) when \(x\) is zero.
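To make this concrete, below is a minimal simulated example of a linear regression. This is a sketch for illustration only; the data and numbers are made up:

```r
# Simulate y = intercept + x + error with intercept = 1, slope = 2,
# and some random error, then recover those values with lm():
set.seed(123)
x <- runif(50)
y <- 1 + 2 * x + rnorm(50, sd = 0.2)
lm(y ~ x) # the estimated coefficients should be close to 1 and 2
```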

For example, we might expect a relationship between plant growth, water, and sunlight. We could model plant growth as depending on water and sunlight. Graphically, we can illustrate this model as:

graph LR
    Sunlight --> Growth
    Water --> Growth
    Sunlight --- Water
Figure 7.1: Simple example of a theoretical model of plant growth.

or mathematically:

\[growth = sunlight + water\]

This expression is known as our theoretical model. However, not all theoretical models can be tested against the real world. We need to structure the theoretical model so that we can measure the variables involved (growth, sunlight, and water in the model above). In this case, we need to measure plant growth (e.g., weight in grams), the amount of water given per day (in liters), and the number of sunlight hours per day.
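In R, a theoretical model like this is usually written as a formula, with the outcome on the left of the ~ and the predictors on the right. Below is a hypothetical sketch, assuming a plants data frame with growth, sunlight, and water columns (none of which exist in our course data):

```r
# The theoretical model `growth = sunlight + water` as an R formula:
lm(growth ~ sunlight + water, data = plants)
```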

This also means that when formulating a research question, there is an order, or workflow, to follow:

  1. Write a research question that fits a theoretical model with measurable variables.

  2. Select the best model type based on those variables (e.g., linear regression, logistic regression, ANOVA, or t-test).

  3. Measure your variables (e.g., water in liters per day, plant growth in weight, sunlight in hours per day).

  4. Fit the data to the model to estimate the values (coefficients) of the model, such as the intercept, slope, and uncertainty (error).

  5. Extract and present the values in relation to your research questions.

graph TD
    A[Research question] --> B((Statistical model))
    B --> A
    B --> C[Data collection]
    C --> D[Data transformation]
    D --> E[Scientific output]
    E --> A
Figure 7.2: Simple schematic of the workflow for conducting statistical analysis.
Caution

The entire workflow for building statistical models requires highly specific domain knowledge on not only the statistics themselves, but also how the data was collected, what the values mean, what type of research questions to ask and how to ask them, how to interpret the results from the models, and how to process the data to fit the question and model.

For instance, in our lipidomics dataset, if we were to actually use this data, we would need someone familiar with -omic technologies, how the data are measured, what the values actually mean, how to prepare them for the modeling, the specific modeling methods used for this field, and how we would actually interpret the findings. We have none of these things, so very likely we are doing things quite wrong here. We’re only doing this modeling to highlight how to use the R packages.

Furthermore, because we are working within a “reproducible analysis” framework, specifically with the use of targets, we will also convert questions about our lipidomics data into outputs to include as pipeline targets, along with a basic idea of the final functions that will make up these targets, including their inputs and outputs. These targets will probably look quite different by the end, but it's a good start to think about what the pipeline should eventually produce:

  • All results for estimated relationships (in case we want to use them for other output)
  • Plot of statistical estimate for each relationship

Potential function names might be:

  • calculate_estimates()
  • plot_estimates()
graph TB
    lipidomics -- "calculate_estimates()" --> model_est[Model estimates]
    model_est -- "plot_estimates()" --> plot_est[Plot of estimates]
    plot_est -- "tar_read()" --> qmd[Quarto]
Figure 7.3: Potential inputs, outputs, and functions for the targets pipeline.
Finished? πŸ‘’

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the instructor πŸ‘’ 🎩

A few things to repeat and reinforce:

  1. The workflow of the image and that it all starts with the research question.
  2. The fact that almost all statistical methods are basically special forms of linear regression.
  3. That this model creation stage requires a variety of domain expertise, not just statistical expertise.

Also repeat the question to ask, the outputs, as well as the functions and their flow that we can translate into the targets pipeline.

7.4 Defining a model for our lipidomics data

Let them read it first and then briefly verbally walk through this section, describing the theoretical model both graphically and mathematically. Go through why we use tidymodels rather than other approaches.

πŸ“– Reading task: ~10 minutes

Going back to our own lipidomics dataset, we need to do the first step: Creating the question. While we don’t have much data, there are a surprising number of questions we could ask. But we will keep it very simple, very basic, and very exploratory.

  1. What is the estimated relationship of each metabolite with type 1 diabetes (T1D) compared to the controls, adjusting for the influence of age and gender?

A graphical representation of this theoretical model could be:

graph TB
    Metabolite --> T1D
    Age & Gender --> T1D & Metabolite
Figure 7.4: A simple theoretical model of the research question about T1D status.

Or mathematically:

\[T1D = metabolite + age + gender\]

So, T1D status (or class in the lipidomics dataset) is our outcome and the individual metabolite, age, and gender are our predictors. Technically, age and gender would be β€œconfounders” or β€œcovariates”, since we include them only because we think they influence the relationship between the metabolite and T1D.

If we convert the formula into a form that uses the variables we have in the dataset, and select only one metabolite for now (the cholesterol metabolite, to which we add the “metabolite” prefix to differentiate it from other potential variables), it would be:

\[class = metabolite\_cholesterol + age + gender\]

Now that we have a theoretical model, we need to choose our model type. Since T1D is binary (either you have it or you don't), the most likely choice is logistic regression, which requires a binary outcome variable. So we have the theoretical model and the type of model to use. Now, how do we express this as code in R? There are many ways of doing the same thing in R, but some are a bit easier than others.
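For instance, a base R version of this model might look like the sketch below. It is shown only for comparison, assumes the wide-format lipidomics_wide data frame we will create later, and is not the approach we will use:

```r
# Base R logistic regression, for comparison only.
# glm() needs the outcome as a factor (or 0/1), so class is converted first:
glm(factor(class) ~ metabolite_cholesterol + age + gender,
  data = lipidomics_wide, family = binomial
)
```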

One approach that is quite generic and fits with the ideals of the tidyverse is a similar universe of packages called tidymodels. We teach tidymodels because it is created by experienced software developers at Posit, the same company behind the tidyverse and RStudio, and it is known for excellent documentation.

tidymodels is designed to clearly separate the different parts of the model-building process, with each component of the workflow having its own specific package. This modular approach allows you to change one part of the process, like the model type or the data processing method, without having to learn a completely new set of tools. This design makes tidymodels flexible, user-friendly, and adaptable.

Core packages within tidymodels:

| Package   | Description |
|-----------|-------------|
| parsnip   | Define the model, such as the type (e.g. linear_reg()) and “engine” (e.g. glm()). |
| recipes   | Model-specific data transformations, such as removing missing values or standardizing the data. |
| workflows | Combine the model definition, data, and transformations to calculate the estimates and uncertainty. |

We'll start with the parsnip package. Functions in this package are used to set the details of the model you want to use: a function to indicate the model “type” (e.g. linear regression) and the set_engine() function to determine the “engine” to run the type with (which R-based algorithm to use, like glm() compared to lm()). Check out the Examples page for code you might use depending on the model you want. The most commonly used model types are linear_reg(), logistic_reg(), and multinom_reg().
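As a side note, if you're curious which engines are available for a given model type, parsnip has a helper for listing them (output not shown here):

```r
# List the engines that can run a logistic regression:
parsnip::show_engines("logistic_reg")
```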

Finished? πŸ‘’

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the instructor πŸ‘’ 🎩

Let’s switch to working in the doc/learning.qmd file to create those logistic regression models. First, we need to add library(tidymodels) to the setup code chunk. Copy and paste this code chunk below into the Quarto file.

doc/learning.qmd
```{r setup}
targets::tar_config_set(store = here::here("_targets"))
library(tidyverse)
library(targets)
library(tidymodels)
source(here::here("R/functions.R"))
lipidomics <- tar_read(lipidomics)
```

Since we will be using tidymodels, we need to install it, as well as explicitly add the parsnip, recipes, and workflows packages. Like the tidyverse, tidymodels is a “meta-package”, so we need to add it as a dependency differently. We might also need to force installing it with pak::pak("tidymodels").

Console
use_package("tidymodels", "depends")
# pak::pak("tidymodels")
use_package("parsnip")
use_package("recipes")
use_package("workflows")

Before continuing, let’s commit the changes to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type β€œcommit”). Next, in the doc/learning.qmd file, on the bottom of the document create a new header and code chunk:

doc/learning.qmd
## Building the model

```{r}

```

In the new code chunk, we will set up the model specs:

doc/learning.qmd
log_reg_specs <- logistic_reg() |>
  set_engine("glm")
log_reg_specs
Logistic Regression Model Specification (classification)

Computational engine: glm 

Running this on its own doesn't show much, as you can see. But we've now set the model we want to use.
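As another side note, if you're curious what underlying code parsnip will eventually run for this specification, the translate() function shows the templated call:

```r
# Show the glm() call that parsnip constructs from the model spec:
log_reg_specs |> translate()
```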

7.5 Data transformations specific to modeling

Setting the model type was pretty easy, right? The more difficult part comes next with the data transformations. recipes functions are almost entirely used to apply transformations that a model might specifically need, like mean-centering, removing missing values, and other aspects of data processing.

Let's consider our lipidomics dataset. In order to start our statistical analysis, we need the data structured in a certain way so we can smoothly use it as input in our model. We have at least three necessary transformations: two can be done using a single tidyr function, while the third can be fixed with recipes. Can you spot them?

Console
lipidomics
# A tibble: 504 Γ— 6
   code   gender   age class metabolite                value
   <chr>  <chr>  <dbl> <chr> <chr>                     <dbl>
 1 ERI109 M         25 CT    TMS (interntal standard) 208.  
 2 ERI109 M         25 CT    Cholesterol               19.8 
 3 ERI109 M         25 CT    Lipid CH3- 1              44.1 
 4 ERI109 M         25 CT    Lipid CH3- 2             147.  
 5 ERI109 M         25 CT    Cholesterol               27.2 
 6 ERI109 M         25 CT    Lipid -CH2-              587.  
 7 ERI109 M         25 CT    FA -CH2CH2COO-            31.6 
 8 ERI109 M         25 CT    PUFA                      29.0 
 9 ERI109 M         25 CT    Phosphatidylethanolamine   6.78
10 ERI109 M         25 CT    Phosphatidycholine        41.7 
# β„Ή 494 more rows

Ask them if they can spot the structural data issues. Give them a few minutes to think and respond.

The first issue, which isn’t always an issue, depends heavily on the model type you use. Since we are using logistic regression, the model assumes that each row is an individual person. But our data is in the long format, so each person has multiple rows. The second issue is that there seems to be a data entry error, since there are multiple Cholesterol values, while all other metabolites only have one:

doc/learning.qmd
lipidomics |>
  count(code, metabolite) |>
  filter(n > 1)
# A tibble: 36 Γ— 3
   code   metabolite      n
   <chr>  <chr>       <int>
 1 ERI109 Cholesterol     3
 2 ERI111 Cholesterol     3
 3 ERI140 Cholesterol     3
 4 ERI142 Cholesterol     3
 5 ERI143 Cholesterol     3
 6 ERI144 Cholesterol     3
 7 ERI145 Cholesterol     3
 8 ERI146 Cholesterol     3
 9 ERI147 Cholesterol     3
10 ERI149 Cholesterol     3
# β„Ή 26 more rows

We can fix both the long format and multiple cholesterol issues by using tidyr::pivot_wider(). Before we do, the last issue is that each metabolite has quite large differences in the values and ranges of data. Again, whether this is an issue depends on what we want to do, but in our research question we want to know how each metabolite influences T1D. In order to best interpret the results and compare across metabolites, we should ideally have all the metabolites with a similar range and distribution of values.
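To see these differences in ranges for yourself, a quick check could look like the sketch below (output not shown):

```r
# Compare the range of values across metabolites:
lipidomics |>
  dplyr::group_by(metabolite) |>
  dplyr::summarise(min = min(value), max = max(value))
```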

Let's fix the first two issues, and at the same time tidy up the names of each metabolite so they make better column names. In doc/learning.qmd, we can tidy up the metabolite names using mutate() with snakecase::to_snake_case(). Then, to make our dataset wider, we use pivot_wider(). Since we also want an easy way of identifying columns that are metabolites, we will add a "metabolite_" prefix using the names_prefix argument. To properly fix the multiple cholesterol issue, we would have to look into the data documentation or contact the authors. But for this course, we will merge the values by calculating a mean before pivoting, which we do by setting the values_fn argument to mean.

doc/learning.qmd
lipidomics_wide <- lipidomics |>
  mutate(metabolite = snakecase::to_snake_case(metabolite)) |>
  pivot_wider(
    names_from = metabolite,
    values_from = value,
    values_fn = mean,
    names_prefix = "metabolite_"
  )
lipidomics_wide
# A tibble: 36 Γ— 16
   code   gender   age class metabolite_tms_interntal_standard
   <chr>  <chr>  <dbl> <chr>                             <dbl>
 1 ERI109 M         25 CT                                208. 
 2 ERI111 M         39 CT                                219. 
 3 ERI163 W         58 CT                                 57.1
 4 ERI375 M         24 CT                                 19.2
 5 ERI376 M         26 CT                                 35.4
 6 ERI391 M         31 CT                                 30.4
 7 ERI392 M         24 CT                                 21.7
 8 ERI79  W         26 CT                                185. 
 9 ERI81  M         52 CT                                207. 
10 ERI83  M         25 CT                                322. 
# β„Ή 26 more rows
# β„Ή 11 more variables: metabolite_cholesterol <dbl>,
#   metabolite_lipid_ch_3_1 <dbl>, metabolite_lipid_ch_3_2 <dbl>,
#   metabolite_lipid_ch_2 <dbl>, metabolite_fa_ch_2_ch_2_coo <dbl>,
#   metabolite_pufa <dbl>, metabolite_phosphatidylethanolamine <dbl>,
#   metabolite_phosphatidycholine <dbl>,
#   metabolite_phospholipids <dbl>, metabolite_mufa_pufa <dbl>, …

Since we're using a function-oriented workflow, and since we will be using this code again later on, let's convert both the “metabolite to snakecase” and “pivot to wider” code into their own functions. Let's start writing the functions in doc/learning.qmd (we will later move them over into the R/functions.R file).

doc/learning.qmd
column_values_to_snake_case <- function(data) {
  data |>
    dplyr::mutate(metabolite = snakecase::to_snake_case(metabolite))
}
lipidomics |>
  column_values_to_snake_case()
# A tibble: 504 Γ— 6
   code   gender   age class metabolite                value
   <chr>  <chr>  <dbl> <chr> <chr>                     <dbl>
 1 ERI109 M         25 CT    tms_interntal_standard   208.  
 2 ERI109 M         25 CT    cholesterol               19.8 
 3 ERI109 M         25 CT    lipid_ch_3_1              44.1 
 4 ERI109 M         25 CT    lipid_ch_3_2             147.  
 5 ERI109 M         25 CT    cholesterol               27.2 
 6 ERI109 M         25 CT    lipid_ch_2               587.  
 7 ERI109 M         25 CT    fa_ch_2_ch_2_coo          31.6 
 8 ERI109 M         25 CT    pufa                      29.0 
 9 ERI109 M         25 CT    phosphatidylethanolamine   6.78
10 ERI109 M         25 CT    phosphatidycholine        41.7 
# β„Ή 494 more rows

This on its own should work. However, the column we want to change might not always be called metabolite, or we might want to change it later. So, to make this function a bit more generic, we can use something called β€œcurly-curly” (it looks like {{}} when used) and β€œnon-standard evaluation” (NSE).

πŸ“– Reading task: ~10 minutes

When you write your own functions that make use of functions in the tidyverse, you may eventually encounter an error that isn't very easy to figure out. Here's a very simple example using select(), where one of your function's arguments is meant to select columns (just read, don't write any code):

test_nse <- function(data, columns) {
  data |>
    dplyr::select(columns)
}

lipidomics |>
  test_nse(class)
Error in `dplyr::select()`:
! Can't select columns that don't exist.
βœ– Column `columns` doesn't exist.

The error occurs because of something called “non-standard evaluation” (or NSE). NSE is a major feature of R, used especially heavily in the tidyverse packages. It's one of the first things computer scientists complain about when they use R, because it is not common in other programming languages. But NSE is what allows you to use formulas (e.g. y ~ x + x2 in modeling, which we will show shortly) or to type out select(class, age) or library(purrr). In “standard evaluation”, these would instead be select("class", "age") or library("purrr"). So NSE gives flexibility and ease of use for the user (we don't have to type quotes every time) when doing data analysis, but it can give some headaches when programming in R, like when making functions. There's more detail about this on the dplyr website, which lists some options for handling NSE while programming. The easiest approach is to wrap the argument with “curly-curly” ({{}}).

test_nse <- function(data, columns) {
  data |>
    dplyr::select({{ columns }})
}

lipidomics |>
  test_nse(class)
# A tibble: 504 Γ— 1
   class
   <chr>
 1 CT   
 2 CT   
 3 CT   
 4 CT   
 5 CT   
 6 CT   
 7 CT   
 8 CT   
 9 CT   
10 CT   
# β„Ή 494 more rows
lipidomics |>
  test_nse(c(class, age))
# A tibble: 504 Γ— 2
   class   age
   <chr> <dbl>
 1 CT       25
 2 CT       25
 3 CT       25
 4 CT       25
 5 CT       25
 6 CT       25
 7 CT       25
 8 CT       25
 9 CT       25
10 CT       25
# β„Ή 494 more rows
Finished? πŸ‘’

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the instructor πŸ‘’ 🎩

You don't need to go over what they read; you can continue with making the function below, unless learners have some questions.

We can use curly-curly (combined with across()) to apply snakecase::to_snake_case() to columns of our choice.

doc/learning.qmd
column_values_to_snake_case <- function(data, columns) {
  data |>
    dplyr::mutate(dplyr::across({{ columns }}, snakecase::to_snake_case))
}

lipidomics |>
  column_values_to_snake_case(metabolite)
# A tibble: 504 Γ— 6
   code   gender   age class metabolite                value
   <chr>  <chr>  <dbl> <chr> <chr>                     <dbl>
 1 ERI109 M         25 CT    tms_interntal_standard   208.  
 2 ERI109 M         25 CT    cholesterol               19.8 
 3 ERI109 M         25 CT    lipid_ch_3_1              44.1 
 4 ERI109 M         25 CT    lipid_ch_3_2             147.  
 5 ERI109 M         25 CT    cholesterol               27.2 
 6 ERI109 M         25 CT    lipid_ch_2               587.  
 7 ERI109 M         25 CT    fa_ch_2_ch_2_coo          31.6 
 8 ERI109 M         25 CT    pufa                      29.0 
 9 ERI109 M         25 CT    phosphatidylethanolamine   6.78
10 ERI109 M         25 CT    phosphatidycholine        41.7 
# β„Ή 494 more rows

Move this new function over into the R/functions.R file, add Roxygen documentation with Ctrl-Shift-Alt-R or with the Palette (Ctrl-Shift-P, then type “roxygen comment”), style using the Palette (Ctrl-Shift-P, then type “style file”), and source() the modified R/functions.R file with Ctrl-Shift-S or with the Palette (Ctrl-Shift-P, then type “source”).

R/functions.R
#' Convert a column's character values to snakecase format.
#'
#' @param data The lipidomics dataset.
#' @param columns The column(s) you want to convert into the snakecase format.
#'
#' @return A data frame.
#'
column_values_to_snake_case <- function(data, columns) {
  data |>
    dplyr::mutate(dplyr::across({{ columns }}, snakecase::to_snake_case))
}

Now, replace the mutate() call with our new function (above the pivot_wider() code) in the doc/learning.qmd file.

doc/learning.qmd
lipidomics_wide <- lipidomics |>
  column_values_to_snake_case(metabolite) |>
  pivot_wider(
    names_from = metabolite,
    values_from = value,
    values_fn = mean,
    names_prefix = "metabolite_"
  )
lipidomics_wide
# A tibble: 36 Γ— 16
   code   gender   age class metabolite_tms_interntal_standard
   <chr>  <chr>  <dbl> <chr>                             <dbl>
 1 ERI109 M         25 CT                                208. 
 2 ERI111 M         39 CT                                219. 
 3 ERI163 W         58 CT                                 57.1
 4 ERI375 M         24 CT                                 19.2
 5 ERI376 M         26 CT                                 35.4
 6 ERI391 M         31 CT                                 30.4
 7 ERI392 M         24 CT                                 21.7
 8 ERI79  W         26 CT                                185. 
 9 ERI81  M         52 CT                                207. 
10 ERI83  M         25 CT                                322. 
# β„Ή 26 more rows
# β„Ή 11 more variables: metabolite_cholesterol <dbl>,
#   metabolite_lipid_ch_3_1 <dbl>, metabolite_lipid_ch_3_2 <dbl>,
#   metabolite_lipid_ch_2 <dbl>, metabolite_fa_ch_2_ch_2_coo <dbl>,
#   metabolite_pufa <dbl>, metabolite_phosphatidylethanolamine <dbl>,
#   metabolite_phosphatidycholine <dbl>,
#   metabolite_phospholipids <dbl>, metabolite_mufa_pufa <dbl>, …

7.6 πŸ§‘β€πŸ’» Exercise: Convert the pivot code into a function

Time: ~10 minutes.

Just like with the mutate(), take the pivot_wider() code and convert it into a new function.

  1. Name the new function metabolites_to_wider.
  2. Include one argument in the new function(): data.
  3. Use data |> at the beginning, like we did with the column_values_to_snake_case().
  4. Use tidyr:: before the pivot_wider() function.
  5. Add the Roxygen documentation with Ctrl-Shift-Alt-R or with the Palette (Ctrl-Shift-P, then type β€œroxygen comment”).
  6. Move the function into the R/functions.R file.
  7. Replace the code in the doc/learning.qmd file to make use of the new functions.
Click for the solution. Only click if you are struggling or are out of time.
#' Convert the metabolite long format into a wider one.
#'
#' @param data The lipidomics dataset.
#'
#' @return A wide data frame.
#'
metabolites_to_wider <- function(data) {
  data |>
    tidyr::pivot_wider(
      names_from = metabolite,
      values_from = value,
      values_fn = mean,
      names_prefix = "metabolite_"
    )
}
Finished? πŸ‘’

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the instructor πŸ‘’ 🎩

7.7 Using recipes to manage transformations

Verbally talk through the next few paragraphs, no code-along yet.

We’ve used dplyr and tidyr to start fixing some of the issues with the data. But we still have the third issue: How to make the results between metabolites comparable. That’s where we use recipes.

The first function is recipe(), and it takes two forms: with or without a formula. Remember the model formula we mentioned previously? Well, here is where we can use it: giving recipe() the model formula we intend to use tells recipes which variables to apply the chosen transformations to.

For a few reasons that will be clearer later, we won’t ultimately use the formula form of recipe(), but will show how it works:

recipe(class ~ metabolite_cholesterol + age + gender, data = lipidomics_wide)

The alternative approach is to set “roles” using update_role(). Instead of using a formula and letting recipe() infer the outcome and predictors, we can explicitly select which variables are which. This has some nice features that we will use later on.

doc/learning.qmd
recipe(lipidomics_wide) |>
  update_role(metabolite_cholesterol, age, gender, new_role = "predictor") |>
  update_role(class, new_role = "outcome")
── Recipe ──────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:          1
predictor:        3
undeclared role: 12

Describe this next paragraph by switching to this website and showing the list of some of the steps available in the code chunk below.

The next “step” is to select a transformation function. There are many step_* functions (some shown below) available in recipes, and which one you use depends heavily on your data and your research question. So take your time thinking through which transformations you actually need for your analysis, rather than quickly applying one or more of them.

recipes::step_log()
recipes::step_scale()
recipes::step_normalize()
recipes::step_center()
recipes::step_sqrt()
Tip

There are many useful transformation functions available. For instance, if you often have to impute data, there are functions for that: check them out in the Console by typing recipes::step_impute_ and then hitting the Tab key to see a list of them. Or, if you have some missing values, there's also recipes::step_naomit().

There are many transformations we could use for the lipidomics dataset, but we will use step_normalize() for this course. This function is useful because it centers each variable at zero and scales it so that a value of 1 corresponds to 1 standard deviation of the original distribution. This means we can more easily compare values between variables. We can add this to the end of the recipe:

doc/learning.qmd
recipe(lipidomics_wide) |>
  update_role(metabolite_cholesterol, age, gender, new_role = "predictor") |>
  update_role(class, new_role = "outcome") |>
  step_normalize(starts_with("metabolite_"))
── Recipe ──────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:          1
predictor:        3
undeclared role: 12
── Operations 
β€’ Centering and scaling for: starts_with("metabolite_")
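To build some intuition for what step_normalize() will do when the recipe is eventually applied, it is equivalent to this base R calculation on each selected column:

```r
# Conceptually what step_normalize() does to a single variable:
# subtract the mean, then divide by the standard deviation.
x <- c(10, 20, 30)
(x - mean(x)) / sd(x)
#> [1] -1  0  1
```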

The next thing to do is to convert this into a function, using the same workflow we've been using (which means the function needs to end up in the R/functions.R script). We'll also use curly-curly again, since we might use a different metabolite later. Note that, when adding all the packagename:: prefixes to each function, the starts_with() function comes from the tidyselect package.

R/functions.R
#' A transformation recipe to pre-process the data.
#'
#' @param data The lipidomics dataset.
#' @param metabolite_variable The column of the metabolite variable.
#'
#' @return A recipe object.
#'
create_recipe_spec <- function(data, metabolite_variable) {
  recipes::recipe(data) |>
    recipes::update_role({{ metabolite_variable }}, age, gender, new_role = "predictor") |>
    recipes::update_role(class, new_role = "outcome") |>
    recipes::step_normalize(tidyselect::starts_with("metabolite_"))
}

And test it out:

doc/learning.qmd
recipe_specs <- lipidomics_wide |>
  create_recipe_spec(metabolite_cholesterol)
recipe_specs
── Recipe ──────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:          1
predictor:        3
undeclared role: 12
── Operations 
β€’ Centering and scaling for: tidyselect::starts_with("metabolite_")

Style using the Palette (Ctrl-Shift-P, then type β€œstyle file”), then commit the changes made to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type β€œcommit”).

7.8 Fitting the model by combining the recipe, model definition, and data

We've now defined the model we want to use and created a recipe specification for the transformations. Now we can start putting them together and finally fit them to the data. This is done with the workflows package.

Why use this package, rather than simply run the statistical analysis and process the data as normal? When running multiple models (like we will do in the next section) that may require different data structures, the data transformation steps have to happen right before the data is fit to the model and need to be done on exactly the data used by the model. So if we have one data frame that we run multiple models on, but the transformation happens to the whole data frame, we could end up with issues due to how the transformations were applied. The workflows package keeps track of those things for us, so we can focus on the higher level thinking rather than on the small details of running the models.

The workflows package has a few main functions for combining the recipe with the model specs, as well as several for updating an existing workflow (which might be useful if you need to run many models of slightly different types). All model workflows need to start with workflow(), followed by two main functions: add_model() and add_recipe(). Can you guess what they do?

doc/learning.qmd
workflow() |>
  add_model(log_reg_specs) |>
  add_recipe(recipe_specs)
══ Workflow ════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────
1 Recipe Step

β€’ step_normalize()

── Model ───────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)

Computational engine: glm 

While this code is already pretty concise, let's convert it into a function to simplify it further. We'll use the same function-oriented workflow that we've used before, where the function should ultimately be inside the R/functions.R file.

R/functions.R
#' Create a workflow object of the model and transformations.
#'
#' @param model_specs The model specs
#' @param recipe_specs The recipe specs
#'
#' @return A workflow object
#'
create_model_workflow <- function(model_specs, recipe_specs) {
  workflows::workflow() |>
    workflows::add_model(model_specs) |>
    workflows::add_recipe(recipe_specs)
}

Instead of using the previously created objects, let’s start the model creation from scratch:

doc/learning.qmd
model_workflow <- create_model_workflow(
  logistic_reg() |>
    set_engine("glm"),
  lipidomics_wide |>
    create_recipe_spec(metabolite_cholesterol)
)
model_workflow
══ Workflow ════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────
1 Recipe Step

β€’ step_normalize()

── Model ───────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)

Computational engine: glm 

Now, we can do the final thing: Fitting the data to the model with fit()! πŸ™Œ

doc/learning.qmd
fitted_model <- model_workflow |>
  fit(lipidomics_wide)
fitted_model
══ Workflow [trained] ══════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────
1 Recipe Step

β€’ step_normalize()

── Model ───────────────────────────────────────────────────────────────

Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)

Coefficients:
           (Intercept)                 genderW                     age  
              0.105293               -0.706550                0.006902  
metabolite_cholesterol  
              1.087989  

Degrees of Freedom: 35 Total (i.e. Null);  32 Residual
Null Deviance:      49.91 
Residual Deviance: 42.8     AIC: 50.8

This gives us a lot of information, but what we are mostly interested in is the model estimates themselves. While this fitted_model object contains a lot of additional information inside, workflows thankfully has a function to extract the information we want. In this case, it is the extract_fit_parsnip() function.

doc/learning.qmd
fitted_model |>
  extract_fit_parsnip()
parsnip model object


Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)

Coefficients:
           (Intercept)                 genderW                     age  
              0.105293               -0.706550                0.006902  
metabolite_cholesterol  
              1.087989  

Degrees of Freedom: 35 Total (i.e. Null);  32 Residual
Null Deviance:      49.91 
Residual Deviance: 42.8     AIC: 50.8

To get this information in a tidier format, we use another function: tidy(). This function comes from the broom package, which is part of tidymodels. But we should explicitly add it to the dependencies:

Console
use_package("broom")

Then, we add the tidy() function to our model using the |> pipe. Since we are using a logistic regression model, we need to consider how we want the estimates to be presented, which will probably depend on how we want to visualize our results. If we set exponentiate = TRUE in tidy(), the output estimates will be odds ratios; if we set exponentiate = FALSE, we will get the log odds ratios. Here we choose exponentiate = TRUE:

doc/learning.qmd
fitted_model |>
  extract_fit_parsnip() |>
  tidy(exponentiate = TRUE)
# A tibble: 4 Γ— 5
  term                   estimate std.error statistic p.value
  <chr>                     <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)               1.11     1.29      0.0817  0.935 
2 genderW                   0.493    0.779    -0.907   0.365 
3 age                       1.01     0.0377    0.183   0.855 
4 metabolite_cholesterol    2.97     0.458     2.38    0.0175

We now have a data frame of our model results!
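As a quick check on what exponentiate = TRUE did: the odds ratio is simply the exponential of the log odds ratio (the raw model coefficient). Using the metabolite_cholesterol coefficient from the fitted model output above:

```r
# exp() of the raw coefficient gives the odds ratio seen in the tidy() output:
exp(1.087989)
#> about 2.97, matching the metabolite_cholesterol estimate above
```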

Reinforce not to worry about interpreting the results; that is not the aim of this course.

Another reminder: we are not interpreting these results. For this course, they are not important and can distract from the main purpose.

Like we did when we converted the workflow() code into a function, we do the same thing here: make another function (and move it to R/functions.R)! 😛

R/functions.R
#' Create a tidy output of the model results.
#'
#' @param workflow_fitted_model The model workflow object that has been fitted.
#'
#' @return A data frame.
#'
tidy_model_output <- function(workflow_fitted_model) {
  workflow_fitted_model |>
    workflows::extract_fit_parsnip() |>
    broom::tidy(exponentiate = TRUE)
}

Replace the code in the doc/learning.qmd file to use the new function.

doc/learning.qmd
fitted_model |>
  tidy_model_output()
# A tibble: 4 Γ— 5
  term                   estimate std.error statistic p.value
  <chr>                     <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)               1.11     1.29      0.0817  0.935 
2 genderW                   0.493    0.779    -0.907   0.365 
3 age                       1.01     0.0377    0.183   0.855 
4 metabolite_cholesterol    2.97     0.458     2.38    0.0175

If we revise the code so it is one pipe, it would look like:

doc/learning.qmd
create_model_workflow(
  logistic_reg() |>
    set_engine("glm"),
  lipidomics_wide |>
    create_recipe_spec(metabolite_cholesterol)
) |>
  fit(lipidomics_wide) |>
  tidy_model_output()
# A tibble: 4 Γ— 5
  term                   estimate std.error statistic p.value
  <chr>                     <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)               1.11     1.29      0.0817  0.935 
2 genderW                   0.493    0.779    -0.907   0.365 
3 age                       1.01     0.0377    0.183   0.855 
4 metabolite_cholesterol    2.97     0.458     2.38    0.0175

Let’s briefly cover what these columns and values mean.

If you want, you can go over these details briefly or in more detail, depending on how comfortable you are. Or you can get them to read it only.

πŸ“– Reading task: ~10 minutes

Let's explain this output a bit, one column at a time:

  • term: If you recall the formula \(class = metabolite + age + gender\), you'll see all but the class object there in the term column. This column lists the model terms: the intercept (from the model) plus each of the predictor variables.

  • estimate: This column is the β€œcoefficient” linked to the term in the model. The final mathematical model here looks like:

    \[ \displaylines{class = Intercept + (metabolite\_estimate \times metabolite\_value) + \\ (gender\_estimate \times gender\_value) + ...}\]

    In our example, we chose to get the odds ratios. In the mathematical model above, the estimate is represented as the log odds ratio or beta coefficient: the constant value you multiply the value of the term with. Interpreting each of these values can be quite tricky and can take a surprising amount of time to conceptually break down, so we won't do that here, since this isn't a statistics course. The only thing you need to understand here is that the estimate is the value that tells us the magnitude of the association between the term and class. The estimate, along with the std.error, are the most important values we get from the model, and we will be using them when presenting the results.

  • std.error: This is the uncertainty in the estimate value. A higher value means there is less certainty in the value of the estimate.

  • statistic: The test statistic, which is used to calculate the p.value (see the sketch after this list).

  • p.value: This is the infamous value that we researchers tend to give far more attention than is warranted. The interpretation of the p-value is even more difficult than that of the estimate, and again, we won't cover it in this course. We won't be using this value at all in presenting the results.

Finished? πŸ‘’

When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the instructor πŸ‘’ 🎩

Before ending, open the Git interface and commit the changes you made with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type β€œcommit”). Then push your changes up to GitHub.

7.9 Summary

  • Create research questions that (ideally) are structured in a way to mimic how the statistical analysis will be done, preferably in a β€œformula” style like \(y = x1 + x2 + ... + error\) and in a diagram style with links connecting variables.
  • Statistical analyses, while requiring some trial and error, are surprisingly structured in the workflow and steps taken. Use this structure to help guide you in completing tasks related to running analyses.
  • Use parsnip functions to define the model you want to use, like logistic_reg() for logistic regression, and set the computational β€œengine” with set_engine().
  • Use recipes functions to set up the data transformation steps necessary to effectively run the statistical analysis, like adding variable β€œroles” (outcome vs predictor) using update_roles() and adding transformation steps using any of the dozen different step_ functions.
  • Use workflows functions to develop an analysis workflow() that combines the defined model with add_model(), the transformation steps with add_recipe(), and the data with fit().
  • Use broom to tidy() the model output, extracted using extract_fit_parsnip() to get a data frame of the estimates and standard error for the variables in the model.

This lists some, but not all, of the code used in the section. Some code is incorporated into Markdown content, so is harder to automatically list here in a code chunk. The code below also includes the code from the exercises.

use_package("tidymodels", "depends")
# pak::pak("tidymodels")
use_package("parsnip")
use_package("recipes")
use_package("workflows")
log_reg_specs <- logistic_reg() |>
  set_engine("glm")
log_reg_specs
lipidomics |>
  count(code, metabolite) |>
  filter(n > 1)
lipidomics_wide <- lipidomics |>
  mutate(metabolite = snakecase::to_snake_case(metabolite)) |>
  pivot_wider(
    names_from = metabolite,
    values_from = value,
    values_fn = mean,
    names_prefix = "metabolite_"
  )
lipidomics_wide
column_values_to_snake_case <- function(data) {
  data |>
    dplyr::mutate(metabolite = snakecase::to_snake_case(metabolite))
}
lipidomics |>
  column_values_to_snake_case()
column_values_to_snake_case <- function(data, columns) {
  data |>
    dplyr::mutate(dplyr::across({{ columns }}, snakecase::to_snake_case))
}

lipidomics |>
  column_values_to_snake_case(metabolite)
#' Convert a column's character values to snakecase format.
#'
#' @param data The lipidomics dataset.
#' @param columns The column(s) you want to convert into the snakecase format.
#'
#' @return A data frame.
#'
column_values_to_snake_case <- function(data, columns) {
  data |>
    dplyr::mutate(dplyr::across({{ columns }}, snakecase::to_snake_case))
}
lipidomics_wide <- lipidomics |>
  column_values_to_snake_case(metabolite) |>
  pivot_wider(
    names_from = metabolite,
    values_from = value,
    values_fn = mean,
    names_prefix = "metabolite_"
  )
lipidomics_wide
#' Convert the metabolite long format into a wider one.
#'
#' @param data The lipidomics dataset.
#'
#' @return A wide data frame.
#'
metabolites_to_wider <- function(data) {
  data |>
    tidyr::pivot_wider(
      names_from = metabolite,
      values_from = value,
      values_fn = mean,
      names_prefix = "metabolite_"
    )
}
recipe(class ~ metabolite_cholesterol + age + gender, data = lipidomics_wide)
recipe(lipidomics_wide) |>
  update_role(metabolite_cholesterol, age, gender, new_role = "predictor") |>
  update_role(class, new_role = "outcome")
recipe(lipidomics_wide) |>
  update_role(metabolite_cholesterol, age, gender, new_role = "predictor") |>
  update_role(class, new_role = "outcome") |>
  step_normalize(starts_with("metabolite_"))
#' A transformation recipe to pre-process the data.
#'
#' @param data The lipidomics dataset.
#' @param metabolite_variable The column of the metabolite variable.
#'
#' @return A recipe object.
#'
create_recipe_spec <- function(data, metabolite_variable) {
  recipes::recipe(data) |>
    recipes::update_role({{ metabolite_variable }}, age, gender, new_role = "predictor") |>
    recipes::update_role(class, new_role = "outcome") |>
    recipes::step_normalize(tidyselect::starts_with("metabolite_"))
}
recipe_specs <- lipidomics_wide |>
  create_recipe_spec(metabolite_cholesterol)
recipe_specs
workflow() |>
  add_model(log_reg_specs) |>
  add_recipe(recipe_specs)
#' Create a workflow object of the model and transformations.
#'
#' @param model_specs The model specs
#' @param recipe_specs The recipe specs
#'
#' @return A workflow object
#'
create_model_workflow <- function(model_specs, recipe_specs) {
  workflows::workflow() |>
    workflows::add_model(model_specs) |>
    workflows::add_recipe(recipe_specs)
}
model_workflow <- create_model_workflow(
  logistic_reg() |>
    set_engine("glm"),
  lipidomics_wide |>
    create_recipe_spec(metabolite_cholesterol)
)
model_workflow
fitted_model <- model_workflow |>
  fit(lipidomics_wide)
fitted_model
fitted_model |>
  extract_fit_parsnip()
use_package("broom")
fitted_model |>
  extract_fit_parsnip() |>
  tidy(exponentiate = TRUE)
#' Create a tidy output of the model results.
#'
#' @param workflow_fitted_model The model workflow object that has been fitted.
#'
#' @return A data frame.
#'
tidy_model_output <- function(workflow_fitted_model) {
  workflow_fitted_model |>
    workflows::extract_fit_parsnip() |>
    broom::tidy(exponentiate = TRUE)
}
fitted_model |>
  tidy_model_output()
create_model_workflow(
  logistic_reg() |>
    set_engine("glm"),
  lipidomics_wide |>
    create_recipe_spec(metabolite_cholesterol)
) |>
  fit(lipidomics_wide) |>
  tidy_model_output()