%%{init:{'theme':'forest', 'flowchart':{'nodeSpacing': 20, 'rankSpacing': 30}}}%% graph LR Sunlight --> Growth Water --> Growth Sunlight --- Water linkStyle 0,1,2 stroke-width:1px;
If you find any typos, errors, or places where the text may be improved, please let us know by providing feedback either in the feedback survey (given during class) or by using GitHub.
On GitHub open an issue or submit a pull request by clicking the " Edit this page" link at the side of this page.
7 A general approach to doing statistical analyses
Running statistical analyses is a relatively methodical and well-defined process, even if there is often a lot of trial and error involved. Sometimes it may feel overwhelming and complicated, which it definitely can be, but it doesn’t have to be. By taking a structured approach to running statistical analyses, you can make it easier on yourself and feel more in control.
In R, statistical methods are often created and developed by researchers with little to no training in software development and who often use them differently. This has some advantages, like having the cutting edge statistical methods available to us, but has a major disadvantage of often having to learn a completely different way of running a statistical analysis, even if it is fairly similar to ones you’ve used before. So having a framework for running statistical analyses, regardless of who created them, can provide that needed structure and vastly simplify the analysis. This session will be covering a general framework for running statistical analyses, regardless of the exact statistical method.
7.1 Learning objectives
The overall objective for this session is to:
- Describe the basic framework underlying most statistical analyses and use R to generate statistical results using this framework.
More specific objectives are to:
- Describe the general “workflow” and steps involved in stating the research question, constructing a model to help answer the question, preparing the data to match the requirements of the model, fitting it to the data, and finally extracting the results from the fitted model.
- Categorize the model definition step as a distinct, theory-driven step, separate from the data, and use
{parsnip}
functions to help with defining your model. - Identify various data transformation techniques and evaluate which are good options given the data. Use functions in the
{recipes}
package to apply these transformations to your data. - Use the
{broom}
package to extract the model results to later present them in graphical format (with{ggplot2}
). - Continue applying the concepts and functions used from the previous sessions.
Specific “anti”-objectives:
- Will not know how to choose and apply the appropriate statistical model or test, nor understand any statistical theory, nor interpret the statistical results correctly, nor determine the relevant data transformations for the statistical approach. What we show, we show only as demonstration purposes only, they could be entirely wrong in how to do them correctly if an expert were to review them.
We will be making a lot of function throughout this session and the next. This is just a fair warning!
7.2 Exercise: What does a “model” mean?
Time: ~8 minutes.
In science and especially statistics, we talk a lot about “models”. But what does model actually mean? What different types of definitions can you think of? Is there a different understanding of model in statistics compared to other areas?
- Take 1 minute to think about your understanding of a model.
- Then, over the next 4 minutes, discuss with your neighbour about the meaning of “model” and see if you can come to a shared understanding.
- Finally, over the next 3 minutes, we will share all together what a model is in the context of data analysis.
7.3 Theory and “workflow” on statistical modeling
Let them read it over than go over it briefly, focusing on what a model is, that we should create research questions while thinking in the framework of models, and the general workflow for doing statistical analyses.
7.4 Reading task: ~15 minutes
Almost all fields of research, and definitely more heavily-quantitative and scientific fields like biomedicine and health, have math and statistics at the core of taking data to draw inferences or general observations about the world.
Any time we collect data and need to interpret what it means, we need statistics. And anytime we want to make inferences about the world from the data, we need to use statistics to determine the likelihood, or rather the uncertainty, in those inferences. Statistics is meant to quantify uncertainty.
How do we quantify uncertainty? By first creating a “theoretical model” that expresses mathematically our research question. For instance, we have a theoretical model that outdoor plants grow (non-linearly) with water and sunlight, but that more sunlight likely means less water (less rain). While any research question could be translated into a theoretical model, not all theoretical models can be tested against the real world. And that’s where the second part comes in: Our theoretical model needs to be structured in a way that allows us to measure the items (“parameters”) in our model, so that we can build a mathematical model from the data (“test it in the real world”).
Great research questions are designed in a way to fit a theoretical model on measurable parameters, so we can ultimately quantify the uncertainty in our observations (the data) and in the model. And the basic simplified math of a statistical model looks mostly the same:
\[y = intercept + x + error \]
So if we use the example of plant growth, it would look like:
\[Growth = Sunlight + Water\]
It’s a bit more complicated than this, but this is enough to describe it for this course. Throughout the rest of the session, we use specific terms to describe each item in this formula. “Outcome” (also called “dependent variable”) refers to the \(y\), “predictor” (or “independent variable”) refers to the \(x\). The error and intercept are calculated for us when we fit the model to the data. The intercept is when x is equal to zero (the “y-intercept” on a plot). The error is the difference between what the model estimates and the real value. It plays a role in quantifying the model’s uncertainty.
Considering the mathematical nature of statistical models, there is also a logic and “workflow” to making these models!
- Write a research question, designed in a way that might look like the diagram above. Usually this step needs to be revisited, revising the question after trying to construct the theoretical model, that describes the measurable (and unmeasurable) parameters in the model, and vice versa.
- Based on the model and the type of measured parameters used (continuous or binary), select the best mathematical model “type”. Nearly all models in statistics start from the base of a linear regression (e.g. ANOVA is a special form of regression, t-test is a simpler version of ANOVA), so the model “type” will probably be a form of regression.
- Measure your parameters (in the plant growth example, that might be the amount of water given in liters per day, amount of plant growth in weight, and amount of sunlight in hrs per day). Usually, this measured data need to be processed in a special way to fit the specifics of the model, research question, and practices of the field.
- Fit the data to the theoretical model in order to estimate the values (“coefficients”) of the model parameters as well as the uncertainty in those values.
- Extract the values and their uncertainty from the model and present them in relation to your research questions.
%%{init:{'theme':'forest', 'flowchart':{'nodeSpacing': 40, 'rankSpacing': 20}}}%% graph TD A[Research question] --> B((Statistical model)) B --> A B --> C[Data collection] C --> D[Data transformation] D --> E[Scientific output] E --> A linkStyle 0,1,2,3,4,5 stroke-width:1px;
The entire workflow for building statistical models requires highly specific domain knowledge on not only the statistics themselves, but also how the data was collected, what the values mean, what type of research questions to ask and how to ask them, how to interpret the results from the models, and how to process the data to fit the question and model.
For instance, in our lipidomics
dataset, if we were to actually use this data, we would need someone familiar with -omic technologies, how the data are measured, what the values actually mean, how to prepare them for the modeling, the specific modeling methods used for this field, and how we would actually interpret the findings. We have none of these things, so very likely we are doing things quite wrong here. We’re only doing this modeling to highlight how to use the R packages.
A few things to repeat and reinforce:
- The workflow of the image and that it all starts with the research question.
- The fact that almost all statistical methods are basically special forms of linear regression.
- That this model creation stage requires a variety of domain expertise, not just statistical expertise.
Going back to our own lipidomics
dataset, we need to do the first step: Creating the question. While we don’t have much data, there are a surprising number of questions we could ask. But we will keep it very simple, very basic, and very exploratory.
- What is the estimated relationship of each metabolite with T1D compared to the controls, adjusting for the influence of age and gender?
- What is the variability for the estimate in each relationship?
Next, because we are working within a “reproducible analysis” framework specifically with the use of {targets}
, let’s convert these questions in outputs to include as pipeline targets, along with a basic idea for the final functions that will make up these targets and their inputs and outputs. These targets will probably be quite different by the end, but it’s a good start to think about what it should look like in the end.
- All results for estimated relationships (in case we want to use it for other output)
- All results for variation in estimates of relationships (in case we want to use it for other output)
- Plot of statistical estimate for each relationship
- Plot of variation in estimates for each relationship
Potential function names might be:
calculate_estimates()
calculate_variation()
plot_estimates()
plot_variation()
%%{init:{'theme':'forest', 'flowchart':{'nodeSpacing': 20, 'rankSpacing': 30}}}%% graph TB lipidomics -- "calculate_estimates()" --> model_est[Model estimates] model_est -- "plot_estimates()" --> plot_est[Plot of estimates] lipidomics -- "calculate_variation()" --> model_var[Model variation] model_var -- "plot_variation()" --> plot_var[Plot of variation] plot_est & plot_var -- "tar_read()" --> rmd[R Markdown] linkStyle 0,1,2,3,4,5 stroke-width:1px;
7.5 Defining the model
Verbally walk through this section, describing the theoretical model both graphically and mathematically. Go through why we use {tidymodels}
rather than other approaches.
Now that we’ve talked about the workflow around making models and have already written out some research questions, let’s make a basic, graphical theoretical model:
%%{init:{'theme':'forest', 'flowchart':{'nodeSpacing': 20, 'rankSpacing':30}, 'themeVariables': { 'edgeLabelBackground': 'transparent'}}}%% graph TB Metabolite --> T1D Age & Gender --> T1D & Metabolite linkStyle 0,1,2,3,4 stroke-width:1px;
Or mathematically:
\[T1D = metabolite + age + gender\]
So, T1D status (or class
in the lipidomics
dataset) is our outcome and the individual metabolite, age, and gender are our predictors. Technically, age and gender would be “confounders” or “covariates”, since we include them only because we think they influence the relationship between the metabolite and T1D.
If we convert the formula into a form with the variables we have in the dataset as well as selecting only one metabolite for now (the cholesterol metabolite, which we add “metabolite” to differentiate it from other potential variables), it would be:
\[class = metabolite\_cholesterol + age + gender\]
Now that we have a theoretical model, we need to choose our model type. Since T1D is binary (either you have it or you don’t), the most likely choice is logistic regression, which requires a binary outcome variable. So we have the theoretical model and the type of model to use - how do we express this as code in R? There are many ways of doing the same thing in R, but some are a bit easier than others. One such approach, that is quite generic and fits with the ideals of the {tidyverse}
, is a similar universe of packages called the {tidymodels}
.
Why do we teach {tidymodels}
? Because they are built by software developers, employed by RStudio (who also employs the people who build the {tidyverse}
), and they have a strong reputation for writing good documentation. Plus, the {tidymodels}
set of packages also make creating and using models quite generic, so by teaching you these sets of tools, you can relatively easily change the model type, or how you process the data, or other specifications without having to learn a whole new package or set of tools.
The reason {tidymodels}
can do that is because they designed it in a way that makes a clear separation in the components of the model building workflow that was described above, through the use of specific packages for each component.
Package | Description |
---|---|
{parsnip} |
Model definition, such as type (e.g. linear_reg() ) and “engine” (e.g. glm() ). |
{recipes} |
Model-specific data transformations, such as removing missing values, or standardizing the data. |
{workflows} |
Combining model definition, data, and transformations to calculate the estimates and uncertainty. |
We’ll start with the {parsnip}
package. Functions in this package are used to set the details of the model you want to use. Specifically, functions to indicate the model “type” (e.g. linear regression) and the set_engines()
function to determine the “engine” to run the type (which R-based algorithm to use, like glm()
compared to lm()
). Check out the Examples page for code you might use depending on the model you want. The most commonly used model types would be linear_reg()
, logistic_reg()
, and multinom_reg()
.
We want to use logistic regression. So, open the doc/lesson.qmd
file and in the setup
code chunk add library(tidymodels)
, so it looks like:
```{{r setup}}
targets::tar_config_set(store = here::here("_targets"))
library(tidyverse)
library(tidymodels)
source(here::here("R/functions.R"))
lipidomics <- tar_read(lipidomics)
```
Then, delete everything below this code chunk.
Since we will be using {tidymodels}
, we need to install it, as well as explicitly add the {parsnip}
, {recipes}
, and {workflows}
packages. Like {tidyverse}
, we need to set {tidymodels}
differently because it is a “meta-package”. We might need to force installing it with install.packages("tidymodels")
so {renv}
recognizes it.
use_package("tidymodels", "depends")
# install.packages("tidymodels")
use_package("parsnip")
use_package("recipes")
use_package("workflows")
Before continuing, let’s commit the changes to the Git history. Next, in the doc/lesson.qmd
file, on the bottom of the document create a new header and code chunk:
## Building the model
```{r}
```
In the new code chunk, we will set up the model specs:
<- logistic_reg() %>%
log_reg_specs set_engine("glm")
log_reg_specs
Logistic Regression Model Specification (classification)
Computational engine: glm
Running this on it’s own doesn’t show much, as you can see. But we’ve now set the model we want to use.
7.6 Exercise: How would you define a linear regression with parsnip?
Time: ~10 minutes.
Using parsnip’s “Examples” vignette as well as the code we wrote for the logistic regression above as a template, write {parsnip}
code that would define a simple (an engine of"lm"
) linear regression model. Begin by making a new Markdown header and code chunk at the bottom of the doc/lesson.qmd
file, like listed below:
## Exercises
### Linear regression model definition
```{r}
```
After writing the code, style the file using the Palette (, then type “style file”). We will eventually delete these exercise text in the R Markdown file, but for now, commit the changes to the Git history.
Click for the solution. Only click if you are struggling or are out of time.
<- linear_reg() %>%
linear_reg_specs set_engine("lm")
7.7 Data transformations specific to modeling
Setting the model type was pretty easy right? The more difficult part comes next with the data transformations. {recipes}
functions are almost entirely used to apply transformations that a model might specifically need, like mean-centering, removing missing values, and other aspects of data processing.
Let’s consider our lipidomics
dataset. In order for us to start our statistical analysis, we need the data to be structured in a certain way to be able to smoothly use it as input in our model. We have at least three easy observations on necessary transformations of the data, two of which ca be fixed with a single {tidyr}
function, while the third one can be fixed with {recipes}
. Can you spot them?
lipidomics
# A tibble: 504 × 6
code gender age class metabolite value
<chr> <chr> <dbl> <chr> <chr> <dbl>
1 ERI109 M 25 CT TMS (interntal standard) 208.
2 ERI109 M 25 CT Cholesterol 19.8
3 ERI109 M 25 CT Lipid CH3- 1 44.1
4 ERI109 M 25 CT Lipid CH3- 2 147.
5 ERI109 M 25 CT Cholesterol 27.2
6 ERI109 M 25 CT Lipid -CH2- 587.
7 ERI109 M 25 CT FA -CH2CH2COO- 31.6
8 ERI109 M 25 CT PUFA 29.0
9 ERI109 M 25 CT Phosphatidylethanolamine 6.78
10 ERI109 M 25 CT Phosphatidycholine 41.7
# ℹ 494 more rows
Ask them if they can spot these differences. Give them a few minutes to think and respond.
The first observation isn’t always an issue and depends heavily on the model type you use. Since we are using logistic regression, the model assumes that each row is an individual person. But our data is in the long format, so each person has multiple rows. The second observation is that there seems to be a data input error, since there are three Cholesterol
values, while all other metabolites only have one:
%>%
lipidomics count(code, metabolite) %>%
filter(n > 1)
# A tibble: 36 × 3
code metabolite n
<chr> <chr> <int>
1 ERI109 Cholesterol 3
2 ERI111 Cholesterol 3
3 ERI140 Cholesterol 3
4 ERI142 Cholesterol 3
5 ERI143 Cholesterol 3
6 ERI144 Cholesterol 3
7 ERI145 Cholesterol 3
8 ERI146 Cholesterol 3
9 ERI147 Cholesterol 3
10 ERI149 Cholesterol 3
# ℹ 26 more rows
We can fix both the long format and multiple cholesterol issues by using tidyr::pivot_wider()
. Before we do, the last issue is that each metabolite has quite large differences in the values and ranges of data. Again, whether this is an issue depends on what we want to do, but in our research question we want to know how each metabolite influences T1D. In order to best interpret the results and compare across metabolites, we should ideally have all the metabolites with a similar range and distribution of values.
Let’s fix the first two issues first. While we probably only need to use pivot_wider()
, we should probably first tidy up the metabolite names first so they make better column names. We do that by combining mutate()
with snakecase::to_snake_case()
. In the doc/lesson.qmd
file, we rename the metabolite names before using pivot_wider()
. Since we want an easy way of identifying columns that are metabolites, we will add a "metabolite_"
prefix using the argument names_prefix
. To actually fix the multiple cholesterol issue, we should look more into the data documentation or contact the authors. But for this course, we will merge the values by calculating a mean before pivoting. We do this by setting the values_fn
with mean
.
<- lipidomics %>%
lipidomics_wide mutate(metabolite = snakecase::to_snake_case(metabolite)) %>%
pivot_wider(
names_from = metabolite,
values_from = value,
values_fn = mean,
names_prefix = "metabolite_"
) lipidomics_wide
# A tibble: 36 × 16
code gender age class metabolite_tms_interntal_standard
<chr> <chr> <dbl> <chr> <dbl>
1 ERI109 M 25 CT 208.
2 ERI111 M 39 CT 219.
3 ERI163 W 58 CT 57.1
4 ERI375 M 24 CT 19.2
5 ERI376 M 26 CT 35.4
6 ERI391 M 31 CT 30.4
7 ERI392 M 24 CT 21.7
8 ERI79 W 26 CT 185.
9 ERI81 M 52 CT 207.
10 ERI83 M 25 CT 322.
# ℹ 26 more rows
# ℹ 11 more variables: metabolite_cholesterol <dbl>,
# metabolite_lipid_ch_3_1 <dbl>, metabolite_lipid_ch_3_2 <dbl>,
# metabolite_lipid_ch_2 <dbl>, metabolite_fa_ch_2_ch_2_coo <dbl>,
# metabolite_pufa <dbl>,
# metabolite_phosphatidylethanolamine <dbl>,
# metabolite_phosphatidycholine <dbl>, …
Since we’re using a function-oriented workflow and since we will be using this code again later on, let’s convert both the “metabolite to snakecase” and “pivot to wider” code into their own functions, before moving them over into the R/functions.R
file.
<- function(data) {
metabolite_values_to_snakecase %>%
data ::mutate(metabolite = snakecase::to_snake_case(metabolite))
dplyr
}%>%
lipidomics metabolite_values_to_snakecase()
# A tibble: 504 × 6
code gender age class metabolite value
<chr> <chr> <dbl> <chr> <chr> <dbl>
1 ERI109 M 25 CT tms_interntal_standard 208.
2 ERI109 M 25 CT cholesterol 19.8
3 ERI109 M 25 CT lipid_ch_3_1 44.1
4 ERI109 M 25 CT lipid_ch_3_2 147.
5 ERI109 M 25 CT cholesterol 27.2
6 ERI109 M 25 CT lipid_ch_2 587.
7 ERI109 M 25 CT fa_ch_2_ch_2_coo 31.6
8 ERI109 M 25 CT pufa 29.0
9 ERI109 M 25 CT phosphatidylethanolamine 6.78
10 ERI109 M 25 CT phosphatidycholine 41.7
# ℹ 494 more rows
This on its own should work. However, the column we want to change might not always be called metabolite
, or we might want to change it later. So, to make this function a bit more generic, we can use something called “curly-curly” (it looks like {{}}
when used) and “non-standard evaluation” (NSE).
When you write your own functions that make use of functions in the {tidyverse}
, you may eventually encounter an error that might not be very easy to figure out. Here’s a very simple example using select()
, where one of your function’s arguments is to select columns:
<- function(data, column) {
test_nse %>%
data ::select(column)
dplyr
}
%>%
lipidomics test_nse(class)
Error in `dplyr::select()`:
! Can't subset columns that don't exist.
✖ Column `column` doesn't exist.
The error occurs because of something called “non-standard evaluation” (or NSE). NSE is a major feature of R and is used quite a lot throughout R. NSE is used a lot in the {tidyverse}
packages. It’s one of the first things computer scientists complain about when they use R, because it is not a common thing in other programming languages. But NSE is what allows you to use formulas (e.g. y ~ x + x2
in modeling, which we will show shortly) or allows you to type out select(class, age)
or library(purrr)
. In “standard evaluation”, these would instead be select("Gender", "BMI")
or library("purrr")
. So NSE gives flexibility and ease of use for the user (we don’t have to type quotes every time) when doing data analysis, but can give some headaches when programming in R, like when making functions. There’s more detail about this on the dplyr website, which lists some options to handle NSE while programming. The easiest approach is to wrap the argument with “curly-curly” ({{}}
).
<- function(data, columns) {
test_nse %>%
data ::select({{ columns }})
dplyr
}
%>%
lipidomics test_nse(class)
# A tibble: 504 × 1
class
<chr>
1 CT
2 CT
3 CT
4 CT
5 CT
6 CT
7 CT
8 CT
9 CT
10 CT
# ℹ 494 more rows
%>%
lipidomics test_nse(c(class, age))
# A tibble: 504 × 2
class age
<chr> <dbl>
1 CT 25
2 CT 25
3 CT 25
4 CT 25
5 CT 25
6 CT 25
7 CT 25
8 CT 25
9 CT 25
10 CT 25
# ℹ 494 more rows
You don’t need to go over what they read, you can continue with making the function below. Unless learners have some questions.
We can use curly-curly (combined with across()
) to apply snakecase::to_snake_case()
to columns of our choice.
<- function(data, cols) {
column_values_to_snakecase %>%
data ::mutate(dplyr::across({{ cols }}, snakecase::to_snake_case))
dplyr
}
%>%
lipidomics column_values_to_snakecase(metabolite)
# A tibble: 504 × 6
code gender age class metabolite value
<chr> <chr> <dbl> <chr> <chr> <dbl>
1 ERI109 M 25 CT tms_interntal_standard 208.
2 ERI109 M 25 CT cholesterol 19.8
3 ERI109 M 25 CT lipid_ch_3_1 44.1
4 ERI109 M 25 CT lipid_ch_3_2 147.
5 ERI109 M 25 CT cholesterol 27.2
6 ERI109 M 25 CT lipid_ch_2 587.
7 ERI109 M 25 CT fa_ch_2_ch_2_coo 31.6
8 ERI109 M 25 CT pufa 29.0
9 ERI109 M 25 CT phosphatidylethanolamine 6.78
10 ERI109 M 25 CT phosphatidycholine 41.7
# ℹ 494 more rows
Move this new function over into the R/functions.R
file, add Roxygen comments (have the cursor inside the function, type Ctrl-Shift-P
, then type “roxygen”), style using the Palette (, then type “style file”), source()
the modified R/functions.R
file, and add the new function above the pivot_wider()
code in the doc/lessons.Rmd
file.
#' Convert column value strings into snakecase.
#'
#' @param data Data with string columns.
#' @param cols The column to convert into snakecase.
#'
#' @return A data frame.
#'
<- function(data, cols) {
column_values_to_snakecase %>%
data ::mutate(dplyr::across({{ cols }}, snakecase::to_snake_case))
dplyr }
7.8 Exercise: Convert the pivot code into a function
Time: ~10 minutes.
Just like with the mutate()
, take the pivot_wider()
code and convert it into a new function.
- Name the new function
metabolites_to_wider
. - Include two arguments in the new
function()
:data
andvalues_fn
. Set the default forvalues_fn
to bemean
. We add this argument in case we want to merge the duplicate cholesterol variables with something other than the mean. - Use
data %>%
at the beginning, like we did with thecolumn_values_to_snakecase()
. Inside thepivot_wider()
code, replacevalues_fn = mean
withvalues_fn = values_fn
. - Use
tidyr::
before thepivot_wider()
function. - Add Roxygen comments (have the cursor inside the function, type
Ctrl-Shift-P
, then type “roxygen”). - Move the function into the
R/functions.R
file. - Replace the code in the
doc/lesson.qmd
file to make use of the new functions.
Click for the solution. Only click if you are struggling or are out of time.
#' Convert the metabolite long format into a wider one.
#'
#' @param data The lipidomics dataset.
#' @param values_fn A function to summarize the multiple cholesterol values.
#'
#' @return A wide data frame.
#'
<- function(data, values_fn = mean) {
metabolites_to_wider %>%
data ::pivot_wider(
tidyrnames_from = metabolite,
values_from = value,
values_fn = values_fn,
names_prefix = "metabolite_"
) }
7.9 Using recipes to manage transformations
We’ve used {dplyr}
and {tidyr}
to start fixing some of the issues with the data. But we still have the third issue: How to make the results between metabolites comparable. That’s where we use {recipes}
.
The first function is recipe()
and it takes two forms: with or without a formula. Remember the model formula we mentioned previously? Well, here is where we can use it to tell {recipes}
about the model formula we intend to use so it knows on what variables to apply the chosen transformations.
For a few reasons that will be clearer later, we won’t ultimately use the formula form of recipe()
, but will show how it works. The
recipe(class ~ metabolite_cholesterol + age + gender, data = lipidomics_wide)
── Recipe ──────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 3
The alternative approach is to set “roles” using update_roles()
. Instead of using a formula and letting recipe()
infer the outcome and predictors, we can explicitly select which variables are which. This has some nice features that we will use later on.
recipe(lipidomics_wide) %>%
update_role(metabolite_lipid_ch_3_1, age, gender, new_role = "predictor") %>%
update_role(class, new_role = "outcome")
── Recipe ──────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 3
undeclared role: 12
The next “step” is to select a transformation function.
7.10 Exercise: Which transformations make the most sense?
Time: ~15 minutes.
Look at the list of step_*
functions below and use the ?
or F1 (while having the cursor on the function name) to access the help documentation. Consider the metabolite data in the lipidomics
dataset. Which of these transformations might you use?
::step_log()
recipes::step_scale()
recipes::step_normalize()
recipes::step_center()
recipes::step_sqrt() recipes
- With your neighbour (or group), justify which
step_
transformation you might use for the numeric metabolite data. - In the last 2 minutes of the exercise, we will all share our thoughts.
There are so many useful transformation functions available. For instance, if you often have to impute data, there are functions for that. You can check them out in the Console by typing recipes::step_impute_
then hit the Tab key to see a list of them. Or, if you have some missing values, there’s also the recipes::step_naomit()
.
The step_
function we use in the text of this website in later sections may be different from what you decide on in your group and in the class as a whole. There isn’t strictly a “right” answer here, since it would ultimately require domain expertise in both lipidomic quantification and statistical analysis of -omic data. But we ultimate need to show and use something in the text.
7.11 Creating a transformation “recipe”
You will use whichever transformation function your group decided on in the exercise above, but for the text of this website, we will use step_normalize()
. This function is useful because it makes each variable centered to zero and a value of 1 unit is translated to 1 standard deviation of the original distribution. This means we can more easily compare values between variables. If we add this to the end of the recipe:
recipe(lipidomics_wide) %>%
update_role(metabolite_lipid_ch_3_1, age, gender, new_role = "predictor") %>%
update_role(class, new_role = "outcome") %>%
step_normalize(starts_with("metabolite_"))
── Recipe ──────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 3
undeclared role: 12
── Operations
• Centering and scaling for: starts_with("metabolite_")
Next thing to do is convert this into a function, using the same workflow we’ve been using (which means this needs to be in the R/functions.R
script). We’ll also use the curly-curly again, since we might use a different metabolite later. Note, when adding all the packagename::
to each function, the starts_with()
function comes from the {tidyselect}
package.
#' A transformation recipe to pre-process the data.
#'
#' @param data The lipidomics dataset.
#' @param metabolite_variable The column of the metabolite variable.
#'
#' @return
#'
<- function(data, metabolite_variable) {
create_recipe_spec ::recipe(data) %>%
recipes::update_role({{ metabolite_variable }}, age, gender, new_role = "predictor") %>%
recipes::update_role(class, new_role = "outcome") %>%
recipes::step_normalize(tidyselect::starts_with("metabolite_"))
recipes }
And test it out:
<- lipidomics_wide %>%
recipe_specs create_recipe_spec(metabolite_cholesterol)
recipe_specs
── Recipe ──────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 3
undeclared role: 12
── Operations
• Centering and scaling for: tidyselect::starts_with("metabolite_")
Style using the Palette (, then type “style file”), then commit the changes made to the Git history.
7.12 Fitting the model by combining the recipe, model definition, and data
We’ve now defined the model we want to use and created a transformation {recipes}
specification. Now we can start putting them together and finally fit them to the data. This is done with the {workflows}
package.
Why use this package, rather than simply run the statistical analysis and process the data as normal? When running multiple models (like we will do in the next section) that may require different data structures, the data transformation steps have to happen right before the data is fit to the model and need to be done on exactly the data used by the model. So if we have one data frame that we run multiple models on, but the transformation happens to the whole data frame, we could end up with issues due to how the transformations were applied. The {workflows}
package keeps track of those things for us, so we can focus on the higher level thinking rather than on the small details of running the models.
The {workflows}
package has a few main functions for combining the recipe with the model specs, as well as several for updating an existing workflow (which might be useful if you need to run many models of slightly different types). All model workflows need to start with workflow()
, followed by two main functions: add_model()
and add_recipe()
. Can you guess what they do?
workflow() %>%
add_model(log_reg_specs) %>%
add_recipe(recipe_specs)
══ Workflow ════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────────────
1 Recipe Step
• step_normalize()
── Model ───────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)
Computational engine: glm
While this code is already pretty concise, let’s convert it into a function to make it simplified. We’ll use the same function-oriented workflow that we’ve used before, where the function should ultimately be inside the R/functions.R
file.
#' Create a workflow object of the model and transformations.
#'
#' @param model_specs The model specs
#' @param recipe_specs The recipe specs
#'
#' @return A workflow object
#'
<- function(model_specs, recipe_specs) {
create_model_workflow ::workflow() %>%
workflows::add_model(model_specs) %>%
workflows::add_recipe(recipe_specs)
workflows }
Instead of using the previously created objects, let’s start the model creation from scratch:
<- create_model_workflow(
model_workflow logistic_reg() %>%
set_engine("glm"),
%>%
lipidomics_wide create_recipe_spec(metabolite_cholesterol)
) model_workflow
══ Workflow ════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────────────
1 Recipe Step
• step_normalize()
── Model ───────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)
Computational engine: glm
Now, we can do the final thing: Fitting the data to the model with fit()
! 🙌
<- model_workflow %>%
fitted_model fit(lipidomics_wide)
fitted_model
══ Workflow [trained] ══════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────────────
1 Recipe Step
• step_normalize()
── Model ───────────────────────────────────────────────────────────
Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
Coefficients:
(Intercept) genderW
0.1053 -0.7066
age metabolite_cholesterol
0.0069 1.0880
Degrees of Freedom: 35 Total (i.e. Null); 32 Residual
Null Deviance: 49.9
Residual Deviance: 42.8 AIC: 50.8
This gives us a lot of information, but what we are mostly interested in is the model estimates themselves. While this fitted_model
object contains a lot additional information inside, {workflows}
thankfully has a function to extract the information we want. In this case, it is the extract_fit_parsnip()
function.
%>%
fitted_model extract_fit_parsnip()
parsnip model object
Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
Coefficients:
(Intercept) genderW
0.1053 -0.7066
age metabolite_cholesterol
0.0069 1.0880
Degrees of Freedom: 35 Total (i.e. Null); 32 Residual
Null Deviance: 49.9
Residual Deviance: 42.8 AIC: 50.8
To get this information in a tidier format, we use another function: tidy()
. This function comes from the {broom}
package, which is part of the {tidymodels}
. But we should explicitly add it to the dependencies:
use_package("broom")
Then, we add the tidy()
function to our model using the %>%
pipe. Since we are using a logistic regression model, we need to consider how we want the estimates to be presented, probably depending on how we want to visualize our results. If we set exponentiate = TRUE
in tidy()
, the output estimates will be odds ratios, if we set exponentiate = FALSE
, we will get the log odds ratios or the beta coefficient. Here we choose exponentiate = TRUE
:
%>%
fitted_model extract_fit_parsnip() %>%
tidy(exponentiate = TRUE)
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1.11 1.29 0.0817 0.935
2 genderW 0.493 0.779 -0.907 0.365
3 age 1.01 0.0377 0.183 0.855
4 metabolite_cholesterol 2.97 0.458 2.38 0.0175
We now have a data frame of our model results! Like we did with the workflows()
code that we converted into a function, we do the same thing here: Make another function (and move it to R/functions.R
)! 😛
#' Create a tidy output of the model results.
#'
#' @param workflow_fitted_model The model workflow object that has been fitted.
#'
#' @return A data frame.
#'
<- function(workflow_fitted_model) {
tidy_model_output %>%
workflow_fitted_model ::extract_fit_parsnip() %>%
workflows::tidy(exponentiate = TRUE)
broom }
Replacing the code in the doc/lesson.qmd
file to use the function.
%>%
fitted_model tidy_model_output()
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1.11 1.29 0.0817 0.935
2 genderW 0.493 0.779 -0.907 0.365
3 age 1.01 0.0377 0.183 0.855
4 metabolite_cholesterol 2.97 0.458 2.38 0.0175
If we revise the code so it is one pipe, it would look like:
create_model_workflow(
logistic_reg() %>%
set_engine("glm"),
%>%
lipidomics_wide create_recipe_spec(metabolite_cholesterol)
%>%
) fit(lipidomics_wide) %>%
tidy_model_output()
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1.11 1.29 0.0817 0.935
2 genderW 0.493 0.779 -0.907 0.365
3 age 1.01 0.0377 0.183 0.855
4 metabolite_cholesterol 2.97 0.458 2.38 0.0175
Let’s briefly cover what these columns and values mean.
If you want, you can go over these details briefly or in more detail, depending on how comfortable you are. Or you can get them to read it only.
Let’s explain this output a bit, each column at a time:
term
: If you recall the formula \(class = metabolite + sex + gender\), you’ll see all but theclass
object there in the columnterm
. This column contains all the predictor variables, including the intercept (from the original model).estimate
: This column is the “coefficient” linked to the term in the model. The final mathematical model here looks like:\[ \displaylines{class = Intercept + (metabolite\_estimate \times metabolite\_value) + \\ (gender\_estimate \times gender\_value) + ...}\]
In our example, we chose to get the odds ratios. In the mathematical model above, the estimate is represented as the log odds ratio or beta coefficient - the constant value you multiply the value of the term with. Interpreting each of these values can be quite tricky and can take a surprising amount of time to conceptually break down, so we won’t do that here, since this isn’t a statistics course. The only thing you need to understand here is that the
estimate
is the value that tells us the magnitude of association between the term andclass
. This value, along with thestd.error
are the most important values we can get from the model and we will be using them when presenting the results.std.error
: This is the uncertainty in theestimate
value. A higher value means there is less certainty in the value of theestimate
.statistic
: This value is used to, essentially, calculate thep.value
.p.value
: This is the infamous value we researchers go crazy for and think nothing else of. While there is a lot of attention to this single value, we tend to give it more attention than warranted. The interpretation of the p-value is even more difficult than theestimate
and again, we won’t cover this in this course. We won’t be using this value at all in presenting the results.
Before ending, open the Git interface and commit the changes you made. Then push your changes up to GitHub.
7.13 Summary
- Create research questions that (ideally) are structured in a way to mimic how the statistical analysis will be done, preferably in a “formula” style like \(y = x1 + x2 + ... + error\) and in a diagram style with links connecting variables.
- Statistical analyses, while requiring some trial and error, are surprisingly structured in the workflow and steps taken. Use this structure to help guide you in completing tasks related to running analyses.
- Use
{parsnip}
functions to define the model you want to use, likelogistic_reg()
for logistic regression, and set the computational “engine” withset_engine()
. - Use
{recipes}
functions to set up the data transformation steps necessary to effectively run the statistical analysis, like adding variable “roles” (outcome vs predictor) usingupdate_roles()
and adding transformation steps using any of the dozen differentstep_
functions. - Use
{workflows}
functions to develop an analysisworkflow()
that combines the defined model withadd_model()
, the transformation steps withadd_recipe()
, and the data withfit()
. - Use
{broom}
totidy()
the model output, extracted usingextract_fit_parsnip()
to get a data frame of the estimates and standard error for the variables in the model.