
# 7 A general approach to doing statistical analyses

Running statistical analyses is a relatively methodical and well-defined process, even if there is often a lot of trial and error involved. Sometimes it may feel overwhelming and complicated, which it definitely can be, but it doesn’t have to be. By taking a structured approach to running statistical analyses, you can make it easier on yourself and feel more in control.

In R, statistical methods are often created and developed by researchers with little to no training in software development, so each package tends to have its own way of being used. This has some advantages, like having cutting-edge statistical methods available to us, but it has a major disadvantage: you often have to learn a completely new way of running a statistical analysis, even one fairly similar to analyses you’ve done before. Having a framework for running statistical analyses, regardless of who created them, provides that needed structure and vastly simplifies the analysis. This session covers a general framework for running statistical analyses, regardless of the exact statistical method.

## 7.1 Learning objectives

The overall objective for this session is to:

- Describe the basic framework underlying most statistical analyses and use R to generate statistical results using this framework.

More specific objectives are to:

- Describe the general “workflow” and steps involved in stating the research question, constructing a model to help answer the question, preparing the data to match the requirements of the model, fitting it to the data, and finally extracting the results from the fitted model.
- Categorize the model definition step as a distinct, theory-driven step, separate from the data, and use parsnip functions to help with defining your model.
- Identify various data transformation techniques and evaluate which are good options given the data. Use functions in the recipes package to apply these transformations to your data.
- Use the broom package to extract the model results to later present them in graphical format (with ggplot2).
- Continue applying the concepts and functions used from the previous sessions.

Specific “anti”-objectives:

- You will **not** learn how to choose and apply the appropriate statistical model or test, understand the underlying statistical theory, interpret the statistical results correctly, or determine the relevant data transformations for a given statistical approach. What we show is *for demonstration purposes only*; an expert reviewing these analyses might find them entirely wrong.

We will be making *a lot* of functions throughout this session and the next. This is just a fair warning!

## 7.2 Exercise: What does a “model” mean?

Time: ~6 minutes.

In science, and especially statistics, we talk a lot about “models”. But what does “model” actually mean? What different definitions can you think of? Is “model” understood differently in statistics than in other areas?

- Take 1 minute to think about your understanding of a model.
- Then, over the next 3 minutes, discuss with your neighbour about the meaning of “model” and see if you can come to a shared understanding.
- Finally, over the next 2 minutes, we will share all together what a model is in the context of data analysis.

## 7.3 Theory and “workflow” on statistical modeling

## 7.4 Defining the model

Let’s switch to working in the `doc/learning.qmd` file to create those logistic regression models. Since we’ve already created some items in the targets pipeline, we need to tell the Quarto file where to find this “store” of outputs and import the data with `tar_read()`. We also need to add `library(tidymodels)` to the `setup` code chunk. Copy and paste the code chunk below into the Quarto file.

```
```{r setup}
targets::tar_config_set(store = here::here("_targets"))
library(tidyverse)
library(targets)
library(tidymodels)
source(here::here("R/functions.R"))
lipidomics <- tar_read(lipidomics)
```
```

Since we will be using tidymodels, we need to install it, as well as explicitly add the parsnip, recipes, and workflows packages. Like tidyverse, tidymodels is a “meta-package”, so we need to add it differently. We might also need to force installing it with `pak::pak("tidymodels")` so that renv recognizes it.

```
use_package("tidymodels", "depends")
# pak::pak("tidymodels")
use_package("parsnip")
use_package("recipes")
use_package("workflows")
```

Before continuing, let’s **commit** the changes to the Git history with the Palette (type “commit”). Next, at the bottom of the `doc/learning.qmd` file, create a new header and code chunk:

```
## Building the model

```{r}
```
```

In the new code chunk, we will set up the model specs:

```
log_reg_specs <- logistic_reg() %>%
  set_engine("glm")
log_reg_specs
```

```
#> Logistic Regression Model Specification (classification)
#>
#> Computational engine: glm
```

Running this on its own doesn’t show much, as you can see. But we’ve now set the model we want to use.
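One nice side effect of parsnip separating the model definition from the computational “engine” is that swapping engines is a one-line change. As a purely hypothetical sketch (not used in this course; the glmnet engine would also require the glmnet package and a `penalty` value):

```
# Hypothetical: same model type, different computational engine
logistic_reg(penalty = 0.1) %>%
  set_engine("glmnet")
```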

## 7.5 Data transformations specific to modeling

Setting the model type was pretty easy, right? The more difficult part comes next: the data transformations. recipes functions are almost entirely used to apply transformations that a model might specifically need, like mean-centering or removing missing values, along with other aspects of data processing.

Let’s consider our `lipidomics` dataset. To start our statistical analysis, the data needs to be structured in a certain way so we can smoothly use it as input to our model. There are at least three necessary transformations we can easily spot: two can be fixed with a single tidyr function, while the third can be fixed with recipes. Can you spot them?

`lipidomics`

```
#> # A tibble: 504 × 6
#> code gender age class metabolite value
#> <chr> <chr> <dbl> <chr> <chr> <dbl>
#> 1 ERI109 M 25 CT TMS (interntal standard) 208.
#> 2 ERI109 M 25 CT Cholesterol 19.8
#> 3 ERI109 M 25 CT Lipid CH3- 1 44.1
#> 4 ERI109 M 25 CT Lipid CH3- 2 147.
#> 5 ERI109 M 25 CT Cholesterol 27.2
#> 6 ERI109 M 25 CT Lipid -CH2- 587.
#> 7 ERI109 M 25 CT FA -CH2CH2COO- 31.6
#> 8 ERI109 M 25 CT PUFA 29.0
#> 9 ERI109 M 25 CT Phosphatidylethanolamine 6.78
#> 10 ERI109 M 25 CT Phosphatidycholine 41.7
#> # ℹ 494 more rows
```

The first observation isn’t always an issue and depends heavily on the model type you use. Since we are using logistic regression, the model assumes that each row is an individual person. But our data is in the long format, so each person has multiple rows. The second observation is that there seems to be a data input error, since there are three `Cholesterol` values per person, while all other metabolites have only one:
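The output below, showing only the duplicated `Cholesterol` rows, can be generated by counting rows per person and metabolite and keeping the duplicates; a sketch of such a check (the exact code used here is an assumption):

```
# Count metabolite rows per person; duplicates indicate repeated entries
lipidomics %>%
  count(code, metabolite) %>%
  filter(n > 1)
```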

```
#> # A tibble: 36 × 3
#> code metabolite n
#> <chr> <chr> <int>
#> 1 ERI109 Cholesterol 3
#> 2 ERI111 Cholesterol 3
#> 3 ERI140 Cholesterol 3
#> 4 ERI142 Cholesterol 3
#> 5 ERI143 Cholesterol 3
#> 6 ERI144 Cholesterol 3
#> 7 ERI145 Cholesterol 3
#> 8 ERI146 Cholesterol 3
#> 9 ERI147 Cholesterol 3
#> 10 ERI149 Cholesterol 3
#> # ℹ 26 more rows
```

We can fix both the long format and the multiple-cholesterol issues with `tidyr::pivot_wider()`. Before we do, the last issue is that the metabolites have quite large differences in their values and ranges. Again, whether this is an issue depends on what we want to do, but our research question asks how each metabolite influences T1D. To best interpret the results and compare across metabolites, all the metabolites should ideally have a similar range and distribution of values.
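You can see these range differences yourself with a quick summary; a sketch of such a check (not part of the analysis pipeline we are building):

```
# Compare value ranges across metabolites
lipidomics %>%
  group_by(metabolite) %>%
  summarise(min = min(value), max = max(value), sd = sd(value))
```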

Let’s fix the first two issues. While we probably only need `pivot_wider()`, we should first tidy up the metabolite names so they make better column names. We do that by combining `mutate()` with `snakecase::to_snake_case()`. In the `doc/learning.qmd` file, we rename the metabolites before using `pivot_wider()`. Since we want an easy way of identifying columns that are metabolites, we will add a `"metabolite_"` prefix using the `names_prefix` argument. To properly fix the multiple-cholesterol issue, we should look more into the data documentation or contact the authors. But for this course, we will merge the values by calculating a mean before pivoting, by setting `values_fn = mean`.

```
lipidomics_wide <- lipidomics %>%
  mutate(metabolite = snakecase::to_snake_case(metabolite)) %>%
  pivot_wider(
    names_from = metabolite,
    values_from = value,
    values_fn = mean,
    names_prefix = "metabolite_"
  )
lipidomics_wide
```

```
#> # A tibble: 36 × 16
#> code gender age class metabolite_tms_interntal_standard
#> <chr> <chr> <dbl> <chr> <dbl>
#> 1 ERI109 M 25 CT 208.
#> 2 ERI111 M 39 CT 219.
#> 3 ERI163 W 58 CT 57.1
#> 4 ERI375 M 24 CT 19.2
#> 5 ERI376 M 26 CT 35.4
#> 6 ERI391 M 31 CT 30.4
#> 7 ERI392 M 24 CT 21.7
#> 8 ERI79 W 26 CT 185.
#> 9 ERI81 M 52 CT 207.
#> 10 ERI83 M 25 CT 322.
#> # ℹ 26 more rows
#> # ℹ 11 more variables: metabolite_cholesterol <dbl>,
#> # metabolite_lipid_ch_3_1 <dbl>, metabolite_lipid_ch_3_2 <dbl>,
#> # metabolite_lipid_ch_2 <dbl>, metabolite_fa_ch_2_ch_2_coo <dbl>,
#> # metabolite_pufa <dbl>,
#> # metabolite_phosphatidylethanolamine <dbl>,
#> # metabolite_phosphatidycholine <dbl>, …
```

Since we’re using a function-oriented workflow and will use this code again later on, let’s convert both the “metabolite to snakecase” and “pivot to wider” code into their own functions, before moving them into the `R/functions.R` file.

```
column_values_to_snake_case <- function(data) {
  data %>%
    dplyr::mutate(metabolite = snakecase::to_snake_case(metabolite))
}
lipidomics %>%
  column_values_to_snake_case()
```

```
#> # A tibble: 504 × 6
#> code gender age class metabolite value
#> <chr> <chr> <dbl> <chr> <chr> <dbl>
#> 1 ERI109 M 25 CT tms_interntal_standard 208.
#> 2 ERI109 M 25 CT cholesterol 19.8
#> 3 ERI109 M 25 CT lipid_ch_3_1 44.1
#> 4 ERI109 M 25 CT lipid_ch_3_2 147.
#> 5 ERI109 M 25 CT cholesterol 27.2
#> 6 ERI109 M 25 CT lipid_ch_2 587.
#> 7 ERI109 M 25 CT fa_ch_2_ch_2_coo 31.6
#> 8 ERI109 M 25 CT pufa 29.0
#> 9 ERI109 M 25 CT phosphatidylethanolamine 6.78
#> 10 ERI109 M 25 CT phosphatidycholine 41.7
#> # ℹ 494 more rows
```

This on its own should work. *However*, the column we want to change might not always be called `metabolite`, or we might want to change it later. So, to make this function a bit more generic, we can use something called “curly-curly” (it looks like `{{ }}` when used) and “non-standard evaluation” (NSE).

We can use curly-curly (combined with `across()`) to apply `snakecase::to_snake_case()` to columns of our choice.

```
column_values_to_snake_case <- function(data, columns) {
  data %>%
    dplyr::mutate(dplyr::across({{ columns }}, snakecase::to_snake_case))
}
lipidomics %>%
  column_values_to_snake_case(metabolite)
```

```
#> # A tibble: 504 × 6
#> code gender age class metabolite value
#> <chr> <chr> <dbl> <chr> <chr> <dbl>
#> 1 ERI109 M 25 CT tms_interntal_standard 208.
#> 2 ERI109 M 25 CT cholesterol 19.8
#> 3 ERI109 M 25 CT lipid_ch_3_1 44.1
#> 4 ERI109 M 25 CT lipid_ch_3_2 147.
#> 5 ERI109 M 25 CT cholesterol 27.2
#> 6 ERI109 M 25 CT lipid_ch_2 587.
#> 7 ERI109 M 25 CT fa_ch_2_ch_2_coo 31.6
#> 8 ERI109 M 25 CT pufa 29.0
#> 9 ERI109 M 25 CT phosphatidylethanolamine 6.78
#> 10 ERI109 M 25 CT phosphatidycholine 41.7
#> # ℹ 494 more rows
```

Move this new function over into the `R/functions.R` file, add Roxygen documentation with the Palette (type “roxygen comment”), style the file using the Palette (type “style file”), `source()` the modified `R/functions.R` file with the Palette (type “source”), and add the new function above the `pivot_wider()` code in the `doc/learning.qmd` file.

```
#' Convert a column's character values to snakecase format.
#'
#' @param data The lipidomics dataset.
#' @param columns The column you want to convert into the snakecase format.
#'
#' @return A data frame.
#'
column_values_to_snake_case <- function(data, columns) {
  data %>%
    dplyr::mutate(dplyr::across({{ columns }}, snakecase::to_snake_case))
}
```

## 7.6 Exercise: Convert the pivot code into a function

Time: ~10 minutes.

Just like with the `mutate()` code, take the `pivot_wider()` code and convert it into a new function.

- Name the new function `metabolites_to_wider`.
- Include one argument in the new `function()`: `data`.
- Use `data %>%` at the beginning, like we did with `column_values_to_snake_case()`.
- Use `tidyr::` before the `pivot_wider()` function.
- Add the Roxygen documentation with the Palette (type “roxygen comment”).
- Move the function into the `R/functions.R` file.
- Replace the code in the `doc/learning.qmd` file to make use of the new function.

**Click for the solution**. Only click if you are struggling or are out of time.

```
#' Convert the metabolite long format into a wider one.
#'
#' @param data The lipidomics dataset.
#'
#' @return A wide data frame.
#'
metabolites_to_wider <- function(data) {
  data %>%
    tidyr::pivot_wider(
      names_from = metabolite,
      values_from = value,
      values_fn = mean,
      names_prefix = "metabolite_"
    )
}
```
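To check your solution, you could chain the two new functions together after `source()`-ing `R/functions.R`; a quick sketch:

```
# Tidy the metabolite names, then pivot to one row per person
lipidomics %>%
  column_values_to_snake_case(metabolite) %>%
  metabolites_to_wider()
```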

## 7.7 Using recipes to manage transformations

We’ve used dplyr and tidyr to start fixing some of the issues with the data. But we still have the third issue: How to make the results between metabolites comparable. That’s where we use recipes.

The first function is `recipe()`, and it takes two forms: with or without a formula. Remember the model formula we mentioned previously? Here is where we can use it to tell recipes about the model formula we intend to use, so it knows which variables to apply the chosen transformations to.

For a few reasons that will become clearer later, we won’t ultimately use the formula form of `recipe()`, but here is how it looks:

```
recipe(class ~ metabolite_cholesterol + age + gender, data = lipidomics_wide)
```

The alternative approach is to set “roles” with `update_role()`. Instead of using a formula and letting `recipe()` infer the outcome and predictors, we explicitly select which variables are which. This has some nice features that we will use later on.

```
recipe(lipidomics_wide) %>%
  update_role(metabolite_cholesterol, age, gender, new_role = "predictor") %>%
  update_role(class, new_role = "outcome")
```

The next “step” is to select a transformation function. There are many `step_*` functions (some shown below) available in recipes, and which one you use depends heavily on your data and your research question. So take your time thinking through which transformations you actually need for your analysis, rather than quickly applying one or more of them.

```
recipes::step_log()
recipes::step_scale()
recipes::step_normalize()
recipes::step_center()
recipes::step_sqrt()
```

There are many useful transformation functions available. For instance, if you often have to impute data, there are functions for that: type `recipes::step_impute_` in the Console and hit the Tab key to see a list of them. Or, if you have missing values, there’s also `recipes::step_naomit()`.

There are many transformations we could use for the `lipidomics` dataset, but we will use `step_normalize()` for this course. This function is useful because it centers each variable at zero and scales it so that one unit corresponds to one standard deviation of the original distribution. This makes it easier to compare values between variables. We can add it to the end of the recipe:

```
recipe(lipidomics_wide) %>%
  update_role(metabolite_cholesterol, age, gender, new_role = "predictor") %>%
  update_role(class, new_role = "outcome") %>%
  step_normalize(starts_with("metabolite_"))
```
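For reference, the normalization that `step_normalize()` applies to each selected column is the standard z-score,

\(x' = \frac{x - \bar{x}}{s}\)

where \(\bar{x}\) is the column’s mean and \(s\) is its standard deviation, both estimated from the data.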

The next thing to do is convert this into a function, using the same workflow we’ve been using (meaning the function goes into the `R/functions.R` script). We’ll use curly-curly again, since we might use a different metabolite later. Note that when adding all the `packagename::` prefixes to each function, the `starts_with()` function comes from the tidyselect package.

```
#' A transformation recipe to pre-process the data.
#'
#' @param data The lipidomics dataset.
#' @param metabolite_variable The column of the metabolite variable.
#'
#' @return A recipe object.
#'
create_recipe_spec <- function(data, metabolite_variable) {
  recipes::recipe(data) %>%
    recipes::update_role({{ metabolite_variable }}, age, gender, new_role = "predictor") %>%
    recipes::update_role(class, new_role = "outcome") %>%
    recipes::step_normalize(tidyselect::starts_with("metabolite_"))
}
```

And test it out:

```
recipe_specs <- lipidomics_wide %>%
  create_recipe_spec(metabolite_cholesterol)
recipe_specs
```

Style the file using the Palette (type “style file”), then commit the changes to the Git history with the Palette (type “commit”).

## 7.8 Fitting the model by combining the recipe, model definition, and data

We’ve now defined the model we want to use and created a recipe specification for the transformations. Now we can put them together and finally fit them to the data. This is done with the workflows package.

Why use this package, rather than simply run the statistical analysis and process the data as normal? When running multiple models (like we will do in the next section) that may require different data structures, the data transformation steps have to happen *right before* the data is fit to the model and need to be done on exactly the data used by the model. So if we have one data frame that we run multiple models on, but the transformation happens to the whole data frame, we could end up with issues due to how the transformations were applied. The workflows package keeps track of those things for us, so we can focus on the higher level thinking rather than on the small details of running the models.

The workflows package has a few main functions for combining the recipe with the model specs, as well as several for updating an existing workflow (which might be useful if you need to run many models of slightly different types). All model workflows need to start with `workflow()`, followed by two main functions: `add_model()` and `add_recipe()`. Can you guess what they do?

```
workflow() %>%
  add_model(log_reg_specs) %>%
  add_recipe(recipe_specs)
```

```
#> ══ Workflow ════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#>
#> ── Preprocessor ────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_normalize()
#>
#> ── Model ───────────────────────────────────────────────────────────
#> Logistic Regression Model Specification (classification)
#>
#> Computational engine: glm
```

While this code is already pretty concise, let’s convert it into a function to simplify things further. We’ll use the same function-oriented workflow as before, so the function ultimately goes inside the `R/functions.R` file.

```
#' Create a workflow object of the model and transformations.
#'
#' @param model_specs The model specs
#' @param recipe_specs The recipe specs
#'
#' @return A workflow object
#'
create_model_workflow <- function(model_specs, recipe_specs) {
  workflows::workflow() %>%
    workflows::add_model(model_specs) %>%
    workflows::add_recipe(recipe_specs)
}
```

Instead of using the previously created objects, let’s start the model creation from scratch:

```
model_workflow <- create_model_workflow(
  logistic_reg() %>%
    set_engine("glm"),
  lipidomics_wide %>%
    create_recipe_spec(metabolite_cholesterol)
)
model_workflow
```

```
#> ══ Workflow ════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#>
#> ── Preprocessor ────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_normalize()
#>
#> ── Model ───────────────────────────────────────────────────────────
#> Logistic Regression Model Specification (classification)
#>
#> Computational engine: glm
```

Now we can do the final thing: fitting the data to the model with `fit()`! 🙌
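Assuming we store the result in an object called `fitted_model` (which we use below), the fitting step looks like:

```
# Fit the workflow (recipe + model) to the wide dataset
fitted_model <- model_workflow %>%
  fit(lipidomics_wide)
fitted_model
```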

```
#> ══ Workflow [trained] ══════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#>
#> ── Preprocessor ────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_normalize()
#>
#> ── Model ───────────────────────────────────────────────────────────
#>
#> Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
#>
#> Coefficients:
#> (Intercept) genderW
#> 0.1053 -0.7066
#> age metabolite_cholesterol
#> 0.0069 1.0880
#>
#> Degrees of Freedom: 35 Total (i.e. Null); 32 Residual
#> Null Deviance: 49.9
#> Residual Deviance: 42.8 AIC: 50.8
```

This gives us a lot of information, but what we are mostly interested in is the model estimates themselves. While this `fitted_model` object contains a lot of additional information, workflows thankfully has a function to extract what we want. In this case, it is the `extract_fit_parsnip()` function.

```
fitted_model %>%
  extract_fit_parsnip()
```

```
#> parsnip model object
#>
#>
#> Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
#>
#> Coefficients:
#> (Intercept) genderW
#> 0.1053 -0.7066
#> age metabolite_cholesterol
#> 0.0069 1.0880
#>
#> Degrees of Freedom: 35 Total (i.e. Null); 32 Residual
#> Null Deviance: 49.9
#> Residual Deviance: 42.8 AIC: 50.8
```

To get this information in a tidier format, we use another function: `tidy()`. This function comes from the broom package, which is part of tidymodels, but we should explicitly add it to the dependencies:

```
use_package("broom")
```

Then, we add the `tidy()` function to our model using the `%>%` pipe. Since we are using a logistic regression model, we need to consider how we want the estimates to be presented, which probably depends on how we want to visualize our results. If we set `exponentiate = TRUE` in `tidy()`, the estimates will be odds ratios; if we set `exponentiate = FALSE`, we will get the log odds ratios (the beta coefficients). Here we choose `exponentiate = TRUE`:

```
fitted_model %>%
  extract_fit_parsnip() %>%
  tidy(exponentiate = TRUE)
```

```
#> # A tibble: 4 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 1.11 1.29 0.0817 0.935
#> 2 genderW 0.493 0.779 -0.907 0.365
#> 3 age 1.01 0.0377 0.183 0.855
#> 4 metabolite_cholesterol 2.97 0.458 2.38 0.0175
```

We now have a data frame of our model results! Like we did with the `workflow()` code that we converted into a function, we do the same thing here: make another function (and move it to `R/functions.R`)! 😛

```
#' Create a tidy output of the model results.
#'
#' @param workflow_fitted_model The model workflow object that has been fitted.
#'
#' @return A data frame.
#'
tidy_model_output <- function(workflow_fitted_model) {
  workflow_fitted_model %>%
    workflows::extract_fit_parsnip() %>%
    broom::tidy(exponentiate = TRUE)
}
```

Replace the code in the `doc/learning.qmd` file to use the new function:

```
fitted_model %>%
  tidy_model_output()
```

```
#> # A tibble: 4 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 1.11 1.29 0.0817 0.935
#> 2 genderW 0.493 0.779 -0.907 0.365
#> 3 age 1.01 0.0377 0.183 0.855
#> 4 metabolite_cholesterol 2.97 0.458 2.38 0.0175
```

If we revise the code so it is one pipe, it would look like:

```
create_model_workflow(
  logistic_reg() %>%
    set_engine("glm"),
  lipidomics_wide %>%
    create_recipe_spec(metabolite_cholesterol)
) %>%
  fit(lipidomics_wide) %>%
  tidy_model_output()
```

```
#> # A tibble: 4 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 1.11 1.29 0.0817 0.935
#> 2 genderW 0.493 0.779 -0.907 0.365
#> 3 age 1.01 0.0377 0.183 0.855
#> 4 metabolite_cholesterol 2.97 0.458 2.38 0.0175
```

Let’s briefly cover what these columns and values mean.

Before ending, open the Git interface and commit the changes you made with the Palette (type “commit”). Then push your changes up to GitHub.

## 7.9 Summary

- Create research questions that (ideally) are structured to mimic how the statistical analysis will be done, preferably in a “formula” style like \(y = x_1 + x_2 + \ldots + \text{error}\), or in a diagram style with links connecting variables.
- Statistical analyses, while requiring some trial and error, are surprisingly structured in their workflow and steps. Use this structure to help guide you through the tasks involved in running analyses.
- Use parsnip functions to define the model you want to use, like `logistic_reg()` for logistic regression, and set the computational “engine” with `set_engine()`.
- Use recipes functions to set up the data transformation steps necessary to effectively run the statistical analysis, like adding variable “roles” (outcome vs predictor) with `update_role()` and adding transformation steps with any of the many `step_*` functions.
- Use workflows functions to develop an analysis `workflow()` that combines the defined model with `add_model()`, the transformation steps with `add_recipe()`, and the data with `fit()`.
- Use broom to `tidy()` the model output, extracted using `extract_fit_parsnip()`, to get a data frame of the estimates and standard errors for the variables in the model.