13 Download the workshop data

We’ll use a real-world, openly licensed dataset of metabolomics data to demonstrate the concepts in this workshop (1).

Similar to the intermediate workshop, we will follow the principle of building a reproducible pipeline, from raw data to the report with the analysis. We’ll also follow the principle of keeping the raw data raw and using code to tidy it up. The ultimate goal is to have a record of exactly how we went from raw data to the results in the paper. In the pre-workshop tasks of the intermediate workshop, we had you write each line of code to download and clean the data as part of the learning process. In this workshop, however, you only need to copy and paste the code that downloads and minimally processes the dataset into an initially workable state. That’s because we assume at this point that you understand enough R to know how to do basic downloading and processing of data. If you want to learn more about how to take a raw dataset and process it into a format that is more suitable for analysis, check out the intermediate workshop.

Inside the data-raw/ folder, we are going to write R code that downloads a dataset, processes it into a tidier format, and saves it into the data/ folder. This is the start of our analysis pipeline. As a first step, we need to create the script that does these steps. While in your AdvancedR3 R Project, go to the Console pane in RStudio and type out:

Console

usethis::use_data_raw("nmr-omics")

This function creates a new folder called data-raw/ and an R script called nmr-omics.R inside that folder. The data-raw/ folder is where we will store the raw, original metabolomics data that we’ll get from the website. If you go to the website with the dataset, you’ll notice (when you scroll down) that there are three files: a README file, a metabolomics .xlsx file, and a lipidomics .xlsx file. For now, we only want the README file and the lipidomics dataset.

The R script should have opened up for you. If it didn’t, go into the data-raw/ folder and open up the new nmr-omics.R script. The first thing to do is delete all the code in the script. Then, copy and paste the code below into the script. Since this is an advanced workshop, you can run the lines one at a time and see what they do on your own (or source() them all at once). The comments provided give guidance on what the code is doing and why.

data-raw/nmr-omics.R

# Load necessary packages -------------------------------------------------

library(readxl)
library(dplyr)
library(tidyr)
library(stringr)
library(here)

# Download dataset --------------------------------------------------------

# From DOI: 10.5281/zenodo.6597902
# Direct URL: https://zenodo.org/record/6597902

# Create a variable to store the path to `data-raw/nmr-omics/`,
# then create a directory so we can store the raw data there.
nmr_omics_dir <- here("data-raw/nmr-omics")
fs::dir_create(nmr_omics_dir)

# Download the README.
download.file(
  "https://zenodo.org/record/6597902/files/README.txt",
  destfile = here(nmr_omics_dir, "README.txt")
)

# Download the Lipidomics dataset. If you want, you can open this file
# to see what it looks like and to better understand why we need to
# process it the way we do below.
download.file(
  "https://zenodo.org/record/6597902/files/NMR_Integration_Data_Lipidomics.xlsx",
  destfile = here(nmr_omics_dir, "lipidomics.xlsx"),
  mode = "wb"
)

# Wrangle dataset into tidy long format -----------------------------------

# Get all the data from the lipidomics sheet.
# Column names are not provided, so we just use generic names
# for now.
lipidomics_full <- read_xlsx(
  here(nmr_omics_dir, "lipidomics.xlsx"),
  col_names = paste0("V", 1:40)
)

# There are actually two sets of data in this dataset that we need to split:
# - Subject level data (first four rows).
# - Lipidomic data (all other rows).

# Keep only lipidomic values
lipidomics_only <- lipidomics_full |>
  # Want to remove columns 2, 3, and 4 since they are "limits"
  # (we don't need them for this workshop).
  select(-2:-4) |>
  # Remove the subject data rows (the first four rows).
  slice(-1:-4) |>
  # Convert all the metabolite values to numeric (they were read as
  # character because of the headers in the first four rows).
  mutate(across(-V1, as.numeric)) |>
  # Make it so the metabolite values are all in one column, rather
  # than spread across multiple columns. This will make it easier to
  # join with the subject data later. Don't include V1, since it
  # contains the metabolite names.
  pivot_longer(-V1) |>
  rename(metabolite = V1) |>
  # Fix spelling of 'internal' (from `interntal`)
  mutate(metabolite = str_replace(metabolite, "interntal", "internal"))

# Keep only subject data, which is the first four rows of the spreadsheet.
subject_only <- lipidomics_full |>
  # Remove the metabolite name column (V1) and the limit columns
  # (V2 and V3), as we don't need them for this part.
  select(-1:-3) |>
  # Keep only the subject data rows (first four rows).
  slice(1:4) |>
  # V4 contains the variable names, so we pivot it longer so each
  # data value is aligned with the variable name.
  pivot_longer(cols = -V4) |>
  # We want to pivot wider, so that each variable name (right now
  # duplicated multiple times in the long format) becomes its own
  # column. That way we have 4 columns (from the first four rows):
  # Code (subject ID), Gender, Age, Class (treatment group).
  pivot_wider(names_from = V4, values_from = value) |>
  # There is a weird "" before some of the numbers, so we have to
  # extract just the number first before converting to numeric.
  mutate(Age = as.numeric(str_extract(Age, "\\d+"))) |>
  # Align naming so variables are in `snake_case`.
  rename_with(snakecase::to_snake_case)

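# Now we join the two datasets together to get a tidy lipidomics dataset.
# Both datasets share the `name` column (V5, V6, ...) created by
# `pivot_longer()`, so we join by it explicitly.
lipidomics <- full_join(
  subject_only,
  lipidomics_only,
  by = "name"
) |>
  # `name` was only needed for the join, so we drop it.
  select(-name)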

# Save to `data/` ---------------------------------------------------------

readr::write_csv(lipidomics, here::here("data/lipidomics.csv"))

After running this code (either line-by-line or via source()), you should now have the lipidomics dataset saved in the data/lipidomics.csv file. The created files should look like this:

AdvancedR3
├── data
│   └── lipidomics.csv
└── data-raw
    ├── nmr-omics
    │   ├── README.txt
    │   └── lipidomics.xlsx
    └── nmr-omics.R
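
The pivot longer, pivot wider, then join sequence in the script can be hard to picture from the comments alone. Below is a minimal sketch of the same idea on a made-up, four-row stand-in for the spreadsheet. The object names (`toy_full`, `toy_subjects`, `toy_values`, `toy_tidy`) and all the values are invented for illustration and are not part of the workshop data. It assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the spreadsheet (invented values): the first two rows
# hold subject data, the remaining rows hold metabolite values. Everything
# is character, just like after reading the real file.
toy_full <- tibble(
  V1 = c(NA, NA, "Cholesterol", "PUFA"),
  V2 = c("Code", "Age", NA, NA),
  V3 = c("S1", "25", "19.8", "29.0"),
  V4 = c("S2", "31", "27.2", "31.6")
)

# Subject data: pivot longer so each value sits next to its variable name,
# then pivot wider so each variable becomes its own column.
toy_subjects <- toy_full |>
  select(-V1) |>
  slice(1:2) |>
  pivot_longer(cols = -V2) |>
  pivot_wider(names_from = V2, values_from = value) |>
  mutate(Age = as.numeric(Age))

# Metabolite data: one row per metabolite and subject column.
toy_values <- toy_full |>
  select(-V2) |>
  slice(-1:-2) |>
  mutate(across(-V1, as.numeric)) |>
  pivot_longer(-V1) |>
  rename(metabolite = V1)

# Both tables carry the original column name in `name`, which links each
# subject to their metabolite values.
toy_tidy <- full_join(toy_subjects, toy_values, by = "name") |>
  select(-name)

toy_tidy
```

Each subject ends up with one row per metabolite, which is the same long shape the real script produces.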

The created dataset in data/lipidomics.csv should look like this when you load it with readr::read_csv():

readr::read_csv(here::here("data/lipidomics.csv"))

Rows: 504 Columns: 6
── Column specification ────────────────────────────────────────────────
Delimiter: ","
chr (4): code, gender, class, metabolite
dbl (2): age, value

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 504 × 6
   code   gender   age class metabolite                value
   <chr>  <chr>  <dbl> <chr> <chr>                     <dbl>
 1 ERI109 M         25 CT    TMS (internal standard)  208.  
 2 ERI109 M         25 CT    Cholesterol               19.8 
 3 ERI109 M         25 CT    Lipid CH3- 1              44.1 
 4 ERI109 M         25 CT    Lipid CH3- 2             147.  
 5 ERI109 M         25 CT    Cholesterol               27.2 
 6 ERI109 M         25 CT    Lipid -CH2-              587.  
 7 ERI109 M         25 CT    FA -CH2CH2COO-            31.6 
 8 ERI109 M         25 CT    PUFA                      29.0 
 9 ERI109 M         25 CT    Phosphatidylethanolamine   6.78
10 ERI109 M         25 CT    Phosphatidycholine        41.7 
# ℹ 494 more rows

Take a look through the downloaded data-raw/nmr-omics/README.txt and data-raw/nmr-omics/lipidomics.xlsx files, as well as the created data/lipidomics.csv to begin understanding the dataset.
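
If you would rather check the result with code than by eye, something like the sketch below could help. `check_lipidomics()` is a hypothetical helper written for this illustration, not a function from the workshop or any package, and the toy data frame uses invented values that only mimic the column layout shown above.

```r
# Hypothetical helper (not part of the workshop code): summarise a tidy
# lipidomics data frame so it can be compared against the printed output.
check_lipidomics <- function(data) {
  expected_columns <- c("code", "gender", "age", "class", "metabolite", "value")
  list(
    has_expected_columns = identical(names(data), expected_columns),
    n_rows = nrow(data),
    n_subjects = length(unique(data$code)),
    n_missing_values = sum(is.na(data$value))
  )
}

# Toy data with the same columns as `data/lipidomics.csv` (invented values).
toy_lipidomics <- data.frame(
  code = c("S1", "S1", "S2", "S2"),
  gender = c("M", "M", "F", "F"),
  age = c(25, 25, 31, 31),
  class = c("CT", "CT", "T1", "T1"),
  metabolite = c("Cholesterol", "PUFA", "Cholesterol", "PUFA"),
  value = c(19.8, 29.0, 27.2, NA)
)

toy_check <- check_lipidomics(toy_lipidomics)
toy_check

# With the real file, run from inside the project:
# check_lipidomics(readr::read_csv(here::here("data/lipidomics.csv")))
```

For the real dataset, `n_rows` should match the 504 rows reported by readr::read_csv() above.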

We are using Git to track changes made to files in our project. The original metabolomics dataset is stored on Zenodo, so we don’t actually need to keep the raw data files in our Git history. So let’s tell Git to ignore the files created in the data-raw/nmr-omics/ folder. In the Console, type out the code below. You only need to do this once.

Console

usethis::use_git_ignore("data-raw/nmr-omics/")
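
To confirm that the folder was added, you can open the .gitignore file and look for the new line, or check it with code. The sketch below assumes the pattern is written to .gitignore as its own line, exactly as it was passed to the function. `is_git_ignored()` is a hypothetical helper made up for this illustration, and the demonstration uses a temporary file rather than your real .gitignore.

```r
# Hypothetical helper: check whether a pattern is listed, as its own line,
# in a .gitignore file.
is_git_ignored <- function(pattern, gitignore_path = ".gitignore") {
  if (!file.exists(gitignore_path)) {
    return(FALSE)
  }
  pattern %in% trimws(readLines(gitignore_path, warn = FALSE))
}

# Demonstration on a temporary file standing in for `.gitignore`.
demo_gitignore <- tempfile()
writeLines(c(".Rproj.user", "data-raw/nmr-omics/"), demo_gitignore)

ignored <- is_git_ignored("data-raw/nmr-omics/", demo_gitignore)
not_ignored <- is_git_ignored("data/", demo_gitignore)
ignored
not_ignored

# In the project itself, run from the project root:
# is_git_ignored("data-raw/nmr-omics/")
```

This only looks for an exact line match. Git itself also supports wildcards and negation in .gitignore, which this sketch does not try to interpret.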

Next, run this command to check that everything is set up correctly. You will need to paste this output into the survey at the end.

Console

r3::check_project_setup_advanced()

Show folders and files of project:
• Please copy and paste this output into the survey question:
/home/luke/Desktop/AdvancedR3
├── AdvancedR3.Rproj
├── .gitignore
├── DESCRIPTION
├── R/
│   ├── README.md
│   └── functions.R
├── README.md
├── TODO.md
├── data/
│   ├── README.md
│   └── lipidomics.csv
├── data-raw/
│   ├── README.md
│   ├── nmr-omics
│   │   ├── README.txt
│   │   └── lipidomics.xlsx
│   └── nmr-omics.R
└── docs/
    ├── README.md
    └── learning.qmd

Important

If you encounter an error that says something like “Error in gh::gh()”, you can (potentially) solve it by running gitcreds::gitcreds_set() in the Console as you did in Chapter 9 and then running the r3::check_project_setup_advanced() command again.

The output should look similar to the text above. If it doesn’t, start over by deleting everything in the data-raw/ folder except for the data-raw/nmr-omics.R script and then re-running the script. If your output looks similar to the above, then copy and paste the output into the survey question at the end.

13.1 Automatically adhere to a style guide

We’ve covered how to adhere to a style guide in both the introductory and intermediate workshops, because it is such a useful and powerful way to easily write more readable code. We’ll expand on it in this workshop because it fits precisely with the theme of collaboration. That’s because, when you’re working on your own and don’t need to worry about anyone seeing your code, there’s a natural temptation to write your code like you might write notes to yourself: scribbled and scrawled down quickly.

But when working with others, how your code looks can greatly impact how quickly and easily others are able to read and interpret it. Multiply this natural temptation for quick (and potentially sloppy) coding by the number of collaborators on a project, and you can imagine how easy it is for code to massively “drift” towards being poorly formatted, especially when deadlines are approaching.

That’s when “linters” or “stylers” (types of static code analysis tools) become very useful. They scan your code for common mistakes or syntax issues, either listing them for you to fix or automatically fixing them for you. Linters are especially useful in ensuring the code is consistently formatted across the project when collaborating with someone who is less experienced with coding or who contributes occasionally.

To ensure consistent styling without needing to run checks manually, automatic linting/styling can be a great solution. That’s where the styler package comes in!

There are only a few functions in styler that we need to use. The first, styler::style_file(), styles a single file with R code in it. However, an easier way to style the file we’re currently working on is the “Style active file” RStudio addin. We use it by opening the Command Palette (Ctrl-Shift-P) and typing “style file”, which should show the “Style active file” option. This is what we will do frequently throughout the workshop.

Let’s try it out. While inside the data-raw/nmr-omics.R file, use the Palette (Ctrl-Shift-P, then type “style file”) to style the file. Notice that no changes were made since the file is already tidy.

Note

You will probably be asked to install a package called miniUI. Click “Yes”.

If you want to run styler on all the files, you can use:

Console

styler::style_dir()
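
Since the workshop script is already tidy, styling it shows no visible change. To see what styler does without touching any file, you can pass a string of code to styler::style_text(). The snippet below is a small sketch with a made-up line of messy code, assuming styler is installed and using its default style.

```r
# A deliberately cramped line of code, invented for this illustration.
messy_code <- "x<-c(1,2,3)"

# `style_text()` returns the styled code instead of modifying a file.
styled_code <- styler::style_text(messy_code)
styled_code
```

The styled version adds spaces around the assignment operator and after the commas, while the code itself behaves the same.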

13.2 Code used in session

This lists some, but not all, of the code used in the section. Some code is incorporated into Markdown content, so is harder to automatically list here in a code chunk. The code below also includes the code from the exercises.

usethis::use_data_raw("nmr-omics")
usethis::use_git_ignore("data-raw/nmr-omics/")
styler::style_dir()