12 Download the workshop data

To demonstrate the concepts in this workshop, we’re going to use a real-world, openly licensed metabolomics dataset (1).

1. Herance JR, Ciudin A, Lamas-Domingo R, Aparicio-Gómez C, Hernández C, Simó R, et al. Normalized NMR integration values from the metabolomic and lipidomic analysis of red blood cell extract from type 1 diabetes patients and matched healthy subjects. Zenodo; 2022.

Similar to the intermediate workshop, we will follow the principle of building a reproducible pipeline, from raw data to the report on the analysis. We’ll also follow the principle of keeping the raw data raw and using code to tidy it up. The ultimate goal is to have a record of exactly how we went from raw data to the results in the paper. Unlike the intermediate workshop, where you had to work through each step of the script yourself, in this workshop you only need to copy and paste the code that will download and minimally process the dataset into an initially workable state. If you want to learn more about how to take a raw dataset and process it into a format that is more suitable for analysis, check out the intermediate workshop.

Inside the data-raw/ folder, we are going to write R code that downloads a dataset, processes it into a tidier format, and saves it into the data/ folder. This is the start of our analysis pipeline. As a first step, we need to create the script that will do these things. While in your AdvancedR3 R Project, go to the Console pane in RStudio and type out:

Console

usethis::use_data_raw("nmr-omics")

This function creates a new folder called data-raw/ and an R script called nmr-omics.R in that folder. The data-raw/ folder is where we will store the raw, original metabolomics data that we’ll get from the website. If you go to the website with the dataset, you’ll notice (when you scroll down) that there are three files: a README file, a metabolomics .xlsx file, and a lipidomics .xlsx file. For now, we only want the README file and the lipidomics dataset.
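To make the file-related effects of usethis::use_data_raw() concrete, the sketch below imitates them with base R in a temporary folder. It is only an approximation: the real function also puts some template code into the script, opens it for you, and does a bit of extra project housekeeping. The names project_dir and script_path are made up for this illustration.

```r
# A rough, base R imitation of the files that `use_data_raw("nmr-omics")`
# creates. `project_dir` is a throwaway folder standing in for AdvancedR3.
project_dir <- file.path(tempdir(), "AdvancedR3-sketch")
dir.create(
  file.path(project_dir, "data-raw"),
  recursive = TRUE,
  showWarnings = FALSE
)

# An empty script inside data-raw/ (the real function adds template code).
script_path <- file.path(project_dir, "data-raw", "nmr-omics.R")
file.create(script_path)

# List what now exists inside the pretend project.
list.files(project_dir, recursive = TRUE)
```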

The R script should have opened up for you. If it hasn’t, go into the data-raw/ folder and open up the new nmr-omics.R script. The first thing to do is delete all the code in the script. Then, copy and paste the code below into the script.

data-raw/nmr-omics.R

# Load necessary packages -------------------------------------------------

library(readxl)
library(dplyr)
library(tidyr)
library(here)

# Download dataset --------------------------------------------------------

# From DOI: 10.5281/zenodo.6597902
# Direct URL: https://zenodo.org/record/6597902

# Get both README and the Lipidomics dataset.
nmr_omics_dir <- here("data-raw/nmr-omics")
fs::dir_create(nmr_omics_dir)

download.file(
  "https://zenodo.org/record/6597902/files/README.txt",
  destfile = here(nmr_omics_dir, "README.txt")
)

download.file(
  "https://zenodo.org/record/6597902/files/NMR_Integration_Data_Lipidomics.xlsx",
  destfile = here(nmr_omics_dir, "lipidomics.xlsx"),
  mode = "wb"
)

# Wrangle dataset into tidy long format -----------------------------------

lipidomics_full <- read_xlsx(
  here(nmr_omics_dir, "lipidomics.xlsx"),
  col_names = paste0("V", 1:40)
)

# There are actually two sets of data in this dataset that we need to split:
# - Lipidomic data
# - Subject level data

# Keep only lipidomic values
lipidomics_only <- lipidomics_full |>
  # Want to remove columns 2, 3, and 4 since they are "limits"
  # (we don't need them for this workshop)
  select(-2:-4) |>
  # Remove the subject data rows
  slice(-1:-4) |>
  mutate(across(-V1, as.numeric)) |>
  # Make it so the metabolite values are all in one column,
  # which will make it easier to join with the subject data later.
  pivot_longer(-V1) |>
  rename(metabolite = V1) |>
  # Fix spelling of 'internal' (misspelled as 'inernal' in the raw data)
  mutate(metabolite = stringr::str_replace(metabolite, "inernal", "internal"))

# Keep only subject data
subject_only <- lipidomics_full |>
  # Remove the metabolite name column and the first two limit columns,
  # don't need them for this (V4 is kept because it holds the
  # variable names in these rows)
  select(-1:-3) |>
  # Keep only the subject data rows
  slice(1:4) |>
  pivot_longer(cols = -V4) |>
  pivot_wider(names_from = V4, values_from = value) |>
  # There is a weird "" before some of the numbers, so we have to
  # extract just the number first before converting to numeric.
  mutate(Age = as.numeric(stringr::str_extract(Age, "\\d+"))) |>
  rename_with(snakecase::to_snake_case)

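lipidomics <- full_join(
  subject_only,
  lipidomics_only,
  # `name` is the only shared column (the original column labels),
  # so state it explicitly instead of letting dplyr guess.
  by = "name"
) |>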
  # Don't need anymore
  select(-name)

# Save to `data/` ---------------------------------------------------------

readr::write_csv(lipidomics, here::here("data/lipidomics.csv"))

Since this is an advanced workshop, you can run the lines one at a time to see what each of them does (or source() the whole script at once). The comments give guidance on what the code is doing and why. In the end, though, the only thing that matters is that you run all the code so that the lipidomics dataset is saved as data/lipidomics.csv. The created files should look like this:

AdvancedR3
├── data
│   └── lipidomics.csv
└── data-raw
    ├── nmr-omics
    │   ├── README.txt
    │   └── lipidomics.xlsx
    └── nmr-omics.R

And when using readr::read_csv(), the created dataset in data/lipidomics.csv should look like:

readr::read_csv(here::here("data/lipidomics.csv"))

Rows: 504 Columns: 6
── Column specification ────────────────────────────────────────────────
Delimiter: ","
chr (4): code, gender, class, metabolite
dbl (2): age, value

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 504 × 6
   code   gender   age class metabolite                value
   <chr>  <chr>  <dbl> <chr> <chr>                     <dbl>
 1 ERI109 M         25 CT    TMS (internal standard)  208.  
 2 ERI109 M         25 CT    Cholesterol               19.8 
 3 ERI109 M         25 CT    Lipid CH3- 1              44.1 
 4 ERI109 M         25 CT    Lipid CH3- 2             147.  
 5 ERI109 M         25 CT    Cholesterol               27.2 
 6 ERI109 M         25 CT    Lipid -CH2-              587.  
 7 ERI109 M         25 CT    FA -CH2CH2COO-            31.6 
 8 ERI109 M         25 CT    PUFA                      29.0 
 9 ERI109 M         25 CT    Phosphatidylethanolamine   6.78
10 ERI109 M         25 CT    Phosphatidycholine        41.7 
# ℹ 494 more rows
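The message above suggests specifying the column types. A minimal sketch of what that could look like is below. So that it runs without the real file, it first writes two rows (copied from the printed output above, so the values are rounded) into a temporary file; toy_csv and toy_lipidomics are names made up for this illustration. In the project itself you would point read_csv() at here::here("data/lipidomics.csv") instead.

```r
library(readr)

# Two rows copied (rounded, as printed above) into a temporary file, so this
# sketch runs without the real `data/lipidomics.csv`.
toy_csv <- tempfile(fileext = ".csv")
write_lines(
  c(
    "code,gender,age,class,metabolite,value",
    "ERI109,M,25,CT,TMS (internal standard),208",
    "ERI109,M,25,CT,Cholesterol,19.8"
  ),
  toy_csv
)

# Stating the column types up front documents what we expect the data to
# contain and also quiets the column specification message.
toy_lipidomics <- read_csv(
  toy_csv,
  col_types = cols(
    code = col_character(),
    gender = col_character(),
    age = col_double(),
    class = col_character(),
    metabolite = col_character(),
    value = col_double()
  )
)
toy_lipidomics
```

If you only want to silence the message and let readr keep guessing the types, show_col_types = FALSE is the shorter option.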

Take a look through the downloaded data-raw/nmr-omics/README.txt and data-raw/nmr-omics/lipidomics.xlsx files, as well as the created data/lipidomics.csv to begin better understanding the dataset.
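If the reshaping steps in the script are hard to follow on the real spreadsheet, it can help to run the same pattern on a tiny example. The sketch below uses made-up data in a simplified version of the spreadsheet’s layout (subject rows on top, one metabolite per row underneath, one column per subject). The object names, the subject codes, and the stray apostrophe in front of one age are all assumptions for illustration, not values from the real dataset.

```r
library(dplyr)
library(tidyr)

# Made-up data: rows 1-2 describe the subjects, rows 3-4 hold metabolites.
# Everything is text because each column mixes words and numbers.
toy_full <- tibble(
  V1 = c("Code", "Age", "Cholesterol", "PUFA"),
  V2 = c("S1", "'25", "19.8", "29.0"),
  V3 = c("S2", "31", "27.2", "30.5")
)

# Subject data: flip it so each subject becomes one row.
toy_subjects <- toy_full |>
  slice(1:2) |>
  pivot_longer(cols = -V1) |>
  pivot_wider(names_from = V1, values_from = value) |>
  # Pull out the digits before converting, since "'25" is not a number.
  mutate(Age = as.numeric(stringr::str_extract(Age, "\\d+"))) |>
  rename_with(tolower)

# Metabolite values: one row per metabolite per subject.
toy_values <- toy_full |>
  slice(-1:-2) |>
  mutate(across(-V1, as.numeric)) |>
  pivot_longer(-V1) |>
  rename(metabolite = V1)

# `name` holds the original column label (V2, V3), which links the two.
toy_tidy <- full_join(toy_subjects, toy_values, by = "name") |>
  select(-name)
toy_tidy
```

The result has one row for each subject and metabolite combination, which is the same long format as data/lipidomics.csv.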

We are using Git to track changes made to files in our project. The original metabolomics dataset is stored on Zenodo, so we don’t actually need to keep the raw data files in our Git history. So let’s tell Git to ignore the files created in the data-raw/nmr-omics/ folder. In the Console, type out the code below. You only need to do this once.

Console

usethis::use_git_ignore("data-raw/nmr-omics/")
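Behind the scenes, this adds a line to the project’s .gitignore file. The base R sketch below imitates that on a temporary file, including the detail that a rule is only added when it is not already there (which, to our understanding, is also how usethis behaves). The function add_ignore_rule() and the path gitignore_path are hypothetical names for this illustration and not part of usethis.

```r
# Base R imitation of adding an ignore rule. `gitignore_path` points at a
# temporary file, not at your real `.gitignore`.
gitignore_path <- file.path(tempdir(), "gitignore-sketch")
writeLines(c(".Rproj.user", ".Rhistory"), gitignore_path)

add_ignore_rule <- function(path, rule) {
  existing <- readLines(path)
  # Only append the rule when it is missing, so reruns change nothing.
  if (!rule %in% existing) {
    writeLines(c(existing, rule), path)
  }
  invisible(readLines(path))
}

# Running it twice still leaves a single copy of the rule.
add_ignore_rule(gitignore_path, "data-raw/nmr-omics/")
add_ignore_rule(gitignore_path, "data-raw/nmr-omics/")
readLines(gitignore_path)
```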

Next, run this command to check that everything is set up correctly. You will need to paste this output into the survey at the end.

Console

r3::check_project_setup_advanced()

Show folders and files of project:
• Please copy and paste this output into the survey question:
/home/luke/Desktop/AdvancedR3
├── AdvancedR3.Rproj
├── DESCRIPTION
├── R
│   ├── README.md
│   └── functions.R
├── README.md
├── TODO.md
├── data
│   ├── README.md
│   └── lipidomics.csv
├── data-raw
│   ├── README.md
│   ├── nmr-omics
│   │   ├── README.txt
│   │   └── lipidomics.xlsx
│   └── nmr-omics.R
└── doc
    ├── README.md
    ├── learning.qmd
    └── report.Rmd

The output should look similar to the text above. If it doesn’t, start over by deleting everything in the data-raw/ folder except for the data-raw/nmr-omics.R script and then rerunning the script. If your output does look similar, then copy and paste it into the survey question at the end.
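If you do need to start over, one way to do the clean-up from the Console is sketched below. To stay harmless, the sketch builds a pretend data-raw/ folder in a temporary location and cleans that; data_raw_dir is a made-up name, and in your real project you would use here::here("data-raw") in its place.

```r
# Pretend project in a temporary folder, with the script plus some
# downloaded files next to it.
data_raw_dir <- file.path(tempdir(), "reset-sketch", "data-raw")
dir.create(
  file.path(data_raw_dir, "nmr-omics"),
  recursive = TRUE,
  showWarnings = FALSE
)
file.create(file.path(data_raw_dir, "nmr-omics.R"))
file.create(file.path(data_raw_dir, "nmr-omics", "lipidomics.xlsx"))

# Everything directly inside data-raw/ except the script gets removed.
contents <- list.files(data_raw_dir, full.names = TRUE)
to_delete <- contents[basename(contents) != "nmr-omics.R"]
unlink(to_delete, recursive = TRUE)

# Only the script should be left.
list.files(data_raw_dir, recursive = TRUE)

# In the real project, the next step would be rerunning the script, e.g.:
# source(here::here("data-raw/nmr-omics.R"))
```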

12.1 Code used in session

This lists some, but not all, of the code used in the section. Some code is incorporated into Markdown content, so it is harder to list automatically here in a code chunk. The code below also includes the code from the exercises.

usethis::use_data_raw("nmr-omics")
usethis::use_git_ignore("data-raw/nmr-omics/")