3  Pre-course tasks

In order to participate in this course, you must complete the pre-course tasks in this section and the survey at the end. These tasks are designed to make it easier for everyone to start the course with everything ready to go. For some of the tasks, you might not understand why you need to do them, but you will likely understand why once the course begins.

Depending on your skills and knowledge, these tasks could take between 5 and 7 hours to finish, so we suggest planning a full day to complete them. Depending on your institution and how it handles installing software on work computers, you might also have to contact IT very early to make sure everything is properly installed and set up.

3.1 List of tasks

Here’s a quick overview of the tasks you need to do. Specific details about them are found as you work through this section.

  1. Read the learning objectives in Section 3.2 for the pre-course tasks.
  2. Read about how to read this website in Section 3.3.
  3. Install the necessary programs and the right versions of those programs in Section 3.4. For some people, depending on their institution, this task can take the longest because you have to contact your IT department to install these programs.
  4. Install the necessary R packages in Section 3.5.
  5. Correctly set up Git on your computer in Section 3.6, if you haven’t done this already from previous courses. If you haven’t used Git before, this task could take a while because of the reading.
  6. Run a check with r3::check_setup() to see if everything works. You’ll later need to paste this output into the survey.
  7. Create an R Project in Section 3.7, along with the folder and file setup.
  8. Create a Quarto file in Section 3.8.
  9. Write (well, mostly copy and paste) R code in Section 3.9 to download the data and save it to your computer. This task will probably take 30-60 minutes, depending on how interested you are in exploring the data.
  10. Run a check using r3::check_project_setup_advanced() to see that everything is as expected. You’ll later need to paste this output into the survey.
  11. Read about the basic course details in Section 3.10.
  12. Read the syllabus in Chapter 1.
  13. Read the Code of Conduct.
  14. Complete the pre-course survey. This survey is pretty quick, maybe ~10 minutes.

Check each section for exact details on completing these tasks.

3.2 Learning objective

In general, these pre-course tasks are meant to help prepare you for the course and make sure everything is set up properly so the first session runs smoothly. However, some of these tasks are meant for learning as well as for general setup, so we have defined the following learning objective for this page:

  1. Learn about and then apply some basic reproducible workflows and setups for the initial processing of raw data. For those who have already participated in the intermediate R course, the objective is to review what you previously learned.

3.3 Reading the course website

We will explain this a bit during the course, but read this to start learning how the website is structured and how to read certain things. Specifically, there are a few formatting conventions in the text of this website to be aware of:

  • Folder names always end with a /, for example data/ means the data folder.
  • R variables are always shown as is. For instance, for the code x <- 10, x is a variable because it was assigned with 10.
  • Functions always end with (), for instance mean() or read_csv().
  • Sometimes functions are prefixed with their package name and ::, which tells R to run the function from that specific package, since we likely haven’t loaded the package with library(). For instance, to install packages from GitHub using the pak package we use pak::pkg_install("user/packagename"). You’ll learn about this more later.
  • Reading tasks always start with a statement “Reading task” and are enclosed in a “callout block”. Read within this block. We will usually go over the section again to reinforce the concepts and address any questions.

3.4 Installing the latest programs

The first thing to do is to install these programs. You may already have some of them installed and if you do, please make sure that they are at least the minimum versions listed below. If not, you will need to update them.

Windows:

  1. R: Version 4.1.2 or newer. If you have used R before, you can confirm the version by running R.version.string in the Console.
  2. RStudio: Version v2021.09.0+351 or newer. If you have installed it before, check the current version by going to the menu Help -> About RStudio.
  3. Git: Select the “Click here for download” link. Git is used throughout many sessions in the courses. When installing, it will ask you to select a “Text Editor”. While we won’t be using this in the course, Git needs to know this information, so choose Notepad.
  4. Rtools: The version that says “R-release”. Rtools is needed in order to build some R packages. For some computers, installing Rtools can take some time.

macOS:

  1. R: Version 4.1.2 or newer. If you have used R before, you can confirm the version by running R.version.string in the Console. If you use Homebrew, installing R is as easy as opening a Terminal and running:

    brew install r
  2. RStudio: Version v2021.09.0+351 or newer. If you have installed it before, check the current version by going to the menu Help -> About RStudio. With Homebrew:

    brew install --cask rstudio
  3. Git: Git is used throughout many sessions in the courses. With Homebrew:

    brew install git

Linux:

  1. R: Version 4.1.2 or newer. If you have used R before, you can confirm the version by running R.version.string in the Console.

    sudo apt -y install r-base
  2. RStudio: Version v2021.09.0+351 or newer. If you have installed it before, check the current version by going to the menu Help -> About RStudio.

  3. Git: Git is used throughout many sessions in the courses.

    sudo apt install git

All these programs are required for the course, even Git. Git, which is a software program to formally manage versions of files, is used because of its popularity and the amount of documentation available for it. At the end of the course, you will be using Git and GitHub to manage your group assignment. Check out the online book Happy Git with R, especially the “Why Git” section, for an understanding of why we are teaching Git. Windows users tend to have more trouble with installing Git than macOS or Linux users. See the section on Installing Git for Windows for help.

Some pictures may show a Git pane in RStudio, but you may not see it. If you haven’t created or opened an RStudio R Project (which is taught in the introductory course), the Git pane does not show up. It only shows up in R Projects that use Git to track file changes.

A note to those who have or use work laptops with restrictive administrative privileges: You may encounter problems installing software due to administrative reasons (e.g. you don’t have permission to install things). Even if you have issues installing or updating the latest version of R or RStudio, you will likely be able to continue with the course as long as you have the minimum version listed above for R and for RStudio. If you have versions of R and RStudio that are older than that, you may need to ask your IT department to update your software if you can’t do this yourself. Unfortunately, Git is not a commonly used software for some organizations, so you may not have it installed and you will need to ask IT to install it. We require it for the course, so please make sure to give IT enough time to be able to install it for you prior to the course.
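If you are unsure whether your computer meets these minimums, a rough check can be run from the R Console. This is only a sketch using base R: the variable names are our own for illustration, and Git may be installed without being on the PATH (especially on Windows), so a "not found" result is a hint rather than proof.

```r
# Minimum R version taken from the list above
min_r_version <- "4.1.2"
r_version_ok <- getRversion() >= min_r_version

# Sys.which() returns an empty string when a program is not on the PATH
git_path <- unname(Sys.which("git"))
git_found <- nzchar(git_path)

message(
  "R ", as.character(getRversion()),
  if (r_version_ok) " meets" else " is below",
  " the minimum of ", min_r_version
)
message(
  if (git_found) paste("Git found at", git_path) else "Git was not found on the PATH"
)
```

The RStudio version still needs to be checked through the Help -> About RStudio menu as described above.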

Once R, RStudio, and Git have been installed, open RStudio. If you encounter any trouble during these pre-course tasks, try as best as you can to complete the task and then let us know about the issues in the pre-course survey. If you continue having problems, indicate on the survey that you need help and we can try to book a quick video call to fix the problem. Otherwise, you can come to the course 15-20 minutes early to get help.

If you’re unable to complete the setup procedure due to unfixable technical issues, you can use Posit Cloud (to use RStudio on the cloud) as a final solution in order to participate in the course. For help setting up Posit Cloud for this course, refer to the Posit Cloud setup guide.

3.5 Installing the R packages

We will be using specific R packages for the course, so you will need to install them. A detailed walkthrough for installing the necessary packages is available in the pre-course tasks for installing packages section of the introduction course. However, you only need the r3 helper package in order to install all the necessary packages, which you can do by running these commands in the R Console:

  1. Install the pak package:

    install.packages("pak")

  2. Install the r3 helper package for this course:

    pak::pak("rostools/r3")
  3. Install the necessary packages for the course:

    r3::install_packages_advanced()

You might encounter an error when running this code. That’s ok. You can usually fix it by restarting R with Session -> Restart R and re-running the code in items 2 and 3. If it still doesn’t work, try running:

remotes::install_github("rostools/r3")

If that also doesn’t work, try to complete the other tasks, complete the survey, and let us know you have a problem in the survey.
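Before reporting a problem, it can help to know which of the two helper packages actually installed. The following is a minimal sketch using base R's requireNamespace(); the names needed, is_installed, and missing_pkgs are only for illustration.

```r
# Packages installed in items 1 and 2 above
needed <- c("pak", "r3")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
is_installed

missing_pkgs <- names(is_installed)[!is_installed]
if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("Both helper packages are installed.")
}
```

If anything is listed as not installed, mention those package names in the survey so we know where the installation stopped.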

Note: When you see a command like something::something(), for example with r3::install_packages_advanced(), you would “read” this as:

R, can you please use the install_packages_advanced() function from the r3 package.

The normal way of doing this would be to load the package with library(r3) and then run the command (install_packages_advanced()). But by using the ::, we tell R to directly use a function from a package, without needing to load the package and all of its other functions too. We use this trick because we only want to use the install_packages_advanced() command from the r3 package and not have to load all the other functions as well. In this course we will be using :: often.
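To see the difference in practice, here is a small example with the tools package, which comes with R but is not loaded by default. We picked tools and file_ext() only because they need no installation; the same pattern applies to r3.

```r
# Use one function directly from `tools`, without loading the package
extension <- tools::file_ext("data/lipidomics.csv")
extension
#> [1] "csv"

# The equivalent two-step form: load the whole package, then call the function
library(tools)
file_ext("data/lipidomics.csv")
#> [1] "csv"
```

Both forms give the same result. The :: form simply makes it explicit which package the function comes from.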

3.6 Setting up Git

Since Git has already been covered in the previous courses, we won’t cover learning it during this course. However, since version control is a fundamental component of any modern data analysis workflow and should be used, we will be using it throughout the course. If you have used or currently use Git and GitHub, you can skip these two tasks. If you have not used it, please do these tasks:

  1. Follow the pre-course tasks for Git and GitHub from the introduction course. Specifically, type in the RStudio Console:

    # There will be a pop-up to type in your name (first and 
    # last), as well as your email
    r3::setup_git_config()
  2. Please read through the Version Control lesson of the introduction course. You don’t need to do any of the exercises or activities, but you are welcome to do them if it will help you learn or understand it better. For most of the course, we will be using Git as shown in the Using Git in RStudio section. During the course, we will connect our projects to GitHub, which is described in the Synchronizing with GitHub section.

Regardless of whether you’ve done the steps above or not, everyone needs to run:

r3::check_setup()

The output you’ll get for success will look something like this:

Checking R version:
✔ Your R is at the latest version of 4.3.1!
Checking RStudio version:
✔ Your RStudio is at the latest version of 2023.6.0.421!
Checking Git config settings:
✔ Your Git configuration is all setup!
  Git now knows that:
  - Your name is 'Luke W. Johnston'
  - Your email is 'lwjohnst@gmail.com'

Eventually you will need to copy and paste the output into one of the survey questions. Note that while GitHub is a natural connection to using Git, given the limited time available, we will only be going over aspects of GitHub that relate to storing your project Git repository and setting up the website. If you want to learn more about using GitHub, check out the session on it in the introduction course.
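If you are curious where the name and email in the check output come from, they are stored in your global Git configuration. As a rough sketch (the helper git_config_value() is our own illustration and not part of r3), you could read them from R like this:

```r
# Hypothetical helper: read one value from the global Git configuration.
# Returns NA if Git is not on the PATH or the value has not been set.
git_config_value <- function(key) {
  if (!nzchar(Sys.which("git"))) {
    return(NA_character_)
  }
  out <- suppressWarnings(
    system2(
      "git",
      c("config", "--global", "--get", key),
      stdout = TRUE,
      stderr = FALSE
    )
  )
  if (length(out) == 0) NA_character_ else out[[1]]
}

git_config_value("user.name")
git_config_value("user.email")
```

If either value comes back as NA, run r3::setup_git_config() again and then re-run r3::check_setup().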

3.7 Create an R Project

One of the basic steps to reproducibility and modern workflows in data analysis is to keep everything contained in a single location. In RStudio, this is done with R Projects. Please read all of this section from the introduction course to learn about R Projects and how they help keep things self-contained. You don’t need to do any of the exercises or activities.

There are several ways to organise a project folder. We’ll be using the structure from the package prodigenr. The project setup can be done by either:

  1. Using RStudio’s New Project menu item: “File -> New Project -> New Directory”, scroll down to “Scientific Analysis Project using prodigenr” and name the project “AdvancedR3” in the Directory Name, saving it to the “Desktop” with Browse. Note: You might need to restart RStudio if you don’t see this option.
  2. Or, running the command prodigenr::setup_project("~/Desktop/AdvancedR3") (or other location like Documents) in the R Console and manually switching to it using: File -> Open Project and navigating to the project location.

When the RStudio Project opens up again, run these commands in the R Console to finish the setup:

# Add Git to the project
prodigenr::setup_with_git()
# Create a `functions.R` file in the `R/` folder
usethis::use_r("functions", open = FALSE)
# Ignore this file that gets created by some usethis functions
usethis::use_git_ignore(".Rbuildignore")
# Set some project options to start fresh each time
usethis::use_blank_slate("project")

Here we use the usethis package to help set things up. usethis is an extremely useful package for managing R Projects and we highly recommend checking it out more to see how you can use it in your own work.
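If you want to confirm that the setup commands created what you expect, a quick check with base R is sketched below. The helper project_has() is hypothetical (it is not part of prodigenr or usethis), and the demonstration runs in a temporary folder so that it doesn't touch your real project.

```r
# Hypothetical helper: which of these paths exist inside a project folder?
project_has <- function(project_dir, paths) {
  found <- file.exists(file.path(project_dir, paths))
  names(found) <- paths
  found
}

# Demonstrate on a throwaway folder that only contains R/functions.R
demo_dir <- file.path(tempdir(), "AdvancedR3-demo")
dir.create(file.path(demo_dir, "R"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_dir, "R", "functions.R"))

result <- project_has(demo_dir, c("R/functions.R", "doc/learning.qmd"))
result
```

Inside your own AdvancedR3 project you could run, for example, project_has(".", c("R/functions.R", ".gitignore")) and expect both to be TRUE after the commands above.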

3.8 R Markdown and Quarto

We teach and use R Markdown/Quarto because it is one of the very first steps to being reproducible and because it is a very powerful tool for doing data analysis. You may have heard of or used R Markdown since we’ve used it in both the introduction and intermediate courses. However, you might not have heard of or used Quarto.

Quarto is a next generation version of R Markdown and chances are, if you use a fairly recent version of RStudio, you are already using it without realizing it. That’s because Quarto uses the same Markdown syntax as R Markdown. The main differences are that with Quarto, you can create more types of output documents (like books, websites, slides), you have more options for customization, and it’s easier to do and learn than R Markdown. So, for this course, we will eventually use Quarto to make a website of the analysis that we will do over the course.
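To give a feel for what a Quarto file contains, the sketch below writes a tiny .qmd file to a temporary folder. The content is our own minimal illustration (a YAML header, some Markdown, and one R code chunk) and is not the template that the r3 helper creates.

```r
# Three backticks mark the start and end of a code chunk in a Quarto file
chunk_fence <- strrep("`", 3)

qmd_lines <- c(
  "---",
  "title: \"Example document\"",
  "format: html",
  "---",
  "",
  "## A heading",
  "",
  "Some text with **bold** Markdown.",
  "",
  paste0(chunk_fence, "{r}"),
  "mean(c(1, 2, 3))",
  chunk_fence
)

# Write to a temporary location so your project is not affected
qmd_path <- file.path(tempdir(), "example.qmd")
writeLines(qmd_lines, qmd_path)
readLines(qmd_path)
```

If you open a file like this in RStudio, you would see the same structure as an R Markdown file, only with the .qmd extension.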

Please do these two tasks:

  1. Please read over the R Markdown/Quarto section of the introduction course. If you use R Markdown or Quarto already, you can skip this step.

  2. In the R Console while inside the AdvancedR3 project, run the function to create a new Quarto file called learning.qmd in the doc/ folder.

    r3::create_qmd_doc()

3.9 Download the course data

We’re going to use a real world dataset to demonstrate the concepts in this course. We’re going to use an openly licensed dataset with metabolomics data (1).

Similar to the intermediate course, we will follow the principle of building a reproducible pipeline, from raw data to the report on the analysis. We’ll also follow the principle of keeping the raw data raw and use code to tidy it up. The ultimate goal is to have a record of exactly how we went from raw data to results in the paper. Unlike the intermediate course, where you had to write out each step of the script, in this course you only need to copy and paste the code that will download and minimally process the dataset into an initially workable state. If you want to learn more about how to take a raw dataset and process it into a format that is more suitable for analysis, check out the intermediate course.

Inside the data-raw/ folder, we are going to write R code that downloads a dataset, processes it into a more tidy format, and saves it into the data/ folder. This is the start of our analysis pipeline. As a first step, we need to create the script to do these steps. While in your AdvancedR3 R Project, go to the Console pane in RStudio and type out:

usethis::use_data_raw("nmr-omics")

This function creates a new folder called data-raw/ and an R script called nmr-omics.R in that folder. This is where we will store the raw, original metabolomics data that we’ll get from the website. If you go to the website with the dataset, you’ll notice (when you scroll down) that there are three files: A README file, a metabolomics .xlsx file, and a lipidomics .xlsx file. For now, we only want the README file and lipidomics dataset.

The R script should have opened up for you. If it didn’t, go into the data-raw/ folder and open up the new nmr-omics.R script. The first thing to do is delete all the code in the script. Then, copy and paste the code below into the script.

# Load necessary packages -------------------------------------------------

library(readxl)
library(dplyr)
library(tidyr)
library(here)

# Download dataset --------------------------------------------------------

# From DOI: 10.5281/zenodo.6597902
# Direct URL: https://zenodo.org/record/6597902

# Get both README and the Lipidomics dataset.
nmr_omics_dir <- here("data-raw/nmr-omics")
fs::dir_create(nmr_omics_dir)

download.file("https://zenodo.org/record/6597902/files/README.txt",
  destfile = here(nmr_omics_dir, "README.txt")
)

download.file(
  "https://zenodo.org/record/6597902/files/NMR_Integration_Data_Lipidomics.xlsx",
  destfile = here(nmr_omics_dir, "lipidomics.xlsx"), mode = "wb"
)

# Wrangle dataset into tidy long format -----------------------------------

lipidomics_full <- read_xlsx(
  here(nmr_omics_dir, "lipidomics.xlsx"),
  col_names = paste0("V", 1:40)
)

# There are actually two sets of data in this dataset that we need to split:
# - Lipidomic data
# - Subject level data

# Keep only lipidomic values
lipidomics_only <- lipidomics_full %>%
  # Want to remove columns 2, 3, and 4 since they are "limits"
  # (we don't need them for this course)
  select(-2:-4) %>%
  # Remove the subject data rows
  slice(-1:-4) %>%
  mutate(across(-V1, as.numeric)) %>%
  # Make it so the metabolite values are all in one column,
  # which will make it easier to join with the subject data later.
  pivot_longer(-V1) %>%
  rename(metabolite = V1)

# Keep only subject data
subject_only <- lipidomics_full %>%
  # Remove the metabolite name and limit columns,
  # since we don't need them here
  select(-1:-3) %>%
  # Keep only the subject data rows
  slice(1:4) %>%
  pivot_longer(cols = -V4) %>%
  pivot_wider(names_from = V4, values_from = value) %>%
  # There is a weird "​" before some of the numbers, so we have
  # to extract just the number first before converting to numeric.
  mutate(Age = as.numeric(stringr::str_extract(Age, "\\d+"))) %>%
  rename_with(snakecase::to_snake_case)

lipidomics <- full_join(
  subject_only,
  lipidomics_only
) %>%
  # Don't need anymore
  select(-name)

# Save to `data/` ---------------------------------------------------------

readr::write_csv(lipidomics, here::here("data/lipidomics.csv"))

Since this is an advanced course, you can run the lines one at a time and see what they do on your own (or source() them all at once). The comments provided give guidance on what the code is doing and why. In the end, though, the only important thing is to run all the code and get the lipidomics dataset to be saved as data/lipidomics.csv. The created files should look like this:
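If the split, pivot, and join steps in the script feel abstract, the toy example below applies the same idea to a made-up table with two subjects. The values and the toy_ names are invented for illustration, and the layout is simplified compared to the real spreadsheet: subject details sit in the first rows, metabolite values in the rows after.

```r
library(dplyr)
library(tidyr)

# Made-up "wide" data: one column per subject, subject details in rows 1-2
toy_raw <- tibble::tribble(
  ~V1,           ~V2,    ~V3,
  "Code",        "A1",   "B2",
  "Age",         "25",   "31",
  "Cholesterol", "19.8", "22.4",
  "PUFA",        "29.0", "30.5"
)

# Subject rows: turn each subject column into one row
toy_subjects <- toy_raw %>%
  slice(1:2) %>%
  pivot_longer(-V1) %>%
  pivot_wider(names_from = V1, values_from = value) %>%
  mutate(Age = as.numeric(Age))

# Metabolite rows: one row per metabolite per subject column
toy_values <- toy_raw %>%
  slice(-1:-2) %>%
  mutate(across(-V1, as.numeric)) %>%
  pivot_longer(-V1) %>%
  rename(metabolite = V1)

# Join on the original column name, then drop it
toy_tidy <- full_join(toy_subjects, toy_values, by = "name") %>%
  select(-name)
toy_tidy
```

The result has one row per subject and metabolite (four rows here), which is the same long format that the full script produces for the lipidomics data.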

AdvancedR3
├── data
│   └── lipidomics.csv
└── data-raw
    ├── nmr-omics
    │   ├── README.txt
    │   └── lipidomics.xlsx
    └── nmr-omics.R

And the created dataset in data/lipidomics.csv should look like (using readr::read_csv()):

lipidomics <- readr::read_csv(here::here("data/lipidomics.csv"))
lipidomics
#> # A tibble: 504 × 6
#>    code   gender   age class metabolite                value
#>    <chr>  <chr>  <dbl> <chr> <chr>                     <dbl>
#>  1 ERI109 M         25 CT    TMS (interntal standard) 208.  
#>  2 ERI109 M         25 CT    Cholesterol               19.8 
#>  3 ERI109 M         25 CT    Lipid CH3- 1              44.1 
#>  4 ERI109 M         25 CT    Lipid CH3- 2             147.  
#>  5 ERI109 M         25 CT    Cholesterol               27.2 
#>  6 ERI109 M         25 CT    Lipid -CH2-              587.  
#>  7 ERI109 M         25 CT    FA -CH2CH2COO-            31.6 
#>  8 ERI109 M         25 CT    PUFA                      29.0 
#>  9 ERI109 M         25 CT    Phosphatidylethanolamine   6.78
#> 10 ERI109 M         25 CT    Phosphatidycholine        41.7 
#> # ℹ 494 more rows

Take a look through the downloaded data-raw/nmr-omics/README.txt and data-raw/nmr-omics/lipidomics.xlsx files, as well as the created data/lipidomics.csv to begin better understanding the dataset.
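As an optional sanity check, you could compare the saved file against the column names and row count shown in the output above. This is only a sketch with base R; it assumes you run it from the AdvancedR3 project folder and it does nothing harmful if the file isn't there yet.

```r
csv_path <- file.path("data", "lipidomics.csv")

# Column names and row count taken from the printed tibble above
expected_cols <- c("code", "gender", "age", "class", "metabolite", "value")
expected_rows <- 504

if (file.exists(csv_path)) {
  lipidomics_check <- read.csv(csv_path)
  message("Columns match: ", identical(names(lipidomics_check), expected_cols))
  message("Row count matches: ", nrow(lipidomics_check) == expected_rows)
} else {
  message("Could not find ", csv_path, ". Have you run the script yet?")
}
```

If either check reports FALSE, re-run the data-raw/nmr-omics.R script from the top.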

We are using Git to track changes made to files in our project. The original metabolomics dataset is stored on Zenodo, so we don’t actually need to keep the raw data files in our Git history. So let’s tell Git to ignore the files created in the data-raw/nmr-omics/ folder. In the Console, type out the code below. You only need to do this once.

usethis::use_git_ignore("data-raw/nmr-omics/")

Next, check that the project is set up as expected by running this function in the Console:

r3::check_project_setup_advanced()
Show folders and files of project:
• Please copy and paste this output into the survey question:
/home/luke/Desktop/AdvancedR3
├── AdvancedR3.Rproj
├── DESCRIPTION
├── R
│   ├── README.md
│   └── functions.R
├── README.md
├── TODO.md
├── data
│   ├── README.md
│   └── lipidomics.csv
├── data-raw
│   ├── README.md
│   ├── nmr-omics
│   │   ├── README.txt
│   │   └── lipidomics.xlsx
│   └── nmr-omics.R
└── doc
    ├── README.md
    ├── learning.qmd
    └── report.Rmd

The output should look something like the above text. If it doesn’t, start over by deleting everything in the data-raw/ folder except for the data-raw/nmr-omics.R script and re-running the script. If your output looks similar to the above, then copy and paste the output into the survey question at the end.

3.10 Course introduction

Most of the course description is found in the syllabus (Chapter 1). If you haven’t read it, please read it now. Read over what the course will cover, what we expect you to learn at the end of it, and what our basic assumptions are about who you are and what you know. The final pre-course task is to complete a survey that asks if you’ve read it and if it matches you.

One goal of the course is to teach about open science, and true to our mission, we practice what we preach. The course material is publicly accessible (all on this website) and openly licensed so you can use and re-use it for free! The material and table of contents on the side are listed in the order that we will cover them in the course.

We have a Code of Conduct. If you haven’t read it, read it now. The survey at the end will ask about the Code of Conduct. We want to make sure this course is a supportive and safe environment for learning, so this Code of Conduct is important.

You’re almost done. Please fill out the pre-course survey to finish this assignment.

See you at the course!