3  Pre-course tasks

In order to participate in this course, you must complete the pre-course tasks in this section and the survey at the end. These tasks are designed so that everyone starts the course with everything ready to go. The reason for some of the tasks might not be obvious yet, but it will likely become clear once the course begins.

Depending on your skills and knowledge, these tasks could take between 5 and 7 hours to finish, so we suggest planning a full day to complete them. Depending on your institution and how it handles installing software on work computers, you might also have to contact IT very early to make sure everything is properly installed and set up.

3.1 List of tasks

Here’s a quick overview of the tasks you need to do. Specific details about them are found as you work through this section.

  1. Read the learning objectives in Section 3.2 for the pre-course tasks.
  2. Read about how to read this website in Section 3.3.
  3. Install the necessary programs and the right versions of those programs in Section 3.4. For some people, depending on their institution, this task can take the longest because you have to contact your IT department to install these programs.
  4. Install the necessary R packages in Section 3.5.
  5. Correctly set up Git on your computer in Section 3.6, if you haven’t done this already from previous courses. If you haven’t used Git before, this task could take a while because of the reading.
  6. Run a check with r3::check_setup() to see if everything works. You’ll later need to paste this output into the survey.
  7. Connect your computer with GitHub in Section 3.7.
  8. Create an R Project in Section 3.8, along with the folder and file setup.
  9. Create a Quarto file.
  10. Write (well, mostly copy and paste) R code to download the data and save it to your computer. This task will probably take 30-60 minutes, depending on your interest in exploring the data.
  11. Run a check using r3::check_project_setup_advanced() to see that everything is as expected. You’ll later need to paste this output into the survey.
  12. Set some options in RStudio in Section 3.11.
  13. Read about the basic course details in Section 3.12.
  14. Read the syllabus in Chapter 1.
  15. Read the Code of Conduct.
  16. Pre-read some content in Section 3.13.
  17. Complete the pre-course survey. This survey is pretty quick, maybe ~10 minutes.

Check each section for exact details on completing these tasks.

3.2 Learning objective

In general, these pre-course tasks are meant to help prepare you for the course and make sure everything is set up properly so the first session runs smoothly. However, some of these tasks are meant for learning as well as for general setup, so we have defined the following learning objective for this page:

  1. Learn about and then apply some basic reproducible workflows and setups for the initial processing of raw data. For those who have already participated in the intermediate R course, the objective is to review what you previously learned.

3.3 Reading the course website

We will explain this a bit during the course, but read this to start learning how the website is structured and how to read certain things. Specifically, there are a few formatting conventions (a kind of “syntax”) used in the text of this website to be aware of:

  • Folder names always end with a /, for example data/ means the data folder.
  • R variables are always shown as is. For instance, for the code x <- 10, x is a variable because it was assigned with 10.
  • Functions always end with (), for instance mean() or read_csv().
  • Sometimes functions have their package name prefixed with :: to indicate that the function should be run from that specific package, since we likely haven’t loaded the package with library(). For instance, to install packages from GitHub using the pak package we use pak::pkg_install("user/packagename"). You’ll learn about this more later.
  • Reading tasks always start with a statement “Reading task” and are enclosed in a “callout block”. Read within this block. We will usually go over the section again to reinforce the concepts and address any questions.

3.4 Installing the latest programs

The first thing to do is to install these programs. You may already have some of them installed and if you do, please make sure that they are at least the minimum versions listed below. If not, you will need to update them.

Windows

  1. R: Any version above r min(r3admin::get_allowed_r_versions()). If you have used R before, you can confirm the version by running R.version.string in the Console.
  2. RStudio: Any version above r r3admin::get_allowed_rstudio_versions()[1]. If you have installed it before, check the current version by going to the menu Help -> About RStudio.
  3. Git: Select the “Click here to download” link. Git is used throughout many sessions in the courses. When installing, it will ask you to select a “Text Editor”. While we won’t be using this in the course, Git needs to know this information, so choose Notepad.
  4. Rtools: Choose the version that says “R-release”. Rtools is needed in order to build some R packages. For some computers, installing Rtools can take some time.

macOS

  1. R: Any version above r min(r3admin::get_allowed_r_versions()). If you have used R before, you can confirm the version by running R.version.string in the Console. If you use Homebrew, installing R is as easy as opening a Terminal and running:

    brew install --cask r
  2. RStudio: Any version above r r3admin::get_allowed_rstudio_versions()[1]. If you have installed it before, check the current version by going to the menu Help -> About RStudio. With Homebrew:

    brew install --cask rstudio
  3. Git: Git is used throughout many sessions in the courses. With Homebrew:

    brew install git

Linux

  1. R: Any version above r min(r3admin::get_allowed_r_versions()). If you have used R before, you can confirm the version by running R.version.string in the Console. With apt (Debian or Ubuntu), run in a Terminal:

    sudo apt -y install r-base
  2. RStudio: Any version above r r3admin::get_allowed_rstudio_versions()[1]. If you have installed it before, check the current version by going to the menu Help -> About RStudio.

  3. Git: Git is used throughout many sessions in the courses. With apt:

    sudo apt install git

All these programs are required for the course, even Git. Git, which is a software program to formally manage versions of files, is used because of its popularity and the amount of documentation available for it. Check out the online book Happy Git with R, especially the “Why Git” section, for an understanding of why we are teaching Git. Windows users tend to have more trouble with installing Git than macOS or Linux users. See the section on Installing Git for Windows for help.
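
If you want a quick look at what is installed before contacting IT, the following is a minimal sketch that uses only base R. The minimum version in it is a made-up placeholder for illustration, not the course's actual minimum, and the variable names are our own.

```r
# Hypothetical minimum version, used only for illustration. The real
# minimum is the one listed for the course.
minimum_r_version <- "4.2.0"

# getRversion() returns the running R version in a form that can be compared.
r_is_recent_enough <- getRversion() >= minimum_r_version

# Sys.which() returns an empty string when a program is not on the PATH.
git_is_installed <- unname(nzchar(Sys.which("git")))

message("R version: ", R.version.string)
message("R meets the illustrative minimum: ", r_is_recent_enough)
message("Git found on the PATH: ", git_is_installed)
```

This only tells you whether Git can be found from R. It does not check the RStudio version, which you can see in Help -> About RStudio.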

Note

Some pictures may show a Git pane in RStudio, but you may not see it. If you haven’t created or opened an RStudio R Project (which is taught in the introductory course), the Git pane does not show up. It only shows up in R Projects that use Git to track file changes.

Note

A note to those who have or use work laptops with restrictive administrative privileges: You may encounter problems installing software due to administrative reasons (e.g. you don’t have permission to install things). Even if you have issues installing or updating the latest version of R or RStudio, you will likely be able to continue with the course as long as you have the minimum version listed above for R and for RStudio. If you have versions of R and RStudio that are older than that, you may need to ask your IT department to update your software if you can’t do this yourself. Unfortunately, Git is not a commonly used software for some organizations, so you may not have it installed and you will need to ask IT to install it. We require it for the course, so please make sure to give IT enough time to be able to install it for you prior to the course.

Once R, RStudio, and Git have been installed, open RStudio. If you run into trouble during these pre-course tasks, try as best as you can to complete the task and then let us know about the issues in the pre-course survey. If you continue having problems, indicate on the survey that you need help and we can try to book a quick video call to fix the problem. Otherwise, you can come to the course 15-20 minutes early to get help.

If you’re unable to complete the setup procedure due to unfixable technical issues, you can use Posit Cloud (to use RStudio on the cloud) as a final solution in order to participate in the course. For help setting up Posit Cloud for this course, refer to the Posit Cloud setup guide.

3.5 Installing the R packages

We will be using specific R packages for the course, so you will need to install them. A detailed walkthrough for installing the necessary packages is available in the pre-course tasks for installing packages section of the introduction course. However, you only need the r3 helper package in order to install all the necessary packages, which you do by running these commands in the R Console:

  1. Install the pak package:

    install.packages("pak")
  2. Install the r3 helper package for this course:

    pak::pak("rostools/r3")
  3. Install the necessary packages for the course:

    r3::install_packages_advanced()
Warning

You might encounter an error when running this code. That’s ok. Restart R by going to Session -> Restart R, then re-run the code in items 2 and 3, and it should work. If it still doesn’t, try running:

remotes::install_github("rostools/r3")

If that also doesn’t work, try to complete the other tasks, complete the survey, and let us know you have a problem in the survey.

Note: When you see a command like something::something(), for example with r3::install_packages_advanced(), you would “read” this as:

R, can you please use the install_packages_advanced() function from the r3 package.

The normal way of doing this would be to load the package with library(r3) and then run the command (install_packages_advanced()). But by using the ::, we tell R to directly use a function from a package, without needing to load the package and all of its other functions too. We use this trick because we only want to use the install_packages_advanced() command from the r3 package and not have to load all the other functions as well. In this course we will be using :: often.
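
To make the difference concrete, here is a small sketch comparing the two approaches. It uses the tools package only because it ships with R and is not loaded by default, so nothing needs to be installed. The variable names are our own.

```r
# Use a function directly from a package without loading the package.
extension_direct <- tools::file_ext("data/lipidomics.csv")

# The equivalent with library(): load the package first, then call
# the function by its name alone.
library(tools)
extension_loaded <- file_ext("data/lipidomics.csv")

# Both approaches give the same result.
identical(extension_direct, extension_loaded)
```

Both calls run the same function. The :: form simply makes it explicit which package the function comes from.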

3.6 Setting up Git

Since Git has already been covered in the previous courses, we won’t cover learning it during this course. However, since version control is a fundamental component of any modern data analysis workflow and should be used, we will be using it throughout the course. If you have used or currently use Git and GitHub, you can skip these two tasks. If you have not used Git or GitHub, please do these tasks:

  1. Run this in the RStudio Console and follow the instructions:

    Console
    # There will be a pop-up to type in your name (first and 
    # last), as well as your email
    r3::setup_git_config()
  2. Please ONLY read through the Version Control lesson of the introduction course, but DO NOT DO any coding tasks or exercises. You are welcome to do them if it will help you learn or understand it better, but it is NOT required. For most of the course, we will be using Git as shown in the Using Git in RStudio section. During the course, we will connect our projects to GitHub, which is described below.

Regardless of whether you’ve done the steps above or not, everyone needs to run:

Console
r3::check_setup()

The output you’ll get for success will look something like this:

Checking R version:
✔ Your R is at the latest version of 4.4.1!
Checking RStudio version:
✔ Your RStudio is at the latest version of 2024.9.0.375!
Checking Git config settings:
✔ Your Git configuration is all setup!
  Git now knows that:
  - Your name is 'Luke W. Johnston'
  - Your email is 'lwjohnst@gmail.com'

Eventually you will need to copy and paste the output into one of the survey questions. Note that while GitHub is a natural connection to using Git, given the limited time available, we will only be going over aspects of GitHub that relate to storing your project Git repository and working together. If you want to learn more about using GitHub, check out the session on it in the introduction course.
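
If you are curious where the name and email in the output come from, they are stored in Git's global configuration. The sketch below reads them by calling the git command line program from R. The helper git_global_setting() is our own illustration and not part of r3, and it returns NA when Git cannot be found or the setting is empty.

```r
# Hypothetical helper: read one value from Git's global configuration.
git_global_setting <- function(key) {
  # Sys.which() returns an empty string when Git is not on the PATH.
  if (!nzchar(Sys.which("git"))) {
    return(NA_character_)
  }
  value <- suppressWarnings(
    system2(
      "git",
      c("config", "--global", key),
      stdout = TRUE,
      stderr = FALSE
    )
  )
  # An unset key gives no output at all.
  if (length(value) == 0) NA_character_ else value[[1]]
}

git_global_setting("user.name")
git_global_setting("user.email")
```

This is only for exploring. For the survey, use the output from r3::check_setup().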

3.7 Connect with GitHub

Because we’ll be pushing and pulling to GitHub throughout the course, as well as using GitHub to collaborate with others in the project work, you need to set up your computer to connect with GitHub.

Any time we do anything on the Internet, there is some risk to having our information maliciously hacked. This is no different when using GitHub, so if we can, we should try to be more secure with what we send across the internet. In fact, most functions that relate to Git or using GitHub require using more secure features in order to work. usethis makes this much easier, thankfully, with several functions. The usethis website has a really well written guide on setting it up. Here is a very simplified version of what they recommend that is relevant for what we are doing in this course.

  • Use personal access tokens (PAT, or simply called a “token”) when interacting with your GitHub remote repositories while outside of the GitHub website (e.g. when using R or usethis). PATs are like temporary passwords that provide limited access to your GitHub account, like being able to read or write to your GitHub repositories, but not being able to delete them. They are useful because you can easily delete the PAT if you feel someone got access to it and prevent it from being used, unlike your own password which you would have to manually change if it was stolen.

  • Use a password manager to save the PAT for later use. Using password managers is basically a requirement for having secure online accounts, because they can generate random and long passwords that you don’t have to remember.

  • Use packages like gitcreds to give usethis access to the PAT and to interact with your GitHub repositories. You normally would use gitcreds every time you restart your computer or after a certain period of time.

What is a password manager?

A password manager is an app or web service that lets you save or create passwords for all your accounts, like banking or social media. Instead of having to remember multiple passwords used across multiple accounts, or the very insecure approach of one or two passwords for all your accounts, you instead need to remember only one very secure password that unlocks all your other very secure passwords. Search online for “password manager” and your operating system (Windows, macOS) to find possible ones to install or use.

Bitwarden is a very good password manager that is easy to use and the free version has everything you need to manage, store, and create passwords.

You very likely haven’t set up a PAT, but if you are uncertain, you can always check with:

Console
usethis::gh_token_help()
• GitHub host: 'https://github.com'
• Personal access token for 'https://github.com': <unset>
• To create a personal access token, call `create_github_token()`
• To store a token for current and future use, call `gitcreds::gitcreds_set()`
ℹ Read more in the 'Managing Git(Hub) Credentials' article:
  https://usethis.r-lib.org/articles/articles/git-credentials.html

If the output says that the token is <unset> like the above text does, that means we need to make Git and usethis aware of the token. We do that by typing the next function in the Console to create the token on GitHub (if you haven’t created one already).

Console
usethis::create_github_token()

This function sends us to the GitHub “Generate new token” webpage with all the necessary settings checked. Set the “Expiry date” to 90 days (this is a good security feature). Then, click the green button at the bottom called “Generate token” and you’ll have a very long string generated for you that starts with ghp_. Save this token in your password manager (see note above). This is the token you will use every time you open up RStudio and interact with GitHub through R. You do not need to create a new token for each R project or package you make, you only need to create one after your current token expires (typically every couple of months), if you’ve forgotten the token or lost it, or if you’ve changed to a new computer.
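
Copying a long token by hand can easily pick up a stray space or drop a character. As a rough sanity check, the sketch below tests only that a string starts with ghp_ and contains no unexpected characters. The helper looks_like_github_token() is our own illustration, not part of usethis or gitcreds, and it says nothing about whether GitHub will accept the token.

```r
# Hypothetical helper: check only the format of a token string, meaning
# the "ghp_" prefix followed by letters, digits, or underscores.
looks_like_github_token <- function(token) {
  grepl("^ghp_[A-Za-z0-9_]+$", token)
}

# Made-up values, not real tokens.
looks_like_github_token("ghp_exampleTOKEN123")
looks_like_github_token(" ghp_exampleTOKEN123") # Has a leading space.
```

Avoid saving a real token in a script or in your R history. Keep it in your password manager.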

In the Console, run:

Console
gitcreds::gitcreds_set()

And then copy and paste your token into the prompt in the Console. This token usually gets saved for the day (it gets cached), but after restarting your computer, you will need to run the action again. If it asks to replace an existing one, select the “yes” option. Doing this is a bit like using the two-factor authentication (2FA) you often have to do when, for instance, accessing your online bank account or other government website. In this case, you are telling GitHub (when interacting with it through RStudio, like uploading and downloading your changes) that you are who you claim to digitally be.

Tip

There is another great helper function that runs a lot of checks and gives some advice when it finds potential problems.

Console
usethis::git_sitrep()

Be aware that this function outputs a lot of information, most of which you probably don’t need to know or won’t understand yet. That’s ok, since it is meant as a diagnostic tool.

3.8 Create an R Project

One of the basic steps to reproducibility and modern workflows in data analysis is to keep everything contained in a single location. In RStudio, this is done with R Projects. Please read all of this section from the introduction course to learn about R Projects and how they help keep things self-contained. You don’t need to do any of the exercises or activities.

There are several ways to organise a project folder. We’ll be using the structure from the package prodigenr. The project setup can be done by either:

  1. Using RStudio’s New Project menu item: “File -> New Project -> New Directory”, scroll down to “Scientific Analysis Project using prodigenr” and name the project “AdvancedR3” in the Directory Name, saving it to the “Desktop” with Browse. Note: You might need to restart RStudio if you don’t see this option.
  2. Or, running the command prodigenr::setup_project("~/Desktop/AdvancedR3") (or other location like Documents) in the R Console and manually switching to it using: File -> Open Project and navigating to the project location.

When the RStudio Project opens up again, run these commands in the R Console to finish the setup:

Console
# Add Git to the project
prodigenr::setup_with_git()
# Create a `functions.R` file in the `R/` folder
usethis::use_r("functions", open = FALSE)
# Ignore this file that gets created by some usethis functions
usethis::use_git_ignore(".Rbuildignore")
# Set some project options to start fresh each time
usethis::use_blank_slate("project")

Here we use the usethis package to help set things up. usethis is an extremely useful package for managing R Projects and we highly recommend checking it out more to see how you can use it in your own work.

3.9 R Markdown and Quarto

We teach and use R Markdown/Quarto because it is one of the very first steps to being reproducible and because it is a very powerful tool for doing data analysis. You may have heard of or used R Markdown since we’ve used it in both the introduction and intermediate courses. However, you might not have heard of or used Quarto.

Quarto is a next generation version of R Markdown and chances are, if you use a fairly recent version of RStudio, you are already using it without realizing it. That’s because Quarto uses the same Markdown syntax as R Markdown. The main differences are that with Quarto, you can create more types of output documents (like books, websites, slides), you have more options for customization, and it’s easier to do and learn than R Markdown. So, for this course, we will use Quarto to create a report that includes the analysis we will do over the course.

Please do these two tasks:

  1. Please read over the R Markdown/Quarto section of the introduction course. If you use R Markdown or Quarto already, you can skip this step.

  2. In the R Console while inside the AdvancedR3 project, run the function to create a new Quarto file called learning.qmd in the doc/ folder.

Console
r3::create_qmd_doc()

3.10 Download the course data

We’re going to use a real-world, openly licensed dataset with metabolomics data (1) to demonstrate the concepts in this course.

Similar to the intermediate course, we will follow the principle of building a reproducible pipeline, from raw data to the report on the analysis. We’ll also follow the principle of keeping the raw data raw and use code to tidy it up. The ultimate goal is to have a record of exactly how we went from raw data to results in the paper. Unlike the intermediate course, where you had to write each step of the script yourself, in this course you only need to copy and paste the code that will download and minimally process the dataset into an initially workable state. If you want to learn more about how to take a raw dataset and process it into a format that is more suitable for analysis, check out the intermediate course.

Inside the data-raw/ folder, we are going to write R code that downloads a dataset, processes it into a more tidy format, and saves it into the data/ folder. This is the start of our analysis pipeline. As a first step, we need to create the script to do these steps. While in your AdvancedR3 R Project, go to the Console pane in RStudio and type out:

Console
usethis::use_data_raw("nmr-omics")

This function creates a new folder called data-raw/ and an R script called nmr-omics.R in that folder. This is where we will store the raw, original metabolomics data that we’ll get from the website. If you go to the website with the dataset, you’ll notice (when you scroll down) that there are three files: a README file, a metabolomics .xlsx file, and a lipidomics .xlsx file. For now, we only want the README file and the lipidomics dataset.

The R script should have opened up for you. If it didn’t, go into the data-raw/ folder and open up the new nmr-omics.R script. The first thing to do is delete all the code in the script. Then, copy and paste the code below into the script.

data-raw/nmr-omics.R
# Load necessary packages -------------------------------------------------

library(readxl)
library(dplyr)
library(tidyr)
library(here)

# Download dataset --------------------------------------------------------

# From DOI: 10.5281/zenodo.6597902
# Direct URL: https://zenodo.org/record/6597902

# Get both README and the Lipidomics dataset.
nmr_omics_dir <- here("data-raw/nmr-omics")
fs::dir_create(nmr_omics_dir)

download.file("https://zenodo.org/record/6597902/files/README.txt",
  destfile = here(nmr_omics_dir, "README.txt")
)

download.file(
  "https://zenodo.org/record/6597902/files/NMR_Integration_Data_Lipidomics.xlsx",
  destfile = here(nmr_omics_dir, "lipidomics.xlsx"), mode = "wb"
)

# Wrangle dataset into tidy long format -----------------------------------

lipidomics_full <- read_xlsx(
  here(nmr_omics_dir, "lipidomics.xlsx"),
  col_names = paste0("V", 1:40)
)

# There are actually two sets of data in this dataset that we need to split:
# - Lipidomic data
# - Subject level data

# Keep only lipidomic values
lipidomics_only <- lipidomics_full |>
  # Want to remove columns 2, 3, and 4 since they are "limits"
  # (we don't need them for this course)
  select(-2:-4) |>
  # Remove the subject data rows
  slice(-1:-4) |>
  mutate(across(-V1, as.numeric)) |>
  # Make it so the metabolite values are all in one column,
  # which will make it easier to join with the subject data later.
  pivot_longer(-V1) |>
  rename(metabolite = V1)

# Keep only subject data
subject_only <- lipidomics_full |>
  # Remove the first metabolic name and limit columns,
  # don't need for this
  select(-1:-3) |>
  # Keep only the subject data rows
  slice(1:4) |>
  pivot_longer(cols = -V4) |>
  pivot_wider(names_from = V4, values_from = value) |>
  # There is a weird "​" before some of the numbers, so we have to
  # extract just the number first before converting to numeric.
  mutate(Age = as.numeric(stringr::str_extract(Age, "\\d+"))) |>
  rename_with(snakecase::to_snake_case)

lipidomics <- full_join(
  subject_only,
  lipidomics_only
) |>
  # Don't need anymore
  select(-name)

# Save to `data/` ---------------------------------------------------------

readr::write_csv(lipidomics, here::here("data/lipidomics.csv"))

Since this is an advanced course, you can run the lines one at a time and see what they do on your own (or source() them all at once). The comments provided give guidance on what the code is doing and why. In the end, though, the only important thing is to run all the code and get the lipidomics dataset to be saved as data/lipidomics.csv. The created files should look like this:

AdvancedR3
├── data
│   └── lipidomics.csv
└── data-raw
    ├── nmr-omics
    │   ├── README.txt
    │   └── lipidomics.xlsx
    └── nmr-omics.R

And when using readr::read_csv(), the created dataset in data/lipidomics.csv should look like:

readr::read_csv(here::here("data/lipidomics.csv"))
# A tibble: 504 × 6
   code   gender   age class metabolite                value
   <chr>  <chr>  <dbl> <chr> <chr>                     <dbl>
 1 ERI109 M         25 CT    TMS (interntal standard) 208.  
 2 ERI109 M         25 CT    Cholesterol               19.8 
 3 ERI109 M         25 CT    Lipid CH3- 1              44.1 
 4 ERI109 M         25 CT    Lipid CH3- 2             147.  
 5 ERI109 M         25 CT    Cholesterol               27.2 
 6 ERI109 M         25 CT    Lipid -CH2-              587.  
 7 ERI109 M         25 CT    FA -CH2CH2COO-            31.6 
 8 ERI109 M         25 CT    PUFA                      29.0 
 9 ERI109 M         25 CT    Phosphatidylethanolamine   6.78
10 ERI109 M         25 CT    Phosphatidycholine        41.7 
# ℹ 494 more rows

Take a look through the downloaded data-raw/nmr-omics/README.txt and data-raw/nmr-omics/lipidomics.xlsx files, as well as the created data/lipidomics.csv to begin better understanding the dataset.
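
If the pivot_longer() and pivot_wider() steps in the script are hard to follow, it can help to run them on a tiny table first. The sketch below uses made-up values arranged like the subject rows of the spreadsheet, with one row per attribute and one column per subject. The object names and values are our own and are not taken from the real dataset.

```r
library(tidyr)

# Made-up data in the same layout as the subject rows of the spreadsheet.
toy_subjects <- tibble::tibble(
  V4 = c("Code", "Age"),
  V5 = c("ERI109", "25"),
  V6 = c("ERI111", "39")
)

# Step 1: stack the subject columns, giving one row per combination
# of attribute and subject (columns V4, name, value).
toy_long <- pivot_longer(toy_subjects, cols = -V4)

# Step 2: spread the attributes into their own columns, giving one
# row per subject (columns name, Code, Age).
toy_tidy <- pivot_wider(toy_long, names_from = V4, values_from = value)

toy_tidy
```

The name column is what the script later uses to join the subject data with the lipidomic values, which is why it is only removed after the join.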

We are using Git to track changes made to files in our project. The original metabolomics dataset is stored on Zenodo, so we don’t actually need to keep the raw data files in our Git history. So let’s tell Git to ignore the files created in the data-raw/nmr-omics/ folder. In the Console, type out the code below. You only need to do this once.

Console
usethis::use_git_ignore("data-raw/nmr-omics/")

Next, run this command to check that everything is setup correctly. You will need to paste this output into the survey at the end.

Console
r3::check_project_setup_advanced()
Show folders and files of project:
• Please copy and paste this output into the survey question:
/home/luke/Desktop/AdvancedR3
├── AdvancedR3.Rproj
├── DESCRIPTION
├── R
│   ├── README.md
│   └── functions.R
├── README.md
├── TODO.md
├── data
│   ├── README.md
│   └── lipidomics.csv
├── data-raw
│   ├── README.md
│   ├── nmr-omics
│   │   ├── README.txt
│   │   └── lipidomics.xlsx
│   └── nmr-omics.R
└── doc
    ├── README.md
    ├── learning.qmd
    └── report.Rmd

The output should look similar to the above text. If it doesn’t, start over by deleting everything in the data-raw/ folder except for the data-raw/nmr-omics.R script and running the script again. If your output looks similar to the above, then copy and paste the output into the survey question at the end.
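
If something is missing and you want to narrow down which file it is, a short base R check can help. The sketch below is our own illustration and not a replacement for r3::check_project_setup_advanced(). The helper find_missing_files() is hypothetical and the list of expected files is taken from the folder tree shown above.

```r
# Hypothetical helper: return the expected paths that do not exist
# inside the given project folder.
find_missing_files <- function(project_dir, expected) {
  expected[!file.exists(file.path(project_dir, expected))]
}

expected_files <- c(
  "R/functions.R",
  "data/lipidomics.csv",
  "data-raw/nmr-omics.R",
  "data-raw/nmr-omics/README.txt",
  "data-raw/nmr-omics/lipidomics.xlsx",
  "doc/learning.qmd"
)

# Run from inside the AdvancedR3 project. An empty result means that
# none of the expected files are missing.
find_missing_files(".", expected_files)
```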

3.11 Set some quality of life options

Some of the most common “issues” we encounter in the course when it comes to Git are caused by files not being saved, so the changes can’t be seen by Git. To help with this, we suggest turning on some options in RStudio. While there are many options inside the Global Options in RStudio that can help you work better and faster, the ones we want you to use will help both you and us during the course:

  • Go into Tools -> Global Options -> Code -> Saving.
    • Under the heading “General”, tick all of the check boxes.
    • Under the heading “Auto-save”, tick both of the check boxes.

3.12 Course introduction

Most of the course description is found in the syllabus (Chapter 1). If you haven’t read it, please read it now. Read over what the course will cover, what we expect you to learn at the end of it, and what our basic assumptions are about who you are and what you know. The final pre-course task is to complete a survey that asks if you’ve read it and if it matches you.

We have a Code of Conduct. If you haven’t read it, read it now. The survey at the end will ask about Conduct. We want to make sure this course is a supportive and safe environment for learning, so this Code of Conduct is important.

3.13 Brief pre-reading

An important component of learning is repeated exposure to a concept or skill. So, while we will tell you this information during the course, and in many cases you will also read it during the course, we want to introduce some repetition here by having you read the sections listed below. This will not only mentally prepare you for the content of the course, but will also give you a bigger overview of what you will be learning and doing. So, please read:

  1. The “big picture” in Section 4.1.
  2. Each session’s learning objectives.

3.14 Fill in survey

You’re almost done! 🎉 Please fill out the pre-course survey to finish this assignment.

See you at the course!
