To participate in this course, you must complete the pre-course tasks in this section and complete the survey at the end. These tasks are designed to make it easier for everyone to start the course with everything ready to go. For some of the tasks, you might not understand why you need to do them, but you will likely understand why once the course begins.
Depending on your skills and knowledge, these tasks could take 5 to 7 hours to finish, so we suggest planning a full day to complete them. Depending on your institution and how it handles installing software on work computers, you might also have to contact IT very early to make sure everything is properly installed and set up.
3.1 List of tasks
Here’s a quick overview of the tasks you need to do. Specific details about them are found as you work through this section.
Read the learning objectives in Section 3.2 for the pre-course tasks.
Read about how to read this website in Section 3.3.
Install the necessary programs and the right versions of those programs in Section 3.4. For some people, depending on their institution, this task can take the longest amount of time because you have to contact your IT to install these packages.
Correctly set up Git on your computer in Section 3.6, if you haven’t done this already from previous courses. If you haven’t used Git before, this task could take a while because of the reading.
Run a check with r3::check_setup() to see if everything works. You’ll later need to paste this output into the survey.
Create an R Project in Section 3.8, along with the folder and file setup.
Create a Quarto file.
Write (well, mostly copy and paste) R code to download the data and save it to your computer. This task will probably take 30-60 minutes, depending on your interest in exploring the data.
Run a check using r3::check_project_setup_advanced() to see that everything is as expected. You’ll later need to paste this output into the survey.
Complete the pre-course survey. This survey is pretty quick, about 10 minutes.
Check each section for exact details on completing these tasks.
3.2 Learning objective
In general, these pre-course tasks are meant to help prepare you for the course and make sure everything is set up properly so the first session runs smoothly. However, some of these tasks are meant for learning as well as for general setup, so we have defined the following learning objectives for this page:
Learn about and then apply some basic reproducible workflows and setups for the initial processing of raw data. For those who have already participated in the intermediate R course, the objective is to review what you previously learned.
3.3 Reading the course website
We will explain this a bit during the course, but read this to start learning how the website is structured and how to read certain things. Specifically, there are a few formatting conventions in the text of this website to be aware of:
Folder names always end with a /, for example data/ means the data folder.
R variables are always shown as is. For instance, for the code x <- 10, x is a variable because it was assigned the value 10.
Functions always end with (), for instance mean() or read_csv().
Sometimes functions have their package name prepended with :: to indicate that the function should be run from that specific package, since we likely haven’t loaded the package with library(). For instance, to install packages from GitHub using the pak package we use pak::pkg_install("user/packagename"). You’ll learn about this more later.
Reading tasks always start with a statement “Reading task” and are enclosed in a “callout block”. Read within this block. We will usually go over the section again to reinforce the concepts and address any questions.
3.4 Installing the latest programs
The first thing to do is to install these programs. You may already have some of them installed and if you do, please make sure that they are at least the minimum versions listed below. If not, you will need to update them.
Windows
R: Any version above 4.1.2. If you have used R before, you can confirm the version by running R.version.string in the Console.
RStudio: Any version above 2023.06.2. If you have installed it before, check the current version by going to the menu Help -> About RStudio.
Git: Select the “Click here for download” link. Git is used throughout many sessions in the courses. When installing, it will ask you to select a “Text Editor”. While we won’t be using this in the course, Git needs to know this information, so choose Notepad.
Rtools: Version that says “R-release”. Rtools is needed in order to build some R packages. For some computers, installing Rtools can take some time.
macOS
R: Any version above 4.1.2. If you have used R before, you can confirm the version by running R.version.string in the Console. If you use Homebrew, installing R is as easy as opening a Terminal and running:
brew install --cask r
RStudio: Any version above 2023.06.2. If you have installed it before, check the current version by going to the menu Help -> About RStudio. With Homebrew:
brew install --cask rstudio
Git: Git is used throughout many sessions in the courses. With Homebrew:
brew install git
Linux
R: Any version above 4.1.2. If you have used R before, you can confirm the version by running R.version.string in the Console.
sudo apt -y install r-base
RStudio: Any version above 2023.06.2. If you have installed it before, check the current version by going to the menu Help -> About RStudio.
Git: Git is used throughout many sessions in the courses.
sudo apt install git
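If you prefer to check the R version with code rather than by reading R.version.string, the comparison can be sketched as below. This is only an illustration: the variable names are made up for this example, and treating 4.1.2 as an inclusive minimum is an assumption.

```r
# Minimum R version from the list above; treating it as inclusive
# (">=") is an assumption made for this sketch.
minimum_r_version <- "4.1.2"

# getRversion() returns a version object, so the comparison is done
# component by component rather than as plain text.
r_is_recent_enough <- getRversion() >= minimum_r_version

if (r_is_recent_enough) {
  message("R ", as.character(getRversion()), " meets the minimum of ", minimum_r_version, ".")
} else {
  message("R ", as.character(getRversion()), " is older than ", minimum_r_version, ", please update.")
}
```

Comparing version objects instead of strings avoids mistakes such as sorting "4.10.0" before "4.9.0".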
All these programs are required for the course, even Git. Git, which is a software program to formally manage versions of files, is used because of its popularity and the amount of documentation available for it. Check out the online book Happy Git with R, especially the “Why Git” section, for an understanding of why we are teaching Git. Windows users tend to have more trouble with installing Git than macOS or Linux users. See the section on Installing Git for Windows for help.
Note
Some pictures may show a Git pane in RStudio, but you may not see it. If you haven’t created or opened an RStudio R Project (which is taught in the introductory course), the Git pane does not show up. It only shows up in R Projects that use Git to track file changes.
Note
A note to those who have or use work laptops with restrictive administrative privileges: You may encounter problems installing software due to administrative reasons (e.g. you don’t have permission to install things). Even if you have issues installing or updating the latest version of R or RStudio, you will likely be able to continue with the course as long as you have the minimum version listed above for R and for RStudio. If you have versions of R and RStudio that are older than that, you may need to ask your IT department to update your software if you can’t do this yourself. Unfortunately, Git is not commonly used in some organizations, so you may not have it installed and you will need to ask IT to install it. We require it for the course, so please make sure to give IT enough time to install it for you prior to the course.
Once R, RStudio, and Git have been installed, open RStudio. If you encounter any troubles during these pre-course tasks, try as best as you can to complete the task and then let us know about the issues in the pre-course survey of the course. If you continue having problems, indicate on the survey that you need help and we can try to book a quick video call to fix the problem. Otherwise, you can come to the course 15-20 minutes earlier to get help.
If you’re unable to complete the setup procedure due to unfixable technical issues, you can use Posit Cloud (to use RStudio on the cloud) as a final solution in order to participate in the course. For help setting up Posit Cloud for this course, refer to the Posit Cloud setup guide.
3.5 Installing the R packages
We will be using specific R packages for the course, so you will need to install them. A detailed walkthrough for installing the necessary packages is available on the pre-course tasks for installing packages section of the introduction course, however, you only need to install the r3 helper package in order to install all the necessary packages by running these commands in the R Console:
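A minimal sketch of this installation sequence is shown below. It assumes that pak is the installer (as in the pak::pkg_install() example in Section 3.3) and that the r3 package lives in the rostools/r3 repository named in the fallback command in this section. The wrapper function install_course_packages() is hypothetical and exists only so that nothing is installed until you call it.

```r
# Sketch only: the three steps are wrapped in a hypothetical helper so
# that nothing is downloaded or installed until the function is called.
install_course_packages <- function() {
  # 1. Install pak, which is then used to install from GitHub (assumed installer).
  install.packages("pak")
  # 2. Install the r3 helper package (repository name assumed to be rostools/r3).
  pak::pkg_install("rostools/r3")
  # 3. Install all the packages needed for this course.
  r3::install_packages_advanced()
}

# To run the three steps, call the helper in the R Console:
# install_course_packages()
```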
You might encounter an error when running this code. That’s ok. You can usually fix it by restarting R (go to Session -> Restart R) and re-running the code in items 2 and 3. If it still doesn’t work, try running:
remotes::install_github("rostools/r3")
If that also doesn’t work, try to complete the other tasks, complete the survey, and let us know you have a problem in the survey.
Note: When you see a command like something::something(), for example with r3::install_packages_advanced(), you would “read” this as:
R, can you please use the install_packages_advanced() function from the r3 package.
The normal way of doing this would be to load the package with library(r3) and then running the command (install_packages_advanced()). But by using the ::, we tell R to directly use a function from a package, without needing to load the package and all of its other functions too. We use this trick because we only want to use the install_packages_advanced() command from the r3 package and not have to load all the other functions as well. In this course we will be using :: often.
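The same idea can be tried with a package that ships with R, so nothing needs to be installed. The sketch below uses toTitleCase() from the tools package purely as an illustration; it is not something the course itself relies on.

```r
# `tools` ships with R but is usually not attached in a fresh session,
# so its functions are not available by name alone.
"package:tools" %in% search()

# With `::`, R uses one function from the package without attaching it.
title <- tools::toTitleCase("reproducible research")
title

# The alternative: attach the whole package, then call the function by name.
library(tools)
toTitleCase("reproducible research")
```

After library(tools), every exported function of the package is available by name, which is exactly what :: lets you avoid when you only need one function.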
3.6 Setting up Git
Since Git has already been covered in the previous courses, we won’t cover learning it during this course. However, since version control is a fundamental component of any modern data analysis workflow and should be used, we will be using it throughout the course. If you have used or currently use Git and GitHub, you can skip these two tasks. If you have not used it, please do these tasks:
# There will be a pop-up to type in your name (first and
# last), as well as your email
r3::setup_git_config()
Please read through the Version Control lesson of the introduction course. You don’t need to do any of the exercises or activities, but you are welcome to do them if it will help you learn or understand it better. For most of the course, we will be using Git as shown in the Using Git in RStudio section. During the course, we will connect our projects to GitHub, which is described in the Synchronizing with GitHub section.
Regardless of whether you’ve done the steps above or not, everyone needs to run:
Console
r3::check_setup()
The output you’ll get for success will look something like this:
Checking R version:
✔ Your R is at the latest version of 4.4.1!
Checking RStudio version:
✔ Your RStudio is at the latest version of 2024.9.0.375!
Checking Git config settings:
✔ Your Git configuration is all setup!
Git now knows that:
- Your name is 'Luke W. Johnston'
- Your email is 'lwjohnst@gmail.com'
Eventually you will need to copy and paste the output into one of the survey questions. Note that while GitHub is a natural connection to using Git, given the limited time available, we will only be going over aspects of GitHub that relate to storing your project Git repository and working together. If you want to learn more about using GitHub, check out the session on it in the introduction course.
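If you are curious where the name and email in the check come from, the sketch below reads them directly from Git. It assumes they are stored in your global Git configuration under user.name and user.email; read_git_config() is a hypothetical helper written for this example and is not part of r3.

```r
# Locate the Git executable; an empty string means R cannot find Git.
git_path <- Sys.which("git")

# Hypothetical helper: read one value from the global Git configuration.
# Returns NA if Git is not found or the value has not been set.
read_git_config <- function(key) {
  if (!nzchar(git_path)) {
    return(NA_character_)
  }
  value <- suppressWarnings(
    system2("git", c("config", "--global", key), stdout = TRUE, stderr = FALSE)
  )
  if (length(value) == 0) NA_character_ else value[[1]]
}

read_git_config("user.name")
read_git_config("user.email")
```

If either value comes back as NA, the Git configuration step has probably not been completed yet.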
3.7 Connect with GitHub
Because we’ll be pushing and pulling to GitHub throughout the course, as well as using GitHub to collaborate with others in the project work, you need to set up your computer to connect with GitHub. Read through and complete the tasks in the section authenticating with GitHub of the Connect to GitHub Guide.
3.8 Create an R Project
One of the basic steps to reproducibility and modern workflows in data analysis is to keep everything contained in a single location. In RStudio, this is done with R Projects. Please read all of this section from the introduction course to learn about R Projects and how they help keep things self-contained. You don’t need to do any of the exercises or activities.
There are several ways to organise a project folder. We’ll be using the structure from the package prodigenr. The project setup can be done by either:
Using RStudio’s New Project menu item: “File -> New Project -> New Directory”, scroll down to “Scientific Analysis Project using prodigenr” and name the project “AdvancedR3” in the Directory Name, saving it to the “Desktop” with Browse. Note: You might need to restart RStudio if you don’t see this option.
Or, running the command prodigenr::setup_project("~/Desktop/AdvancedR3") (or other location like Documents) in the R Console and manually switching to it using: File -> Open Project and navigating to the project location.
When the RStudio Project opens up again, run these commands in the R Console to finish the setup:
Console
# Add Git to the project
prodigenr::setup_with_git()
# Create a `functions.R` file in the `R/` folder
usethis::use_r("functions", open = FALSE)
# Ignore this file that gets created by some usethis functions
usethis::use_git_ignore(".Rbuildignore")
# Set some project options to start fresh each time
usethis::use_blank_slate("project")
Here we use the usethis package to help set things up. usethis is an extremely useful package for managing R Projects and we highly recommend checking it out more to see how you can use it in your own work.
3.9 R Markdown and Quarto
We teach and use R Markdown/Quarto because it is one of the very first steps to being reproducible and because it is a very powerful tool for doing data analysis. You may have heard of or used R Markdown since we’ve used it in both the introduction and intermediate courses. However, you might not have heard of or used Quarto.
Quarto is a next generation version of R Markdown and chances are, if you use a fairly recent version of RStudio, you are already using it without realizing it. That’s because Quarto uses the same Markdown syntax as R Markdown. The only difference is that with Quarto, you can create more types of output documents (like books, websites, slides), you have more options for customization, and it’s easier to do and learn than R Markdown. So, for this course, we will use Quarto to create a report that includes the analysis we will do over the course.
Please do these two tasks:
Please read over the R Markdown/Quarto section of the introduction course. If you use R Markdown or Quarto already, you can skip this step.
In the R Console while inside the AdvancedR3 project, run the function to create a new Quarto file called learning.qmd in the doc/ folder.
Console
r3::create_qmd_doc()
3.10 Download the course data
We’re going to use a real world dataset to demonstrate the concepts in this course. We’re going to use an openly licensed dataset with metabolomics data (1).
Similar to the intermediate course, we will follow the principle of building a reproducible pipeline, from raw data to the report on the analysis. We’ll also follow the principle of keeping the raw data raw and use code to tidy it up. The ultimate goal is to have a record of exactly how we went from raw data to results in the paper. Unlike the intermediate course where you had to write out each step of the script yourself, in this course you only need to copy and paste the code that will download and minimally process the dataset into an initially workable state. If you want to learn more about how to take a raw data set and process it into a format that is more suitable for analysis, check out the intermediate course.
Inside the data-raw/ folder, we are going to write R code that downloads a dataset, processes it into a more tidy format, and saves it into the data/ folder. This is the start of our analysis pipeline. As a first step, we need to create the script to do these steps. While in your AdvancedR3 R Project, go to the Console pane in RStudio and type out:
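A minimal sketch of this step is shown below. It assumes the function meant here is usethis::use_data_raw(), which creates a data-raw/ folder and a script named after its argument. The guard is only there so that the sketch does nothing outside an interactive session; in the Console you can type just the inner line.

```r
# Name of the script to create; the file ends up as data-raw/nmr-omics.R.
script_name <- "nmr-omics"
script_path <- file.path("data-raw", paste0(script_name, ".R"))

# Assumption: usethis::use_data_raw() is the intended function.
# Only run inside an interactive session with usethis installed.
if (interactive() && requireNamespace("usethis", quietly = TRUE)) {
  usethis::use_data_raw(script_name)
}
```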
What this function does is create a new folder called data-raw/ and creates an R script called nmr-omics.R in that folder. This is where we will store the raw, original metabolomics data that we’ll get from the website. If you go to the website with the dataset, you’ll notice (when you scroll down) that there are three files: A README file, a metabolomics .xlsx file, and a lipidomics .xlsx file. For now, we only want the README file and lipidomics dataset.
The R script should have opened up for you. If it didn’t, go into the data-raw/ folder and open up the new nmr-omics.R script. The first thing to do is delete all the code in the script. Then, copy and paste the code below into the script.
data-raw/nmr-omics.R
# Load necessary packages -------------------------------------------------
library(readxl)
library(dplyr)
library(tidyr)
library(here)

# Download dataset --------------------------------------------------------
# From DOI: 10.5281/zenodo.6597902
# Direct URL: https://zenodo.org/record/6597902

# Get both README and the Lipidomics dataset.
nmr_omics_dir <- here("data-raw/nmr-omics")
fs::dir_create(nmr_omics_dir)

download.file(
  "https://zenodo.org/record/6597902/files/README.txt",
  destfile = here(nmr_omics_dir, "README.txt")
)

download.file(
  "https://zenodo.org/record/6597902/files/NMR_Integration_Data_Lipidomics.xlsx",
  destfile = here(nmr_omics_dir, "lipidomics.xlsx"),
  mode = "wb"
)

# Wrangle dataset into tidy long format -----------------------------------
lipidomics_full <- read_xlsx(
  here(nmr_omics_dir, "lipidomics.xlsx"),
  col_names = paste0("V", 1:40)
)

# There are actually two sets of data in this dataset that we need to split:
# - Lipidomic data
# - Subject level data

# Keep only lipidomic values
lipidomics_only <- lipidomics_full |>
  # Want to remove columns 2, 3, and 4 since they are "limits"
  # (we don't need them for this course)
  select(-2:-4) |>
  # Remove the subject data rows
  slice(-1:-4) |>
  mutate(across(-V1, as.numeric)) |>
  # Make it so the metabolite values are all in one column,
  # which will make it easier to join with the subject data later.
  pivot_longer(-V1) |>
  rename(metabolite = V1)

# Keep only subject data
subject_only <- lipidomics_full |>
  # Remove the first metabolic name and limit columns,
  # don't need for this
  select(-1:-3) |>
  # Keep only the subject data rows
  slice(1:4) |>
  pivot_longer(cols = -V4) |>
  pivot_wider(names_from = V4, values_from = value) |>
  # There is a weird "" before some of the numbers, so we have to
  # extract just the number first before converting to numeric.
  mutate(Age = as.numeric(stringr::str_extract(Age, "\\d+"))) |>
  rename_with(snakecase::to_snake_case)

lipidomics <- full_join(
  subject_only,
  lipidomics_only
) |>
  # Don't need anymore
  select(-name)

# Save to `data/` ---------------------------------------------------------
readr::write_csv(lipidomics, here::here("data/lipidomics.csv"))
Since this is an advanced course, you can run the lines one at a time and see what they do on your own (or source() them all at once). The comments provided give guidance on what the code is doing and why. In the end, though, the only important thing is to run all the code and get the lipidomics dataset to be saved as data/lipidomics.csv. The created file should look like this:
# A tibble: 504 × 6
code gender age class metabolite value
<chr> <chr> <dbl> <chr> <chr> <dbl>
1 ERI109 M 25 CT TMS (interntal standard) 208.
2 ERI109 M 25 CT Cholesterol 19.8
3 ERI109 M 25 CT Lipid CH3- 1 44.1
4 ERI109 M 25 CT Lipid CH3- 2 147.
5 ERI109 M 25 CT Cholesterol 27.2
6 ERI109 M 25 CT Lipid -CH2- 587.
7 ERI109 M 25 CT FA -CH2CH2COO- 31.6
8 ERI109 M 25 CT PUFA 29.0
9 ERI109 M 25 CT Phosphatidylethanolamine 6.78
10 ERI109 M 25 CT Phosphatidycholine 41.7
# ℹ 494 more rows
Take a look through the downloaded data-raw/nmr-omics/README.txt and data-raw/nmr-omics/lipidomics.xlsx files, as well as the created data/lipidomics.csv to begin better understanding the dataset.
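One quick way to look at the created file is sketched below, using only base R. The expected column names are taken from the printed output above; the check itself is just an illustration and not part of the course material.

```r
# Path to the saved dataset, relative to the project root.
csv_path <- file.path("data", "lipidomics.csv")

# Column names as shown in the printed tibble above.
expected_columns <- c("code", "gender", "age", "class", "metabolite", "value")

# Only try to read the file if the script has already created it.
if (file.exists(csv_path)) {
  lipidomics <- read.csv(csv_path)
  # Show the structure: column types and the first few values.
  str(lipidomics)
  # List any expected columns that are missing (ideally none).
  print(setdiff(expected_columns, names(lipidomics)))
} else {
  message("Could not find ", csv_path, ". Has data-raw/nmr-omics.R been run?")
}
```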
We are using Git to track changes made to files in our project. The original metabolomics dataset is stored on Zenodo, so we don’t actually need to keep the raw data files in our Git history. So let’s tell Git to ignore the files created in the data-raw/nmr-omics/ folder. In the Console, type out the code below. You only need to do this once.
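A minimal sketch of this step is shown below. It assumes that usethis::use_git_ignore(), which was used earlier for .Rbuildignore, is the intended function and that the folder path is the pattern to ignore. The guard keeps the sketch from doing anything outside an interactive session; in the Console you can type just the inner line.

```r
# Folder whose contents Git should ignore (the downloaded raw files).
ignore_pattern <- "data-raw/nmr-omics/"

# Assumption: usethis::use_git_ignore() is the intended function.
# It adds the pattern to the project's .gitignore file.
if (interactive() && requireNamespace("usethis", quietly = TRUE)) {
  usethis::use_git_ignore(ignore_pattern)
}
```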
Next, run this command to check that everything is set up correctly. You will need to paste this output into the survey at the end.
Console
r3::check_project_setup_advanced()
Show folders and files of project:
• Please copy and paste this output into the survey question:
/home/luke/Desktop/AdvancedR3
├── AdvancedR3.Rproj
├── DESCRIPTION
├── R
│ ├── README.md
│ └── functions.R
├── README.md
├── TODO.md
├── data
│ ├── README.md
│ └── lipidomics.csv
├── data-raw
│ ├── README.md
│ ├── nmr-omics
│ │ ├── README.txt
│ │ └── lipidomics.xlsx
│ └── nmr-omics.R
└── doc
├── README.md
├── learning.qmd
└── report.Rmd
The output should look a bit like the above text. If it doesn’t, start over by deleting everything in the data-raw/ folder except for the data-raw/nmr-omics.R script and re-running the script. If your output looks a bit like the above, then copy and paste the output into the survey question at the end.
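If the output differs and you are not sure what is missing, a comparison like the one sketched below can help. It uses only base R; find_missing_files() is a hypothetical helper written for this example (it is not part of r3), and the list of files is taken from the example output above.

```r
# Key files from the example output, relative to the project root.
expected_files <- c(
  "R/functions.R",
  "data/lipidomics.csv",
  "data-raw/nmr-omics.R",
  "data-raw/nmr-omics/README.txt",
  "data-raw/nmr-omics/lipidomics.xlsx",
  "doc/learning.qmd"
)

# Hypothetical helper: return the expected files that do not exist
# in the project located at `project_root`.
find_missing_files <- function(project_root = ".", files = expected_files) {
  files[!file.exists(file.path(project_root, files))]
}

# Run from inside the AdvancedR3 project; an empty result means
# all of the listed files were found.
find_missing_files()
```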
3.11 Set some quality of life options
Some of the most common “issues” we encounter in the course when it comes to Git are caused by files not being saved, so the changes can’t be seen by Git. To help with this, we suggest turning on some options in RStudio. While there are many options inside the Global Options in RStudio that can help you work better and faster, the two we want you to use will help you and us out during the course:
Go into Tools -> Global Options -> Code -> Saving.
Under the heading “General”, tick on all of those check boxes.
Under the heading “Auto-save”, tick on both those check boxes.
3.12 Course introduction
Most of the course description is found in the syllabus (Chapter 1). If you haven’t read it, please read it now. Read over what the course will cover, what we expect you to learn at the end of it, and what our basic assumptions are about who you are and what you know. The final pre-course task is to complete a survey that asks if you’ve read it and if it matches you.
We have a Code of Conduct. If you haven’t read it, read it now. The survey at the end will ask about Conduct. We want to make sure this course is a supportive and safe environment for learning, so this Code of Conduct is important.
3.13 Brief pre-reading
An important component of learning is repeated exposure to a concept or skill. So, while we will tell you this information during the course, and in many cases you will also read it during the course, we want to introduce some repetition here by having you read the sections listed below. This will not only mentally prepare you for the content of the course, but will also give you a bigger overview of what you will be learning and doing. So, please read: