4 Introduction to course
The slides contain speaking notes that you can view by pressing ‘S’ on the keyboard.
4.1 The Big Picture
This section provides a bigger picture view of what we will be doing, why we want to do it, how we will be going about doing it, and what it will look like in the end.
Our big picture aim is:
We want to create a data analysis project that:
- Makes it easier for direct contributions from collaborators and others
- Allows us to write processing and analysis steps (as code) in a more explicit and ultimately reproducible way
- Incorporates general-purpose tools that simplify using (or switching) statistical analysis methods
All of which will be exemplified through a simple analysis of a lipidomics study.
How will it look like in the end, in a more “tangible” way? The most tangible thing are the folders and files on our computers. The folder and file structures below show where we start and where we end, so you can hopefully get a better understanding of how things look. Right now, everyone’s initial project structure should look like:
LearnR3
├── data/
│ ├── lipidomics.csv
│ └── README.md
├── data-raw/
│ ├── README.md
│ ├── nmr-omics/
│ │ ├── lipidomics.xlsx
│ │ └── README.txt
│ └── nmr-omics.R
├── doc/
│ ├── README.md
│ ├── learning.qmd
│ └── report.Rmd
├── R/
│ ├── functions.R
│ └── README.md
├── .gitignore
├── DESCRIPTION
├── LearnR3.Rproj
├── README.md
└── TODO.md
Why do we structure it this way?
- To follow “project-oriented” workflows (covered in Chapter 5).
- To follow some standard conventions in R, like having a
DESCRIPTION
file (which is important for Chapter 5). - To keep types of files separate, like raw data raw and in the
data-raw/
folder, R scripts/functions in theR/
folder, and documents like R Markdown / Quarto files indoc/
.
This also supports our workflow and processes, which will be something like:
- Install packages in a project-specific environment to track package dependencies.
- Follow “function-oriented” workflows, where we use R Markdown / Quarto (
doc/learning.qmd
) to write and test out code, convert it into a function, test it, and then move it intoR/functions.R
.- Developing functions in the R Markdown document makes it a bit easier to quickly test code out and make sure it works before moving the code over into a more formal location and structure. Think of using the R Markdown file as a sandbox to test out and play with code, without fear of messing things up.
- We also test code out in the R Markdown from a teaching and learning perspective because it’s easier to show and weave in text and comments with code as we do the code-alongs. It also forces you to practice working within R Markdown documents, which are key components to a reproducible workflow in R.
- We keep the functions in a separate file because we will frequently
source()
from it as we prototype and test out code in the R Markdown file. It also creates a clear separation between “finalized” code and prototyping code.
- Use a combination of restarting R with Ctrl-Shift-F10Ctrl-Shift-F10 or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “restart”) (or
Session -> Restart R
) and usingsource()
(Ctrl-Shift-SCtrl-Shift-S or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “source”) while inR/functions.R
) to run the functions inside ofR/functions.R
.- We restart R because it is the only certain way to ensure that the R workspace is completely clear. For reproducibility, we should always aim to work from a “clean plate”.
- Keeping code readable by frequently (automatically) fixing the formatting/styling of our code.
- For each “output” in a paper (like a figure or table), write one or more functions to complete the output and include those functions as “targets” or steps in a pipeline. Use this explicit pipeline to track which steps need to be re-run and in which order for your data analysis.
- Write accompanying text (or full paper) for the analysis in Markdown so we can easily and quickly regenerate reports for rapid dissemination.
- Automatically reformat Markdown text into a standard, more readable format.
- Whenever we complete a task, we add and commit those file changes and save them into the Git history with Ctrl-Alt-MCtrl-Alt-M or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “commit”).
- We use Git because it is the best way of keeping track of what was done to your files, when, and why. It keeps your work transparent and makes it easier to share your code by uploading to GitHub. Using version control should be a standard practice to doing better science since it fits with the philosophy of doing science.
Many of these “project setup” tasks can take a while to set up, can often be very difficult and confusing. This is before you’ve even gotten to the analysis phase of your work. A good analogy for this first step is when skyscrapers are built. Watching construction on these projects makes it feel like it takes forever for them to finally start going up and adding floors. But once they start adding floors, it goes up so fast! That’s because a lot of the main work is in building up the foundation of the building, so that it is strong and stable. This is the same with analysis projects, the first phase feels like nothing is “moving” but you are building the foundation to everything that comes after.
If you finish exercises faster than others in the course, try your hand at working through the Extra Exercises section. It is still a work-in-progress, so it is a bit small right now.