Many of you probably work largely and most consistently on your own, but as your move through your career (in academia or industry), you will need to and maybe also want to directly collaborate1 a lot more with others. Different types of collaboration (e.g. meetings, brainstorming, real-time co-writing) form the basis for almost all research-based work and probably most non-research-based work.
1 Collaborate here meaning directly contributing to a shared project, rather than discussed or planning based collaborations (and definitely not emailing-files-around collaboration).
More direct collaboration on a project quickly becomes unmanageable when using traditional academic “workflows” (emailing around). That’s when you need to start using tools designed for collaboration, like Git. But Git is just the starting point. There are many many other things to consider for workflows and processes to effectively collaborate with others. This session is about making use of more automated ways of structuring data analysis projects to ease collaboration.
5.1 Learning objectives
The overall objective for this session is to:
Identify potential actions to streamline collaboration on a data analysis project and create projects that apply many of these actions using R.
More specific objectives are to:
Explain what a “project-oriented” workflow is, what a project-level R dependency management is, and why these concepts are important to consider in collaborative and reproducible analyses.
Describe the difference between “workflow dependencies” and “build dependencies”.
Apply functions in the usethis R package to implement these dependency management concepts.
Explain the role that following a style guide has on building a common approach to reading (and writing) R code, and thus improve project-level collaboration.
Use styler and RStudio’s canonical markdown mode to programmatically check and apply style guides to your project files.
5.2 Exercise: How do you exactly collaborate or contribute? To your own or others’ projects.
Time: ~10 minutes.
When you work on a project (for your thesis or a manuscript), how exactly do you and your collaborators contribute to the project? Is it mostly verbal contributions? Do you use a shared folder that the files are on? How do you keep track of who’s changed what? Do you mostly work on your own and contributions are largely verbal or written feedback (like in a meeting or through an email)? If you work directly on a project, how do you coordinate things? Does one collaborator work on one section or analysis, so your files are separate? Do you ever have to go in and contribute your own code to theirs (and vice versa)?
Take about 1 minute to think on these questions.
For 6 minutes, discuss these questions with your neighbour, and talk about your own experiences.
For the remaining time, we will share briefly with everyone.
5.3 Project-level R dependency management
Note: This first session is more conceptual and is heavier on the reading and explanation, but is important for the next sessions.
Reading task: ~8 minutes
One of the first things to consider when working collaboratively on a data analysis project (and probably other types of projects too) is what software to use for your project. This starts out at the highest level: Are you using R or some other software for the analysis? Since this is an R course, we’re assuming the software will be R! 😜
The next consideration is which packages your project depends on to produce the results. When working collaboratively with others, and yourself several months in the future, you need some way of knowing how to easily and quickly install or update these package dependencies.
Part of this approach requires that you follow a “project-oriented” workflow when working on, well, your project. In order to know how to track your project’s package dependencies, you need to first know, what is a “project” and how do we work around it? Since the introduction course’s first session on the Management of R Projects, we’ve consistently taught and used this workflow-style. In fact, it is embedded into the use of the R Projects via the .Rproj files and in the use of the here package. So we already are following this approach from the start, which will make it easier to track package dependencies of our project.
Let’s start with the AdvancedR3 project that uses the lipidomics. We have code in the data-raw/nmr-omics.R file that uses some packages. Let’s assume that your project will be more complex than this and that you will eventually need some collaborators to contribute who are experts in for instance metabolomics data processing and in statistical analysis of high-dimensional data. You know you will end up needing to use other packages. You also know that you all need some way of tracking which packages are used so that when others join and contribute to the project, they can as seamlessly as possible install or update the packages your data analysis project needs. There are a few ways of “tracking” package dependencies.
The simplest, but most primitive way is to always make sure to use library() at the top of each R script for each package that the R script uses.
Advantage:
This is the easiest to conceptually understand and to use.
Disadvantages:
It doesn’t track project-level dependencies very well, since multiple scripts probably use similar packages across them. Which means you can’t easily and quickly install or update all the packages your project uses, since you will probably have to go through each R script manually and install each package manually. You might have seen some scripts with code that looks like this at the top:
This code checks if a package exists, if not, it installs it. But! This is not an optimal method to track packages because require() won’t load the package if it doesn’t find it. Which means you would have to re-run the script probably a few times. Plus, sometimes you may need to restart the R session after installing a package in order for R to detect it properly.
It doesn’t track the versions of the packages your project depends on, so if a package gets updated and it breaks something, you might not be able to figure out how to quickly fix that issue, especially for those deadline crunches.
5.4 Formal dependency management
Reading task: ~3 minutes
The most common form, at least based on R packages and projects found on GitHub, of more formal dependency management makes use of the DESCRIPTION file and usethis::use_package() to track if a package is used for a project or not. We covered this style of dependency in the intermediate course. We will also use this approach during this course, but expand a lot more on it. We use and recommend it because a lot of tools and workflows exist that make use of it, so it is well integrated in working on R projects.
Advantages:
Relatively easy to conceptually understand, since you can directly view the packages your project needs by opening the DESCRIPTION file and looking at the contents.
Because it is widely used, there are many processes already built around making use of tracking dependencies this way. For instance, you need to track package dependencies when creating R packages.
Installing packages is as easy as opening the project and running pak::pak() in the Console, which will install all the packages listed in the DESCRIPTION file.
Adding packages that you need is as easy as writing usethis::use_package("packagename") in the Console.
Disadvantages:
Like the previous method, it doesn’t easily keep track of the versions of the packages you are using.
Your project might still rely on a package that is installed on your computer and that influences your project, but that might not be obvious as a dependency or that you forgot to include.
There is an even more formal and structured way of tracking dependencies using the renv package, that we originally covered in previous versions of this course but we won’t be covering anymore. It is a great package to use, especially if you are working on a project with several others or that require tracking not just packages but their specific versions. We don’t teach it anymore because it is quite complex, difficult to understand, and for the majority of projects is never needed.
As we work on the project and realize we need to use a specific package, we will continue using usethis::use_package() to install it and add it to the DESCRIPTION file. That also allows us to easily install all the packages our project depends on by using pak::pak().
Before continuing to the exercise, we need to make sure to add and comment all the files from the project into the Git history. Open the Git interface by either typing with Ctrl-Alt-MCtrl-Alt-M or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “commit”) or by going to the Git pane and clicking the “Commit” button.
5.5 Exercise: Add packages from the data processing script
Time: ~10 minutes.
Since the DESCRIPTION file will be used later on for the more formal dependency management, let’s get it updated with the packages we are using in the data-raw/nmr-omics.R script. Open that file and complete these tasks:
Manually look for package dependencies in the R script that are declared with library() and ::. It can help to use the “Find in files” feature in RStudio to look for all places that have either :: or library. Use either Edit -> Find in Files ..., or with Ctrl-Shift-FCtrl-Shift-F or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “find files”) to open up the search popup.
In the Console, run usethis::use_package() for each package you find in data-raw/nmr-omics.R (from 1. above).
Since we are also using the usethis package to manage these things, run usethis::use_package("usethis") to also add it as a package dependency.
Once done, open the Git interface with Ctrl-Alt-MCtrl-Alt-M or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “commit”) (or going to the Git Pane and clicking the “Commit” button). What has been changed? Commit those changes to the Git history .
Click for the solution. Only click if you are really struggling or are out of time for the exercise.
Make sure that everyone has added the right packages, since it can be easy to miss some of the packages referenced using ::.
5.6 Two types of dependencies
Instructor note
Verbally explain the content of the next few paragraphs (don’t need to show the website on the projector).
When you work on a research project that involves data analysis, you likely use packages in two different ways:
Packages that directly contribute to data wrangling, analysis, plotting, and making the manuscript. These types of packages are generally called “build” or “deploy” dependencies. A package like dplyr or tidyr would be build dependencies, since you use them for processing data.
Packages that assist you in doing your work but aren’t directly used for data analysis. These types of packages would be called “workflow” or “development” dependencies. usethis would be considered a workflow dependency.
A good way to determine if a package is a build dependency for your project is by seeing if you write and use functions from the package within an R script that does something to the data or analysis. If you only ever use functions from the package in the Console, than it is likely a workflow dependency.
The way you add these packages is different depending on the type it is. For build dependencies, we use the function we’ve already used before: usethis::use_package("packagename"). For workflow dependencies, it’s the same function, but with a small difference. BUT! Before we cover it, let’s add a setting to our .Rprofile to make our life a bit easier. We will be using usethis functions many times throughout the course, so a simple quality-of-life fix is to make it so we don’t always have to do usethis::. Thankfully, there is a function that can help. Run the usethis::use_usethis() function to open your own (user) .Rprofile and add some code into it.
Let’s restart R with Ctrl-Shift-F10Ctrl-Shift-F10 or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “restart”) before using use_package() to add usethis as a workflow dependency.
Open the Git interface and see that under Suggests: in the DESCRIPTION file is usethis. Let’s commit these changes, with Ctrl-Alt-MCtrl-Alt-M or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “commit”).
Instructor note
For the information block below, mention it to the learners but you don’t need to go over it. Especially mention the second part of the tip.
Tip
When you come back to a project after a few months or if you start collaborating on a project, you can run pak::pak() and it should install all types of dependencies, both workflow and build.
5.7 Exercise: Connect your project to GitHub
Time: ~25 minutes.
Since we will eventually connect our project Git repository to GitHub to practice the workflow, we’ll connect our project to GitHub right now.
Let’s complete these tasks to connect to GitHub.
First, commit the latest changes to the Git history with Ctrl-Alt-MCtrl-Alt-M or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “commit”).
If you haven’t yet, please create a GitHub account.
Add the gitcreds package as a workflow dependency. You’ll need it for the next item
Check your GitHub to make sure the project repository has been uploaded to it.
5.8 Automatically adhere to a style guide
Reading task: ~5 minutes
We’ve covered how to adhere to a style guide in both the introductory as well as the intermediate course, because it is such a useful and powerful tool to easily write more readable code. We’ll expand on it in this course because it fits precisely with the theme of collaboration. That’s because, when you’re working on your own and not needing to worry about anyone seeing your code, there’s a natural temptation to write your code like you might write notes to yourself… scribbled and scrawled down quickly. But when working with others, how it looks can greatly impact how quickly and easily others are able to read and interpret your code. Multiply all the collaborators on a project with this natural temptation for quick (and sloppy) coding, you can imagine how easy it is for code to massively “drift” towards being poorly formatted, especially when deadlines are close.
That’s when “linters” or “stylers” (types of “static code analysis” tools) become very useful. They will scan your code for common mistakes or syntax problems and either list them out for you to fix or fix them for you automatically. Linters are great when you are collaborating on a project with collaborators who are not as experienced in writing code or who only occasionally contribute so don’t know the workflow culture of your project. In this way, you might want to have automatic linting/styling checks that are independent of you having to run them yourself. This is where the styler package comes in!
Since we will use it for the project as a workflow dependency, let’s add it to the DESCRIPTION file.
There are only a few functions in styler that we need to use. The first is to style a single file by using styler::style_file(). However, an easier function is the “style active file” RStudio addin. Usually we would only need to style the file that we are actually working on, We can do that through the Command Palette (Ctrl-Shift-PCtrl-Shift-P) and typing “style file”, which should show the “Style active file” option. This is what we will do frequently throughout the course.
Let’s try it out. While inside the data-raw/nmr-omics.R file, use the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “style file”) to style the file. There won’t be any changes since the file is already tidy.
Note
You will probably be asked to install a package called {miniUI}, click “Yes”.
If you wanted to run styler on all the files, we can use:
The thing to note, though, is that styler isn’t perfect, so you might sometimes have to manually fix run the reformatting (with the Command Palette Ctrl-Shift-PCtrl-Shift-P then “reformat”).
Instructor note
Mention the callout block below, but don’t go into it at all.
Tip
You might be used to using 4 spaces for tabs instead of 2. The tidyverse style uses 2, so the default option in styler
For multi-person collaborative projects, having some type of code styling and checker can really help with standardizing how the code looks, which ultimately will make it easier to read each other’s code contributions.
But what about for Markdown files? While there isn’t a package or function (yet) that styles the Markdown files, RStudio does have an option in their Tools to format Markdown into a “canonical form”. The reason for this option is because they added a “visual editor mode” to writing R Markdown files (which is great if you are more comfortable with apps like Word). Let’s test out this option. First, let’s make sure everything has been committed to the Git history with Ctrl-Alt-MCtrl-Alt-M or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “commit”).
Warning
Use this option only if you have your project under Git version control, since it will directly modify and overwrite the contents of the entire file.
There are two ways of doing this:
Going into Tools -> Project Options -> R Markdown and changing the options “Automatic text wrapping” to “column” (with the default 72 width value) and “Write canonical visual mode markdown” to “true”.
Or setting YAML options in either the project-level _quarto.yml file or at the file-level in the YAML header.
For right now, we will do the project option settings so that as long as we are using RStudio and in the R Project, it will automatically reformat the Markdown files as you write in them. Follow the instructions in item 1 above to set the options.
Now, when you save your file, RStudio should automatically reformat the Markdown into a standardized format. If you want to switch to using the Visual Mode, use Ctrl-Shift-F4Ctrl-Shift-F4 or the “Visual” button at the top of the Source Pane beside the bolding and italicizing buttons.
The instructors won’t be using the Visual Mode during the course, however you are welcome to. We will be using the “canonical” markdown mode though.
Let’s test it out. While in the doc/learning.qmd file, go to the bottom of the file and type out:
## This is poorly formatted- Definitely should have an empty space above this list.- This isn't a list, why not?
Save the file. What happens? Lists in Markdown need to have an empty space above them to work properly (except for when below a header, but in all other cases it needs a space above). With this canonical mode on, we can get feedback right away that it isn’t right. It gets automatically fixed by adding that empty space.
## This is poorly formatted- Definitely should have an empty space above this list.- This isn't a list, why not?
Since this mode is on automatically in this project, as we work in the doc/learning.qmd file through the sessions, we’ll get lots of experience using it.
5.10 Exercise: Update the README file, while using canonical markdown mode
Time: ~10 minutes.
Open up the README.md file and start completing the TODO items. Save often and watch as the Markdown gets reformatted. After you are done, commit the changes you made to the Git history with Ctrl-Alt-MCtrl-Alt-M or with the Palette (Ctrl-Shift-PCtrl-Shift-P, then type “commit”). Then delete the TODO.md, followed by committing these deletions in the Git history. Click the “Push” button to push the changes to GitHub.
5.11 Summary
Track your project package dependencies with the DESCRIPTION file and usethis and combine it with options() to automatically make snapshots so you can use the use_package() function.
Install the necessary dependencies with pak::pak().
Follow a style guide by using styler. Combine with the Command Palette (Ctrl-Shift-PCtrl-Shift-P) to quickly run their functions on code you are actively working on.
Use RStudio’s canonical markdown mode to reformat Markdown into a standard format.