15 Dependency management for smoother collaboration
Many of you probably work mostly on your own, but as you move through your career (whether in academia or industry), you will need to (and perhaps want to) directly collaborate¹ much more with others.
¹ Here, “directly collaborate” means contributing to a shared project directly (like editing the same files), rather than discussion- or planning-based collaborations (and definitely not the “emailing-files-around” type of collaboration).
Even though different types of collaboration (e.g. meetings, brainstorming, real-time co-writing) form the basis for almost all research-based work (and probably most non-research-based work), direct collaboration quickly becomes unmanageable when relying on traditional academic “workflows” (like emailing files around). That’s when you’ll need tools designed for collaboration, such as Git and GitHub. But that’s just the starting point; many other factors related to workflows and processes must be considered to collaborate more effectively.
This session focuses on using more automated ways to structure data analysis projects and facilitate smoother collaboration.
15.1 Learning objectives
This session’s overall learning outcome is to:
Identify actions to streamline collaboration on a data analysis project and create projects with R that apply many of these actions.
Specific objectives are to:
Explain what a “project-oriented” workflow is, what project-level R dependency management is, and why these concepts are important for collaborative and reproducible analyses.
Describe the difference between workflow dependencies and build dependencies, and apply functions in the usethis R package to implement these dependency management concepts.
Explain how following a style guide helps build a common approach to reading and writing R code, and thereby improves project-level collaboration.
Use styler and RStudio’s canonical markdown mode to programmatically check and apply style guides to your project files.
Note: This first session is more conceptual and is heavier on the reading and explanation, but is important for the next sessions.
One of the first things to consider when working collaboratively on a data analysis project (and probably other types of projects too) is what software to use for your project. This starts out at the highest level: Are you using R or some other software for the analysis? Since this is an R workshop, we’re assuming the software will be R! 😜
The next consideration is determining which packages your project will depend on to produce the results. When working collaboratively with others (or even with your future self), you’ll need a clear way to track the project’s package dependencies, and, ideally, have a simple and efficient method to install or update them.
Part of the approach covered in this workshop requires that you follow a “project-oriented” workflow when working on, well, a project. To know how to track your project’s package dependencies, you first need to know what a “project” is and how to work with it. Since the introduction workshop’s first session on the Management of R Projects, we’ve consistently taught and used this workflow style. In fact, it is embedded in the use of R Projects via the .Rproj files and in the use of the here package. So, we’re already following a “project-oriented” approach, which will make it easier to track the package dependencies of our project.
Let’s start with the AdvancedR3 project, created in Chapter 5, that includes the lipidomics dataset. We already have code in the data-raw/nmr-omics.R file that uses some packages. Let’s assume that your project will be more complex than this and that you will eventually need collaborators to contribute who are experts in, for instance, metabolomics data processing and statistical analysis of high-dimensional data. You know you will end up needing to use other packages. You also know that you all need some way of tracking which packages are used so that when others join and contribute to the project, they can install or update the packages your data analysis project needs as seamlessly as possible. There are a few ways of “tracking” package dependencies.
The simplest, but most primitive, way is to always use library() at the top of each R script for each package that the R script uses. We can call this “informal” dependency management.
Let’s review the advantages and disadvantages of this form of dependency management:
Advantage:
It’s the easiest to conceptually understand and to use.
Disadvantages:
It doesn’t track project-level dependencies very well, since multiple scripts might use the same packages, while some might use different ones. This means that you can’t easily and quickly install or update all the packages used in your project. You will probably have to go through each R script manually and install each package manually.
It doesn’t track the versions of the packages your project depends on, so if a package gets updated and it breaks something, you might not be able to figure out how to quickly fix that issue, especially for those deadline crunches.
You might have seen a similar “informal” approach where scripts start with code that looks like this:
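A minimal sketch of that pattern follows. The stats package is used here only because it ships with R, so the example runs without installing anything; in a real script it would be whichever package the script needs.

```r
# The "informal" pattern: try to load the package and install it if
# loading fails. require() returns FALSE (with a warning) instead of
# stopping when the package is not installed.
if (!require("stats")) {
  # Only reached when the package is missing. Note that the package is
  # installed here but still not loaded afterwards.
  install.packages("stats")
}
```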
This code checks whether a package exists (i.e., is installed), and if not, it installs it. But! This is not an optimal method to track packages because require() won’t load the package if it doesn’t find it. As a result, you will have to rerun the script a few times if you don’t have that package installed; first, to install the package, then to load it. Plus, sometimes you may need to restart the R session after installing a package in order for R to detect it properly. So, this is not a very efficient way to track dependencies either.
What’s the alternative, then? One alternative is a more formal way of managing your dependencies, which is what we’ll cover in the next section.
Caution: Sticky/hat up!
When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒🎩
15.3 📖 Reading task: Formal dependency management
Time: ~3 minutes.
The most common form of a more formal way of managing dependencies (at least based on R packages and projects found on GitHub) makes use of the DESCRIPTION file and the usethis::use_package() function. We covered this style of dependency management in the intermediate workshop. We will also use this approach during this workshop, but we’ll expand on it a lot more. We use and recommend it for managing dependencies because many tools and workflows rely on it, making it well-integrated with R projects.
Let’s take a look at the advantages and disadvantages of this way of tracking dependencies:
Advantages:
Relatively easy to conceptually understand, since you can directly view the packages used in the project by opening the DESCRIPTION file and looking at the contents.
Because it is widely used, there are many processes already built around making use of tracking dependencies this way. For instance, when creating an R package, you need to track package dependencies this way.
Installing packages is as easy as opening the project and running pak::pak() in the Console, which will install all the packages listed in the DESCRIPTION file.
Adding packages is as easy as writing usethis::use_package("packagename") in the Console.
Disadvantages:
Like the previous method, it doesn’t automatically keep track of the versions of the packages you are using. Even though it is possible to add a specific version to the DESCRIPTION file, pak will install the newest version of the package.
Your project might still rely on a package that is installed on your computer but that isn’t obviously a dependency, so you forget to include it. In other words, you are still responsible for adding the project’s dependencies to the DESCRIPTION file.
Tip
There is an even more formal and structured way of tracking dependencies using the renv package, which we covered in previous versions of this workshop. It is a great package to use, especially if you are working on a project with several others or a project that requires tracking of not just the packages but their specific versions. We don’t teach it anymore because it is quite complex and difficult to understand. Moreover, for the majority of projects, it’s not strictly necessary.
As we work on the project during the workshop and realize we need to use a specific package, we will continue using usethis::use_package() to add it to the DESCRIPTION file (installing the package first if it isn’t installed yet).
Caution: Sticky/hat up!
When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒🎩
15.4 🧑‍💻 Exercise: Add packages from the data processing script
Time: ~10 minutes.
Before continuing with the exercise, be sure to add and commit all the files in the AdvancedR3 project to save the current state of the files to the Git history. Open the Git interface by pressing Ctrl-Alt-M, with the Palette (Ctrl-Shift-P, then type “commit”), or by going to the Git pane and clicking the “Commit” button. Then, write a commit message and click the “Commit” button in this window.
Let’s update the DESCRIPTION file with the packages we depend on in the data-raw/nmr-omics.R script. Open the nmr-omics.R file and complete the following tasks:
1. Manually look for package dependencies in the R script that are declared with library() or ::. It can help to use the “Find in files” feature in RStudio to look for all places that have either :: or library. To open the search popup, use Edit -> Find in Files ..., Ctrl-Shift-F, or the Palette (Ctrl-Shift-P, then type “find files”).
2. In the Console, run usethis::use_package("packagename") for each package you find in data-raw/nmr-omics.R (from 1. above).
3. Once done, open the Git interface with Ctrl-Alt-M, with the Palette (Ctrl-Shift-P, then type “commit”), or by going to the Git pane and clicking the “Commit” button. What has been changed? Commit those changes to the Git history with a descriptive message.
Click for the solution. Only click if you are really struggling or are out of time for the exercise.
Make sure that everyone has added the right packages, since it can be easy to miss some of the packages referenced using ::.
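Because references made with :: are easy to miss, a small base R check can complement the manual search. The sketch below is only an illustration: find_script_packages() is a hypothetical helper, the regular expressions are deliberately simple (everything after a # on a line is ignored), and the example script lines are made up.

```r
# Hypothetical helper: list the package names that a script refers to,
# either through library() calls or through the pkg::fun() notation.
find_script_packages <- function(lines) {
  # Drop comments so that commented-out code is ignored. This is naive:
  # a "#" inside a string is also treated as the start of a comment.
  code <- sub("#.*$", "", lines)

  # Package names inside library(pkg) or library("pkg").
  lib_calls <- regmatches(
    code,
    gregexpr("library\\(\\s*[\"']?[A-Za-z][A-Za-z0-9.]*", code, perl = TRUE)
  )
  lib_pkgs <- sub("library\\(\\s*[\"']?", "", unlist(lib_calls), perl = TRUE)

  # Package names directly in front of `::` (or `:::`).
  ns_calls <- regmatches(
    code,
    gregexpr("[A-Za-z][A-Za-z0-9.]*(?=:::?)", code, perl = TRUE)
  )
  ns_pkgs <- unlist(ns_calls)

  sort(unique(c(lib_pkgs, ns_pkgs)))
}

# Made-up script lines, standing in for the contents of a real file.
example_script <- c(
  "library(dplyr)",
  "library(\"here\")",
  "raw <- readr::read_csv(here::here(\"data-raw/example.csv\"))",
  "# tidyr::pivot_longer() is commented out, so it is ignored"
)

found_packages <- find_script_packages(example_script)
print(found_packages)
```

For a real script, the lines could be read with readLines("data-raw/nmr-omics.R") and passed to the helper. The result can then be compared against what is listed in the DESCRIPTION file.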
Caution: Sticky/hat up!
When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒🎩
15.5 Two types of dependencies: Build and workflow
Note: 🧑‍🏫 Instructor note
Verbally explain the content of the next few paragraphs (don’t need to show the website on the projector).
When you work on a research project that involves data analysis, you’ll likely use packages in two different ways:
Packages that directly contribute to data wrangling, analysis, plotting, and making the manuscript. These types of packages are generally called “build” or “deploy” dependencies. Packages like dplyr or tidyr are usually build dependencies, since you use them for processing data.
Packages that assist you in doing your work but aren’t directly used for data analysis. These types of packages would be called “workflow” or “development” dependencies. usethis is usually considered a workflow dependency.
Tip: Is this package a build or workflow dependency?
A good rule of thumb to determine whether a package is a build or a workflow dependency is:
If you write and use functions from the package within an R script that does something to the data or analysis, then it is likely a build dependency.
If you only ever use functions from the package in the Console, then it is probably a workflow dependency.
So, now we know that the dependencies we added in the exercise are actually build dependencies, since we use them directly in our project to process the data.
The way you add packages to the DESCRIPTION file is slightly different depending on what type of dependency it is. For build dependencies, we use the function we’ve already used before: usethis::use_package("packagename"). For workflow dependencies, it’s the same function with a small twist. Our first workflow dependency is usethis, which we will use many times throughout the workshop.
But before we get into adding usethis to the DESCRIPTION file, let’s tweak our personal (also called the global) .Rprofile to make our lives a bit easier. We’ll add a setting so we don’t have to type usethis:: before each function.
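A minimal sketch of such a setting is shown below. It assumes that the global .Rprofile is opened with usethis::edit_r_profile() and that loading usethis in interactive sessions is the intended tweak; the exact lines used in the workshop may differ.

```r
# Step 1 (run in the Console): open the global (user-level) .Rprofile.
# usethis::edit_r_profile()

# Step 2: add these lines to the .Rprofile and save the file.
# usethis is loaded only in interactive sessions, so scripts that are
# run non-interactively are not affected by this setting.
if (interactive()) {
  suppressMessages(require(usethis))
}
```

The setting takes effect the next time R starts, which is why a restart is needed afterwards.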
Let’s restart R with Ctrl-Shift-F10 or with the Palette (Ctrl-Shift-P, then type “restart”). Now, we’re ready to use use_package() to add usethis as a workflow dependency. The only difference between adding a build and a workflow dependency is that for workflow dependencies, we need to add a second argument to use_package() to specify that it is a workflow dependency. The second argument is "suggests", like so:
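A sketch of that call, written with the usethis:: prefix so that it also works when usethis has not been loaded through the .Rprofile:

```r
# Run in the Console from inside the project.
# Adds usethis to the "Suggests" field of the project's DESCRIPTION file.
# The usethis:: prefix can be dropped when usethis is already loaded.
usethis::use_package("usethis", "suggests")
```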
If you open the help file for use_package() you’ll see that the second argument we used above is called type, and “Imports” is its default value. So, if you don’t specify the second argument, the package will be added under “Imports” in the DESCRIPTION file. This is exactly what we did in the exercise in Section 15.4!
So, “imports” corresponds to build dependencies, while “suggests” are equivalent to what we call workflow dependencies.
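To make this mapping concrete, the sketch below reads both fields back from a DESCRIPTION file using only base R. The file contents are made up and written to a temporary location, and list_dependencies() is a hypothetical helper, not a function from usethis or pak.

```r
# Write a small, made-up DESCRIPTION file to a temporary location.
# The package names are examples only.
desc_path <- file.path(tempdir(), "DESCRIPTION")
writeLines(
  c(
    "Package: AdvancedR3",
    "Title: Example Project",
    "Version: 0.0.1",
    "Imports:",
    "    dplyr,",
    "    here",
    "Suggests:",
    "    styler,",
    "    usethis"
  ),
  desc_path
)

# Hypothetical helper: return the package names listed in one field.
list_dependencies <- function(path, field) {
  value <- read.dcf(path, fields = field)[1, field]
  if (is.na(value)) {
    # The field does not exist in the file.
    return(character(0))
  }
  # Split the comma-separated list and drop any version requirement,
  # e.g. "dplyr (>= 1.1.0)" becomes "dplyr".
  pkgs <- trimws(strsplit(value, ",")[[1]])
  sub("[[:space:]]*\\(.*\\)$", "", pkgs)
}

build_deps <- list_dependencies(desc_path, "Imports")
workflow_deps <- list_dependencies(desc_path, "Suggests")
print(build_deps)
print(workflow_deps)
```

Here, build_deps holds the packages from the Imports field and workflow_deps holds those from the Suggests field.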
Open the Git interface and see that usethis is now listed under Suggests: in the DESCRIPTION file. Let’s commit these changes, with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).
You’ve used styler during the pre-workshop tasks and we’ll use it more throughout the workshop. Which type of dependency do you think this is? It’s a workflow dependency, as we will most likely only use it in the Console or Command Palette. Let’s add it to the project DESCRIPTION as a workflow dependency by running use_package("styler", "suggests") in the Console.
For the information block below, mention it to the learners but you don’t need to go over it. Especially mention the second part of the tip.
Tip
When you come back to a project after a few months or if you start collaborating on a project, you can run pak::pak() to install all types of dependencies, both workflow and build. Neat! 🎉
Before we move on, let’s make sure everything has been committed to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).
15.6 🧑‍💻 Exercise: Connect your project to GitHub
Time: ~25 minutes.
Since we will practice working with GitHub later in the workshop, we’ll connect our project’s Git repository to GitHub right now.
Remember that GitHub is a platform for sharing your projects in a way that is transparent (if you add all the relevant files and keep a clean history of your files) and makes it easy to collaborate with others. It’s a great way to share your project with others, get feedback, and even contribute to other projects.
Let’s complete these tasks to connect to GitHub.
First, commit the latest changes to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).
If you haven’t yet, please create a GitHub account.
Add the gitcreds package as a workflow dependency. You’ll need it for the next item.
Check your GitHub to make sure the project repository has been uploaded to it.
Caution: Sticky/hat up!
When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒🎩
15.7 Styling Markdown files
For multi-person collaborative projects, having some type of code styler and checker can really help with standardizing how the code looks, which ultimately makes it easier to read each other’s code contributions.
But what about Markdown files? While there isn’t a package or function (yet) that styles Markdown files, RStudio does have an option under Tools to format Markdown into a “canonical form”. This option exists because RStudio added a “visual editor mode” for writing R Markdown / Quarto files (which is great if you are more comfortable with apps like Word). Let’s set up the project to automatically reformat Markdown files.
Warning
Use this option only if you have your project under Git version control, since it will directly modify and overwrite the contents of the entire file.
There are two ways of doing it:
1. Go into Tools -> Project Options -> R Markdown and change the options “Automatic text wrapping” to “column” (with the default 72 width value) and “Write canonical visual mode markdown” to “true”.
2. Or set the YAML options in either the project-level _quarto.yml file or at the file-level in the YAML header.
For right now, we will do the project option settings so that as long as we are using RStudio and in the R Project, it will automatically reformat the Markdown files as you write in them. Follow the instructions in item 1 above to set the options.
Now, when you save your file, RStudio should automatically reformat the Markdown into a standardized format. If you want to switch to using the Visual Mode, use Ctrl-Shift-F4 or the “Visual” button at the top of the Source Pane beside the bolding and italicizing buttons.
The instructors won’t be using the Visual Mode during the workshop, but you are welcome to do so. We will, however, be using the “canonical” Markdown mode in the “source” (default) mode.
Let’s test it out. While in the doc/learning.qmd file, go to the bottom of the file and type out:
    ## This is poorly formatted
    - Definitely should have an empty space above this list.
    - This isn't a list, why not?

Save the file. What happens? Lists in Markdown need to have an empty line above them to work properly (except when they are directly below a header; in all other cases they need an empty line above). With the canonical mode on, we get feedback right away that it isn’t right. It gets automatically fixed by adding that empty line.

    ## This is poorly formatted

    - Definitely should have an empty space above this list.
    - This isn't a list, why not?
Since this mode is on automatically now, as we work in the doc/learning.qmd file through the sessions, we’ll get lots of experience using it.
15.8 🧑‍💻 Exercise: Update the README file while using canonical Markdown mode
Time: ~10 minutes.
Open up the README.md file and do all the TODO items found inside the file as well as inside the TODO.md file. Save often and watch as the Markdown gets reformatted.
After you are done, commit the changes you made to the Git history with Ctrl-Alt-M or with the Palette (Ctrl-Shift-P, then type “commit”).
Then delete the TODO.md file and commit this deletion to the Git history. Click the “Push” button to push the changes to GitHub.
Caution: Sticky/hat up!
When you’re ready to continue, place the sticky/paper hat on your computer to indicate this to the teacher 👒🎩
15.9 Summary
In this chapter, we have covered how to do the following in a project for smoother collaboration with formal dependency management:
Track your project package dependencies with the DESCRIPTION file and usethis.
Install the dependencies necessary for your project with pak::pak().
Use RStudio’s canonical markdown mode to reformat Markdown into a standard format.
15.10 Code used in session
This lists some, but not all, of the code used in the section. Some code is incorporated into Markdown content, so is harder to automatically list here in a code chunk. The code below also includes the code from the exercises.