23  Team project work

One of the best ways to learn is to apply what you’ve learned in a practical way, such as by working on a real-world project with other learners. This is most useful when it’s done shortly after learning the material.

This project work is meant to reinforce what you learned during the workshop, but in a more relaxed and collaborative setting. The instructors and helpers are available to answer questions, help you out, and give guidance as needed.

For this project, as a team you will do a simple data analysis on a dataset (see below), use Git and GitHub to collaborate, write an HTML document as a report, and create a targets pipeline to ensure the analysis is reproducible. You will work on this project during the last session of the workshop.

At the end of the session, the lead instructor will do a brief “test” of every team’s analysis by downloading your project from GitHub and running the targets pipeline to see if it works correctly and produces the intended output.

23.1 Specific tasks

You will be collaborating as a team using Git and GitHub to complete this project. We will set up the project with Git and GitHub for you so you can quickly start collaborating on it. You will be pushing and pulling a lot of content, so you will need to maintain regular and open communication with your team members.

Important

Before you start doing the tasks, read through this whole page first, especially the callout blocks later on, so you are fully aware of what you need to do.

Your specific tasks are:

  1. Get together with your team. We will already have created a repository for you to use, which we will show you how to find during the workshop. Clone your team repository to your computer using usethis::create_from_github() (see the sketches after this list). You will likely also need to authenticate with GitHub using gitcreds::gitcreds_set().

  2. Choose a dataset: As a team, decide on and select one of the datasets listed below to analyse. You can use the same dataset as another team, but you must use a different dataset than the one we used in class. These datasets were specifically chosen because they have a lot of variables, are relatively tidy, and are not too large.

  3. Only one person should do this task. As a team, decide who will do it. This team member should move the dataset file you chose into the data/ folder. Commit the changes to Git and push to GitHub.

    • Usually we would get you to store it in data-raw/ and do some processing there, but since we’ve chosen these datasets specifically because they are already fairly tidy, you will store the data directly in the data/ folder.
    • Generally we don’t recommend storing data in the Git history, especially if it is large. However, for this workshop and because the datasets we’ve chosen are relatively small, you can track it with Git to make later steps and collaboration easier.
  4. While that team member is doing the above task, the other team members can do additional setup tasks, like adding files to ignore to the .gitignore.

  5. All team members pull from GitHub.

  6. Briefly look through the downloaded dataset by opening it in a spreadsheet program like Excel. Read the README or other description on the Zenodo record to familiarize yourself with the data.

  7. As a team, brainstorm, design, and decide on:

    • What research questions you might be able to answer with the dataset. Make sure to have questions that allow each team member to contribute equally to the analysis.

    • What models could answer the questions you asked. To keep it simple, start with a simple model of y ~ x and do that first. Only if you have time later, try adding a more complex model of y ~ x + xc, where xc is a variable you want to adjust or control for (like our gender and age examples during the workshop).

    • Which variables will be the outcome variables.

    • Which variables will be predictors in your analysis. Select at least 2 predictor variables per team member, ideally more.

    • Which tables and figures you want. Aim for at least one descriptive statistics table, at least one plot showing the data, and at least one plot showing the analysis results.

  8. Once you’ve decided on the above points, design and sketch out a basic pipeline. Try to use pen and paper for this step, mimicking the diagrams we used during the sessions. Try to think of the names of each target output as well as the names of the functions that will be used in the pipeline. Consider the cleaning, processing, basic descriptive statistics, modeling, and plotting steps needed to get to the final report.

  9. Once you have the sketch, split up each set of targets in the pipeline between team members. Each person should be responsible for writing the functions and targets for their part of the pipeline.

  10. To make it easier to work together collaboratively, each team member should create a new Quarto document in the docs/ folder named something like docs/sandbox-<yourname>.qmd (e.g. with the Palette “new Quarto”) and a new R script in the R/ folder named something like functions-<yourname>.R (e.g. with usethis::use_r()). Don’t forget to set the targets store option in the setup code chunk of your own Quarto document (see the sketches after this list).

    • You already have a docs/report.qmd file. Include this file in your targets pipeline as a target, and as you build your targets, include the outputs (tables, figures) of your pipeline in this file with targets::tar_read().

    • In theory, if you do these things you will only have merge conflicts in the _targets.R, DESCRIPTION, and docs/report.qmd files, which will be easier to resolve. See the tips below about minimizing merge conflicts when collaborating. The helpers and instructors will also be available to help you resolve any merge conflicts.

  11. Using the workflow we’ve covered in this workshop, collaboratively create a reproducible analysis pipeline by:

    • Using the targets package to create a pipeline that goes from reading the data to generating the HTML report, and regularly running it with targets::tar_make() or with the Palette (Ctrl-Shift-P, then type “targets run”). A minimal sketch of such a pipeline is shown after this list.
    • Prototyping code in docs/sandbox-<yourname>.qmd, converting it into functions, and then moving those functions into your respective R scripts in the R/ folder.
    • Writing correctly formatted R code using styler and Markdown using Quarto’s canonical mode.
    • Regularly committing your changes to Git and pulling from and pushing to GitHub. When pushing changes to GitHub, you will likely get some merge conflicts that you will need to resolve right away. You might want to communicate with each other whenever any of you has made changes to docs/report.qmd or _targets.R, or has added a dependency that caused a change to the DESCRIPTION file.
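As a concrete sketch of task 1, the lines below clone a team repository and set up GitHub authentication. The repository name "your-org/your-team-repo" is a placeholder; use the repository we created for your team.

# Clone the team repository ("your-org/your-team-repo" is a placeholder)
usethis::create_from_github("your-org/your-team-repo")
# If Git asks you to authenticate, store your GitHub token with:
gitcreds::gitcreds_set()
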
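For the personal files in task 10, the sketch below shows one way to create your own R script and, in the setup code chunk of your own Quarto document, point targets at the project’s store so that targets::tar_read() finds the pipeline outputs. The name "luke" is a placeholder, and using tar_config_set() with here::here() is one possible approach, not the only one.

# Create your own functions script in the R/ folder ("luke" is a placeholder)
usethis::use_r("functions-luke")

# In the setup chunk of docs/sandbox-luke.qmd, one way to set the targets
# store location so targets::tar_read() can find the pipeline outputs:
targets::tar_config_set(store = here::here("_targets"))
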
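Finally, as a rough sketch of what the pipeline in task 11 could look like, here is a minimal _targets.R. The target names, the dataset file name, and the functions calculate_descriptive_stats() and run_simple_model() are all placeholders that you would replace with your own.

# _targets.R: a minimal sketch, not a complete solution
library(targets)
library(tarchetypes)

# Load the functions from the scripts in the R/ folder
tar_source()

list(
  tar_target(
    name = raw_data,
    command = readr::read_csv(here::here("data/your-dataset.csv"))
  ),
  tar_target(
    name = descriptive_stats,
    command = calculate_descriptive_stats(raw_data) # placeholder function
  ),
  tar_target(
    name = model_results,
    command = run_simple_model(raw_data) # placeholder function
  ),
  tar_quarto(
    name = report,
    path = "docs/report.qmd"
  )
)
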

Tip: Dataset suggestions
  • Onset of Labor dataset:
    • In this dataset, there are a few outcome variables (y) and many predictor variables (x).
    • We suggest that each person on a team of two select one of the two outcome (y) variables for their model, which are EGA (estimated gestational age) and DOS (days of labor onset). You can make these two separate targets in the pipeline.
    • Since there are many predictor variables (x), we suggest each person select at least two of the other variables, that is, any variable except the two outcome variables and the ID, SampleID, and Timepoint variables.
  • Bearded Dragon dataset:
    • There are several outcome variables here. We suggest each person select one outcome (y) variable for their models, choosing either grade (for degree of hepatic fat accumulation), HU (Hounsfield unit, where positive values mean a more dense liver), fat.perc (percent of fat in the liver), or fibrosis.perc (percent of the liver with fibrosis). You can make these separate targets in the pipeline.
    • We suggest each person select at least two predictor (x) variables for the models, from any of the blood variables, for instance (but not limited to) insulin, trig, VLDL, urea, or serine.
    • You should ideally tidy the variable names using snakecase::to_snake_case() and dplyr::rename_with() when cleaning them to make them easier to work with, though this isn’t strictly necessary (see the sketch just after these suggestions).
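As a small sketch of the name tidying suggested above, assuming the data has been read into a data frame called dragon_data (a placeholder name):

# Convert all column names to snake_case
dragon_data <- dplyr::rename_with(dragon_data, snakecase::to_snake_case)
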

Tip: General analysis tips
  • Most datasets have a column for each variable, unlike the dataset we used during the workshop, which was in a very long format (the names of metabolites were all in one column, with their values in another). You can use pivot_longer() from tidyr to convert the data into a longer format, which will make it easier to run more analyses via group_by() and group_split() (see the sketch below).

  • We strongly suggest you do not include more than one predictor in any individual model formula, as it will make the analysis and coding more complicated than is necessary for this workshop. If you feel confident or finish these tasks early, you can try to adjust or control for another variable (like we did with gender and age).
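Below is a small sketch of the longer-format approach described above, combined with simple one-predictor models. The data frame your_data and the columns outcome, insulin, trig, and urea are hypothetical; swap in the dataset and the outcome and predictor variables your team chose.

library(dplyr)
library(tidyr)

# Pivot the chosen predictor columns into a long format
data_long <- your_data |>
  pivot_longer(
    cols = c(insulin, trig, urea), # hypothetical predictor columns
    names_to = "predictor",
    values_to = "predictor_value"
  )

# Fit one simple y ~ x model per predictor
models <- data_long |>
  group_by(predictor) |>
  group_split() |>
  purrr::map(~ lm(outcome ~ predictor_value, data = .x)) # hypothetical outcome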

Tip: General collaboration tips
  • To minimize merge conflicts, create R scripts (that contain only functions) for each person and create a Quarto document for each person. This way, you can work on your own script and document without causing conflicts with your team members. See tasks above.

  • Commit, pull, and push your changes regularly, as that will reduce the chances of getting large merge conflicts.

23.2 Quick “checklist” for a good project

  • Project used Git and is on GitHub.
  • Code is formatted correctly via styler.
  • Markdown is formatted correctly via Quarto’s canonical mode.
  • Multiple models are computed programmatically using functional programming.
  • Quarto is used to output the model results as a table.
  • Only R functions are kept in the scripts in the R/ folder.
  • Analysis is fully reproducible by running targets::tar_make().
  • Most code in the Quarto document is only used to display the results, not to do processing or analyses.

23.3 Expectations for the project

What we expect you to do for the group project:

  • Use Git and GitHub throughout your work.
  • Use styler with the Palette (Ctrl-Shift-P, then type “style file”) to format your R code.
  • Use Quarto’s canonical mode to format your Markdown.
  • Have a reproducible analysis pipeline with targets.
  • Work collaboratively as a group and share responsibilities and tasks.
  • Use as much of what we covered in the workshop as you can to practice what you learned.

What we don’t expect:

  • Complicated analysis or coding. The simpler it is, the easier it is for you to do the coding and understand what is going on. It also helps us see that you’ve practiced what you’ve learned.
  • Clever or overly concise code. Clearly written and readable code is always better than clever or concise code. Keep it simple and understandable!

Essentially, the team project is a way to reinforce what you learned during the workshop, but in a more relaxed and collaborative setting.

23.4 How will the project be tested?

The lead instructor will run these commands to test your project:

# Install dependencies
pak::pak()
# Basic check and style code
styler::style_dir()
# Run pipeline
targets::tar_make()
# Open HTML report in browser
browseURL("docs/report.html")

If the pipeline runs and the HTML shows some figures and tables, then the project is a success! 🎉

23.5 Survey

After the projects are tested, please complete the survey for this session as well as for the whole workshop:

Feedback survey! 🎉