start2finish is a project I created to share resources and provide a simple starting place for building a reproducible project. When I created it, I had my mentors and peers in mind, hoping that this guide would take the fear out of creating an R package for their own projects.
R from start to finish: Organizing your dissertation work with a reproducibility mindset using R and RStudio
I presented this poster at the Ecological Society of America's Annual meeting: ESA 2020 abstract and ESA poster
Have you ever run old code only to have it break? Or found the code associated with a scientific article and, when trying to understand or run it, realized it was almost impossible to figure out?
Like many ecologists, I started my journey with R and data analysis as a self-directed adventure, first learning to code and only later realizing the importance of reproducibility. For R programming in particular, there is an overwhelming number of resources on reproducible research. This poster is meant to be a resource, a short guide and starting point for setting up a reproducible workflow in R.
Most of the workflow relies on the usethis package and you can find a short tutorial on building packages here.
Perhaps this depends on the type of work you do; I'm not an expert and don't have particularly strong opinions. However, a package makes you follow certain conventions that keep things organized. I am a fan of writing functions in an R package and documenting them in detail with the 'roxygen2' package. Building a package is a nice way to keep things together, organized, and clear, although I am sure you can create a chaotic package too.
You don't need Git and GitHub to set up a project in R, but maintaining version control is highly recommended and fundamental for reproducibility. It means that there is a history for your code and analysis. Connecting Git and GitHub to RStudio is system dependent; a good resource for this process can be found at happygitwithr.com.
Before you run these steps, make sure you have installed the following packages: usethis, roxygen2, renv, here.
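If you are not sure which of these are already installed, a small check like the one below might help. This is only a sketch: `missing_packages` is a hypothetical helper I made up for illustration, and the install step is deliberately limited to interactive sessions.

```r
# Packages named in this guide.
required <- c("usethis", "roxygen2", "renv", "here")

# Return the packages from `pkgs` that cannot be found in your library.
# `missing_packages` is a hypothetical helper, not part of any package.
missing_packages <- function(pkgs) {
  found <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!found]
}

to_install <- missing_packages(required)

# Install only what is missing, and only in an interactive session.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```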
- Create the package, add a license if you want to, and, if you are feeling adventurous, create a GitHub repo for it. The reference functions for the usethis package can be found here. Running this function will open a new R session with your package!

      usethis::create_package("your package path")

- Keep track of the packages that you use with the renv package.

      renv::init()
      renv::snapshot()

- Use the here package and avoid starting scripts with setwd("your/specific/path/that/does/not/work/on/another/computer"). I will be honest, I had a hard time understanding this package until I ran across Jenny Richmond's post on how to use the here package. It comes down to the difference in file paths between .R and .Rmd files.

      here::here()

- Use dplyr or base R to clean your data using R scripts. Any changes or deletions that happen in the spreadsheet are lost and forgotten in the realm of non-reproducible clicks. Clean your data with scripts so that you can always go back to the original and be certain of what changes have been made during the cleanup. Broman & Woo, 2018 has great advice on working with spreadsheets.
- Write your analysis and even your manuscript in rmarkdown. There are several packages out there that use rmarkdown and will help set up different types of articles. You can even create presentations with rmarkdown. If you use rmarkdown together with version control (Git and GitHub), you can avoid accumulating several final.docx versions of your work.
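As a rough sketch of the "clean your data with scripts" idea, a cleaning step could look something like this in base R. The function name `clean_counts`, the file names, and the columns `site` and `count` are all hypothetical; the point is only that the raw file is read, never edited, and a cleaned copy is written elsewhere.

```r
# Hypothetical cleaning step: the columns `site` and `count` are
# assumptions made for illustration.
clean_counts <- function(raw_path, clean_path) {
  raw <- read.csv(raw_path, stringsAsFactors = FALSE)

  # Drop rows with missing counts.
  cleaned <- raw[!is.na(raw$count), ]
  # Standardize site labels: lower case, no stray whitespace.
  cleaned$site <- trimws(tolower(cleaned$site))

  # Write a new file; the raw data are never overwritten.
  write.csv(cleaned, clean_path, row.names = FALSE)
  invisible(cleaned)
}

# In a project you might build the paths with here::here(), for example:
# clean_counts(here::here("data-raw", "counts.csv"),
#              here::here("data", "counts_clean.csv"))
```

Because the paths are arguments, the same function should work from a .R script or an .Rmd file, as long as the paths passed in are built relative to the project root.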
When I am starting a new project, I follow these steps:
usethis::create_package("projects/mypackage")
usethis::use_mit_license("Your Name")
usethis::use_git()
usethis::use_github()
usethis::use_readme_rmd()

These steps will create my package, my GitHub repo, and a README written in rmarkdown so that I can include chunks of code and figures in it. After that setup, I start tracking my packages:
renv::init()
renv::snapshot()

I will add some of the packages I know I will use in my work as dependencies, one call per package:

usethis::use_package("dplyr")
usethis::use_package("ggplot2")
usethis::use_package("fitdistrplus")

And then save the changes with:

renv::snapshot()

After the snapshot, you can commit your changes and push them to your repo so that your lockfile (renv.lock) is updated there. Any time you add new packages, repeat these steps.
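The lockfile pays off when someone else (or future you, on another computer) clones the repo. A minimal sketch of that side of the workflow, assuming renv is installed: `restore_if_locked` is a hypothetical wrapper I wrote for illustration, while `renv::restore()` is the renv function that reinstalls the package versions recorded in renv.lock.

```r
# Hypothetical wrapper: restore the project library only when a lockfile exists.
restore_if_locked <- function(project = ".") {
  lockfile <- file.path(project, "renv.lock")

  if (!file.exists(lockfile)) {
    message("No renv.lock found in ", normalizePath(project))
    return(invisible(FALSE))
  }

  # Reinstall the package versions recorded in the lockfile.
  renv::restore(project = project, prompt = FALSE)
  invisible(TRUE)
}
```

In practice, calling `renv::restore()` directly from the project root is usually enough; the wrapper just makes the "is there a lockfile?" check explicit.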
You can create your first script, add a function with a description, and document it with roxygen2. You can find a short tutorial here.
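To give a feel for it, a script created with `usethis::use_r()` might hold something like the function below. The function itself (`count_to_density`) and its arguments are hypothetical and only there to show where the roxygen2 comments go.

```r
#' Convert counts to density
#'
#' Hypothetical helper used only to illustrate roxygen2 comments.
#'
#' @param count Numeric vector of individuals counted per plot.
#' @param area_m2 Numeric plot area in square meters (must be positive).
#'
#' @return Numeric vector of densities (individuals per square meter).
#' @export
#'
#' @examples
#' count_to_density(c(10, 25), area_m2 = 5)
count_to_density <- function(count, area_m2) {
  stopifnot(is.numeric(count), is.numeric(area_m2), all(area_m2 > 0))
  count / area_m2
}
```

Running `roxygen2::roxygenise()` (or `devtools::document()`) from the package root should turn these comments into a help file under man/.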
usethis::use_r("name of your script")

This setup is intended to help you take the leap and get started. There are a number of resources out there, perhaps too many sometimes. If you'd like to jump ahead to "how do I write my manuscript in rmarkdown", you should definitely check out Anna Krystalli's Reproduce a paper in Rmd and follow some of the resources below.
- Anna Krystalli (@annakrystalli) and her talk “Putting the R into Reproducible Research”
- Sharla Gelfand (@sharlagelfand) and her talk at rstudio::conf(2020)
- Karthik Ram (@_inundata), his GitHub repo and talk at rstudio::conf(2019)
- ‘thesisdown’ repo from Chester Ismay (@old_man_chester) – this one is specific for dissertation writing
- ‘rrtools’ project and Ben Marwick (@benmarwick), who also has several publications on this topic.
- Reproducibility in science – guide, from rOpenSci
- Open Science Framework (@OSFramework)
- Broman, K. W., & Woo, K. H. (2018). Data organization in spreadsheets. The American Statistician, 72(1), 2-10.
- Jenny Bryan's STAT545
- Don’t forget your research services or reproducibility librarian!
- Boettiger, C. (2018), From noise to knowledge: how randomness generates novel phenomena and reveals information. Ecol Lett, 21: 1255-1267. doi:10.1111/ele.13085
