R Project Setup Guide - Lewis Does Data

It is often said that you can’t build a great building on a weak foundation. Data analysis projects are no different. Neglecting to set up your project correctly at the outset will likely lead to increasingly poor data and code management practices as your projects grow in size and complexity. Eventually this will cause confusion, probable errors, and the possible loss of hours, weeks or months’ worth of work!

“May you have a strong foundation” — Bob Dylan

The tutorial below covers the important aspects of how to start, structure and manage projects using RStudio along with Git and GitHub. I have also included some extra information here, along with a short demo of how to set up an R project without using version control, so that you can familiarise yourself with the fundamentals before getting started with GitHub.

TL;DW

In this tutorial, I demonstrate how to start an R project using RStudio and best practice version control tools Git and GitHub. I also briefly cover a few ways of initialising the project directory structure following a format that I recommend for most data analysis projects. The tutorial ends with a quick discussion of data and project management best practices.

Linked Resources

The tutorial linked below will guide anyone who needs to set up a new GitHub account and configure SSH keys so that they can follow along with the video tutorial:

GitHub tutorial

This list of tutorials is for anyone wanting to learn more about how to integrate version control with Git into their data analysis project workflow:

Git tutorial part 1
Git tutorial part 2
Git tutorial part 3
Git tutorial part 4

Naming Projects

An often-overlooked aspect of project management is project naming conventions and the impact this can have on managing multiple projects at the macroscopic level; it can become incredibly difficult to keep track of individual projects when you have a large number of them split across many folders, file systems and storage media.

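A sensible general guideline for naming projects is to start with the project start date in year-month-day order, as in ISO 8601, followed by a brief project description. For example, 2022_03_18_r_project_setup_tutorial. Strictly speaking, ISO 8601 separates the date parts with hyphens (2022-03-18); underscores are used here for consistency with the rest of the name. Because the most significant part of the date comes first, names in this form sort chronologically. The use of underscores (or hyphens) shouldn’t need any explaining; any data professional knows that whitespace has no place in file and folder names.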
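As a minimal sketch of this naming convention, the shell function below builds a dated project name from today's date and a free-text description. The helper name dated_project_name and the example description are illustrative assumptions, not part of the video tutorial.

```shell
#!/bin/sh
# Build a project folder name of the form YYYY_MM_DD_description.
# dated_project_name is a hypothetical helper name used for illustration.
dated_project_name() {
  description=$1
  # Lowercase the description, turn spaces and hyphens into underscores,
  # and squeeze repeated underscores into one.
  slug=$(printf '%s' "$description" | tr '[:upper:]' '[:lower:]' | tr -s ' -' '__')
  printf '%s_%s\n' "$(date +%Y_%m_%d)" "$slug"
}

# Usage: create the top-level directory for a new project
# mkdir "$(dated_project_name 'R project setup tutorial')"
```

The date comes from the system clock, so this only suits projects that start today; for an older project, type the start date by hand.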

Modifying the .gitignore Template

When setting up your project repository on GitHub as shown in the video, adding the following to your .gitignore file will instruct Git to ignore the contents of the project data/ and results/ directories, as well as those incredibly irritating .DS_Store files that macOS creates everywhere. Ignore the latter addition if you are a Windows or Linux user.

# Additional user-defined directories
data/
results/

# Additional user-defined files
.DS_Store

Don’t forget to commit the change with a sensible commit message afterwards, to remind yourself and anyone else viewing the project in future what changes were made at this point in the project history!
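If you prefer the terminal to a text editor, the same entries can be appended from the project root. This is only a sketch: add_ignore_entry is a hypothetical helper name, and the commit message is just one example of a sensible message.

```shell
#!/bin/sh
# Append an entry to .gitignore only if it is not already listed.
# add_ignore_entry is a hypothetical helper name used for illustration.
add_ignore_entry() {
  entry=$1
  touch .gitignore
  # -x matches whole lines only, -F treats the entry as a fixed string
  grep -qxF -- "$entry" .gitignore || printf '%s\n' "$entry" >> .gitignore
}

# Usage, from the project root:
# add_ignore_entry 'data/'
# add_ignore_entry 'results/'
# add_ignore_entry '.DS_Store'
# git add .gitignore
# git commit -m "Ignore data/, results/ and .DS_Store"
```

Checking for an existing line first means the function can be rerun without creating duplicate entries.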

Results are Disposable!

The reason that I recommend ignoring the data/ directory in the .gitignore file is as simple as this: your project raw data should exist in at least three places, on at least two different storage media, with at least one copy in a remote storage repository. This is otherwise known as the 3-2-1 backup strategy.

GitHub isn’t ideally suited for data backup purposes: it is intended for code and rejects individual files larger than 100 MB. Many projects will have raw data files that exceed this limit, so you need to use a more appropriate means of remotely storing your data, e.g., cloud storage.
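To see which files in a project would run into this limit, something like the following sketch can help. The helper name list_large_files is hypothetical, and the threshold is expressed in 512-byte blocks so that it works with any POSIX find.

```shell
#!/bin/sh
# List files larger than 100 MiB under a directory (default: current directory).
# list_large_files is a hypothetical helper name used for illustration.
list_large_files() {
  dir=${1:-.}
  # find counts -size in 512-byte blocks: 204800 * 512 bytes = 100 MiB
  find "$dir" -type f -size +204800
}

# Usage, from the project root:
# list_large_files data
```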

As the section title says, the reason I advocate for ignoring the project results/ directory is that results are disposable. What I mean by this is that, if we ensure our raw data is safely and correctly backed-up, and we are correctly managing our project source code (ideally using version control tools like Git and GitHub), then there should be no need to worry about project outputs; these can always be regenerated using the other two elements.

With your R scripts (and your data files), you can recreate the environment.
It’s much harder to recreate your R scripts from your environment!

— Hadley Wickham

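One excellent recommendation that I picked up from the R guru himself in the highly recommended R for Data Science is to instruct RStudio not to preserve your workspace between sessions. In “Tools” → “Global Options…” → “General”, untick “Restore .RData into workspace at startup” and set “Save workspace to .RData on exit” to “Never”.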

This change to the global options means that when you restart RStudio, it will not remember the results of the code that you ran last time. This behaviour will remove your reliance on the RStudio environment and force you to capture all important interactions in the code itself.

There are some handy keyboard shortcuts that work together to enable you to check that your code is self-contained:

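Either of the following, depending on your operating system and RStudio version (both restart the R session rather than the RStudio application itself):

Cmd/Ctrl + Shift + F10
Cmd/Ctrl + Shift + 0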

Then:

Cmd/Ctrl + Shift + S to rerun the current script

Running one of these combinations at regular intervals during your coding sessions will ensure that you are not inadvertently relying on variables stored in your environment.
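A similar check can be made outside RStudio by running the script in a brand-new R process from the terminal. The sketch below assumes that Rscript is on your PATH; run_fresh is a hypothetical helper name and src/analysis.R is a placeholder path.

```shell
#!/bin/sh
# Run an R script in a fresh R process that ignores any saved workspace.
# run_fresh is a hypothetical helper name used for illustration.
run_fresh() {
  # --vanilla: do not restore or save the workspace, and skip startup
  # files such as .Rprofile and .Renviron
  Rscript --vanilla "$1"
}

# Usage (src/analysis.R is a placeholder path):
# run_fresh src/analysis.R
```

If the script only works when objects from an earlier session are lying around, it will fail here, which is exactly what you want to find out. Note that --vanilla also skips .Rprofile, so drop the flag if your project relies on that file.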

Project Directory Structure

Once you have set up your R project, you can add your project sub-directories to your top-level folder, and then start adding in your data, code etc.

As I mention in the tutorial, you can do this either within the RStudio file explorer itself, or by using your system file explorer (e.g., Finder on macOS) as you might do to create any other series of directories.

To save time, I prefer to set up my project directory tree by switching to the Terminal and running a quick one-line shell command:

mkdir -p ./{docs,data,results/{proc_data,figures,reports},src}

If the Terminal tab is not visible next to the R console, you can open one within RStudio by clicking “Tools” → “Terminal” → “New Terminal”.

The shell command that I just showed sets up a series of directories in the format that I generally recommend for a data analysis project. These are:

docs for text documents relating to the project
data for raw data
results containing the following sub-directories:
    proc_data for processed data
    figures for graphical outputs
    reports for presentations or reports generated for the project
src for the project source code, scripts, and markdown files
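The one-line command relies on brace expansion, which bash and zsh support but a plain POSIX sh does not. As a hedged alternative, the sketch below creates the same tree with a simple loop; init_project_dirs is a hypothetical helper name, and it defaults to the current directory.

```shell
#!/bin/sh
# Create the recommended project tree without relying on brace expansion.
# init_project_dirs is a hypothetical helper name used for illustration.
init_project_dirs() {
  root=${1:-.}
  for dir in docs data results/proc_data results/figures results/reports src; do
    # -p creates parent directories as needed and is a no-op if they exist
    mkdir -p "$root/$dir" || return 1
  done
}

# Usage, from the project root:
# init_project_dirs
```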

Once you’ve set up the project directories, you’re ready to start populating them with data, code and results, making regular commits with intuitive commit messages as you go.

R Project Creation Without Version Control

Starting a project without using version control in RStudio is a matter of a couple of clicks. You can either use the new project button in the top-left corner of the RStudio main window, or navigate to “File” → “New Project…”, and then follow the new project wizard:

Enter an intuitive name for your project top-level directory, click “Create Project”, and you’re done.

Summary()

This tutorial covers most of the things that you will ever need to start and manage RStudio projects while adhering to best practices in terms of project and data management.

But this is not the full story! We haven’t covered how R projects enable you to write robust and portable code by making file paths easy to work with. Nor have we covered project dependency management.

For those of you who are interested, there are a couple of great packages that facilitate these tasks, namely the here and renv packages, but I will be back with tutorials on both important topics in the future.

Catch you next time!

View Session Info

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.2.1 (2022-06-23)
##  os       macOS Big Sur ... 10.16
##  system   x86_64, darwin17.0
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Europe/London
##  date     2023-03-20
##  pandoc   2.17.1.1 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  cli           3.6.0   2023-01-09 [1] CRAN (R 4.2.0)
##  digest        0.6.31  2022-12-11 [1] CRAN (R 4.2.0)
##  evaluate      0.20    2023-01-17 [1] CRAN (R 4.2.0)
##  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
##  htmltools     0.5.4   2022-12-07 [1] CRAN (R 4.2.0)
##  knitr         1.42    2023-01-25 [1] CRAN (R 4.2.1)
##  rlang         1.0.6   2022-09-24 [1] CRAN (R 4.2.0)
##  rmarkdown     2.20    2023-01-19 [1] CRAN (R 4.2.0)
##  rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.2.0)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
##  xfun          0.37    2023-01-31 [1] CRAN (R 4.2.0)
##  yaml          2.3.7   2023-01-23 [1] CRAN (R 4.2.0)
## 
##  [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
## 
## ──────────────────────────────────────────────────────────────────────────────

. . . . .

Thanks for reading. I hope you enjoyed the article and that it helps you to get a job done more quickly or inspires you to further your data science journey. Please do let me know if there’s anything you want me to cover in future posts.

Happy Data Analysis!

. . . . .

Disclaimer: All views expressed on this site are exclusively my own and do not represent the opinions of any entity whatsoever with which I have been, am now or will be affiliated.
