An important concern in any organization adopting R is code stability. Tooling is required to meet the needs of R users, which include access to the latest packages and reproducible environments. In regulated environments, an additional level of control may be required to ensure that certain teams or projects use only groups of packages that have undergone additional vetting. Administrators also need to manage the availability of different versions of R and of the system dependencies that R packages may rely on.
There is not a single solution that meets the needs of every organization. RStudio Package Manager is designed to support a number of strategies. Three common strategies are outlined below.
Approach 1: Client Side Management
The most common approach to managing change control is for the user to manage package dependencies for a project. An easy way for R users to do this is to create project-specific libraries of packages. In this model, administrators specify which packages are available in the repository. Users create a specific library for each project and are responsible for installing the packages required for their project. Over time, users can upgrade packages or add new packages.
For reproducibility, users must maintain a record of which package versions are included in their libraries. This record allows the user to recreate the environment from a clean slate by requesting the specific version of each package from the repository. There are a number of R packages designed to help users accomplish this task, including the packrat package. RStudio Connect automates this approach for users when they deploy content.
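As a rough illustration of such a record, the base R sketch below writes the name and version of each package in a project library to a CSV file. The function name `record_versions` and the default file name `package-versions.csv` are assumptions made for this example and are not part of RStudio Package Manager or packrat. packrat keeps its own, richer record through `packrat::snapshot()` and rebuilds a library with `packrat::restore()`.

```r
# Record the name and version of every package installed in one library.
# `lib` is the path to a project-specific library (hypothetical in this sketch).
record_versions <- function(lib, file = "package-versions.csv") {
  # installed.packages() returns a character matrix with one row per package.
  pkgs <- installed.packages(lib.loc = lib)[, c("Package", "Version"), drop = FALSE]

  # Drop the row names so the manifest holds only the two columns of interest.
  manifest <- data.frame(pkgs, row.names = NULL, stringsAsFactors = FALSE)

  # Persist the record so it can be kept alongside the project.
  write.csv(manifest, file, row.names = FALSE)
  invisible(manifest)
}
```

For example, `record_versions(.libPaths()[1])` records the contents of the first library on the current search path.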
This model relies on the ability of the repository to handle multiple versions of each package. R repositories handle this task using a specific structure called an archive. RStudio Package Manager automatically handles archiving packages and knows how to respond to requests for older packages from tools like
Approach 2: Repository Versioning
Another common approach is for users to pin a project to a specific version of the package repository. In contrast to the first approach, this approach does not rely on the repository managing an archive of package versions. Instead, the entire repository is versioned. Like the first approach, this strategy relies on the user creating a specific package library for each project. However, instead of recording the version of each package in use, the user records only the version of the repository and the names of the necessary packages. To recreate the package environment, the user installs the requisite packages from the recorded version of the repository. The most popular example of this approach is the checkpoint package, which relies on Microsoft’s online copy of CRAN.
RStudio Package Manager supports this strategy by automatically associating every action in a repository with a versioned id. See Section 14.2 of the admin guide for details. Users can get started by using a Repository URL from the “Setup” page that points to a specific repository version rather than one that always serves the current packages.
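A minimal sketch of this workflow in base R follows. The helper names `record_environment` and `restore_environment` and the file name `repo-record.rds` are hypothetical, and any URL passed to them (for example `https://rspm.example.com/cran/123`) is a placeholder; substitute the versioned URL reported by your own server.

```r
# Save the two things this strategy needs: the versioned repository URL
# and the names of the packages the project uses.
record_environment <- function(repo, packages, file = "repo-record.rds") {
  record <- list(repo = repo, packages = sort(unique(packages)))
  saveRDS(record, file)
  invisible(record)
}

# Recreate the project library by installing the recorded packages
# from the recorded repository version.
restore_environment <- function(file = "repo-record.rds", lib) {
  record <- readRDS(file)
  dir.create(lib, recursive = TRUE, showWarnings = FALSE)
  install.packages(record$packages, repos = c(CRAN = record$repo), lib = lib)
}
```

Because the repository URL is fixed, repeated calls to `restore_environment()` request packages from the same repository version rather than from whatever is current at the time.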
Approach 3: Locked Down Repositories
Some organizations prefer to manage which versions of packages are used across the organization instead of on a project-by-project basis. A common strategy entails:
1. Administrators test a set of desired packages and then freeze that set. Users are able to use only those versions of the packages.
2. A few times per year, administrators test and update the set of packages. They might also approve new packages.
RStudio Package Manager enables this strategy through curated CRAN sources. RStudio Package Manager also enhances this strategy by adding the option for a step between 1 and 2. When an administrator creates a curated CRAN source and defines a set of packages (step 1), RStudio Package Manager also records the state of all of CRAN at that moment. This enables an administrator to later add packages to the approved subset without worrying about updating the entire set.
Curated CRAN sources can be used with any of these strategies if organizations want to apply additional governance policies such as limiting packages to those with approved licenses.
What about versions of R?
To capture and recreate the environment for an R project, organizations must manage the version of R in addition to the R packages.
In general, a library of installed packages is only compatible with a single version of R. For a new version of R, users or administrators must re-install the desired set of R packages. Luckily, managing a repository of packages facilitates re-installing packages into different libraries.
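One way to keep libraries separate per R version is to include the version in the library path. The sketch below is an assumption about how a project might be laid out, not a feature of RStudio Package Manager; the names `versioned_lib` and `use_versioned_lib` and the `library/R-x.y` directory convention are hypothetical.

```r
# Build a library path that is specific to the major.minor version of R,
# for example "myproject/library/R-4.3".
versioned_lib <- function(project_dir, r_version = getRversion()) {
  parts <- strsplit(as.character(r_version), ".", fixed = TRUE)[[1]]
  file.path(project_dir, "library", paste0("R-", paste(parts[1:2], collapse = ".")))
}

# Create the version-specific library if needed and put it first on the
# library search path for the current session.
use_versioned_lib <- function(project_dir = ".") {
  lib <- versioned_lib(project_dir)
  dir.create(lib, recursive = TRUE, showWarnings = FALSE)
  .libPaths(c(lib, .libPaths()))
  invisible(lib)
}
```

After switching to a new version of R, calling `use_versioned_lib()` yields an empty library, and the project's packages can then be re-installed into it from the repository.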
What about Docker?
Docker can play an important role in creating reproducible environments. Docker specifies the steps for creating an environment in a Dockerfile. A Docker image is created by running each of those steps.
Normally, a Dockerfile for an R project will include one or more steps that install R packages:
RUN Rscript -e 'install.packages(...)'
Using Docker in this way facilitates reproducibility, but Docker alone is not sufficient to guarantee it. Each time the Docker image is rebuilt, the R command that installs packages runs again. Just as in an environment without Docker, this command can return different results over time.
Luckily, the same approaches outlined above also work with Docker. Simply replace the install.packages command with a variation that uses a frozen repo or restores a specific environment using a tool like