Support

Persistent Storage on RStudio Connect

Follow

When deploying content to RStudio Connect, it is often the case that data, files, or other state needs to be maintained across multiple deployments or across multiple content endpoints.

Example

My organization has an ETL (Extracting, Transforming, and Loading data) script running in a scheduled R Markdown that cleans up our data every night. We also have two Shiny applications that need access to the data and present the information to end users. One of the Shiny applications allows making edits and corrections to the data if necessary. There are several different places that we might store this data in order to make this process successful. We will overview the several options below.

Relative References

The most natural place to store information is in a relative location (i.e. readr::read_csv(“myfile.csv”)). This can work well in a Shiny application, and is very natural for developers. The data is simply published with the application and can be updated each time the application is published. This method is ideal for lookup tables, application metadata, or small / lightweight data.

Drawbacks to Relative References

Multiple References

Your team may want to schedule R Markdown reports that update this data, or a Plumber API that accesses the same results. With relative references, data is not accessible from other content on RStudio Connect.

Dangers for R Markdown

While all instances of a Shiny application have access to the same relative directory, R Markdown documents are rendered in a clean temporary space each time. This means that files written locally during an R Markdown render will not persist in the code directory, and will be completely discarded after render. This can make it more challenging for R Markdown documents to rely on relative references.

Absolute References

When relative references fail, absolute references are another approach to storing data on disk. Absolute references consist in referencing the data from an absolute location on the system (instead of a reference relative to where the process is executing), like readr::read_csv(“/app-data/mydirectory/myfile.csv”).

Benefits to Absolute References

Unlike relative references, absolute references can be shared by multiple data products. Access is controlled by the unix file permissions. For example, if you want any content on RStudio Connect to be able to access the data, ensure the folder is owned and writable by the rstudio-connect group.

Sandboxed Processes

Absolute references must be concerned about the sandboxed processes on RStudio Connect (more on that here). RStudio Connect protects certain areas on the file system from access by the processes that it executes. As a result, when absolute references are in use, it is best to define a top-level directory (like /app-data) that is uniquely named and houses persistent data.

Drawbacks to Absolute References: Write Conflicts

RStudio Connect can run multiple R processes that serve the same Shiny application or plumber API. This allows many simultaneous users of the Shiny application. However, if the application does _write-based_operations, these multiple R processes could end up in conflict with one another if they try to write to the same file at the same time.

Usually, this can be remedied by writing to a temp space with fs::file_temp() or tempfile(), which generate unique filenames for each R session and user like RtmptsyTqD/file67041e68c556. However, these temporary file paths are less than ideal if you want to persist and share the data later.

Using a Database

SQLite

To address the limitations of absolute references, organizations can use databases. Many organizations start with a SQLite database, since it consists of a single file and does not require any additional administrative burden to manage.

SQLite does provide some benefits. We get the advantage of being able to define indices and reference large amounts of data quickly without having to read that data into memory. However, SQLite does not solve the multiple write problem, and can have issues when hosted on an NFS server (for instance, in a HA RStudio Connect cluster). Instead, we recommend using an alternative database.

Other Databases

Most other popular, web-app serving databases (PostgreSQL, MySQL, SQL Server, Teradata, etc.) can to varying degrees handle multiple concurrent writes. Further, these databases can define indices for performance and handle user-based authentication if necessary. The database is also accessible from multiple hosts (in an RStudio Connect cluster) and multiple content endpoints (i.e. deployed apps and rmarkdown documents), as long as authentication information is provided.

As a result, databases are a great fit for persistent data storage in general. Usually, organizations have some level of database expertise, and the RStudio Professional Drivers provide a simple interface to a wide variety of databases through the odbc package.

When possible, we recommend using a database to store persistent information on RStudio Connect

Transitioning from CSV files to SQLite to other databases is easy when using dplyr and dbplyr, which use standard dplyr verbs with various database backends.  For more information, visit db.rstudio.com.

Summary

Persistent storage varies in complexity from using relative file references, to absolute file references, and finally databases. Pick an approach that is appropriate for your use case, and be sure to use an appropriate location on disk and avoid the dangers of multiple-concurrent-writes.

Promotion from Development to Production

When using any approach other than relative file references, users very quickly run into the problem of needing separate development and production environments. For instance, a database for developers to practice their logic, or an absolute file reference on a Windows computer where development is taking place.

For both of these examples, we recommend using the config package. There is an article on db.rstudio.com explaining its use in more detail for database locations, credentials, and any manner of configuration variables that need to change based on environment.

For a file path configuration, a config.yml file like the following would make transition to absolute file references seamless:

default:
  myfile: ~/myfile.txt

rsconnect:
  myfile: /app-data/myapp/production.txt

This file would then be referenced by the following:

the_file <- config::get(“myfile”)
readr::read_tsv(the_file)

Comments