Scaling R and RStudio


This document answers some frequently asked questions about scaling R and RStudio.

Q: I want to develop a platform to scale R for my organization. Can RStudio Server Pro help?

The first step is to determine the type of scale you are hoping to achieve. The following table presents an overview of the three most common cases.

Use Case | Problem | Solution | Technology
--- | --- | --- | ---
Scaling for Many R Users | Regular R workflows for a team. Includes loading data subsets from files or warehouses. | Create a platform to support large-scale individual interactive R session(s). | RStudio Server, RStudio Server Pro + Load Balancer
Scaling for HPC | Embarrassingly parallel tasks like bootstrapping, cross validation, scoring, and model fitting on independent groups. | Develop code in an interactive R session in RStudio. Submit code in batch jobs that run on worker R processes. R must be installed on all worker nodes. | Local: parallel, Rmpi, snow, RcppParallel; Cluster: LSF, SLURM, Torque, Docker Swarm; Recommendation: batchtools package
Scaling for Big Data | Big data and black box routines that require fitting a model against an entire domain space. Data can’t fit on one machine. | R is an orchestration engine. Heavy lifting is done by a different compute engine on the cluster. R syntax is used to construct pipelines, and R is used to analyze results. | Hadoop, Spark, TensorFlow, Oracle BDA, Microsoft R Server, Aster

RStudio Server Pro is designed to help your organization scale for a team of R users. The tool includes features for project sharing, collaborative editing, session management, and IT administration tools like authentication, audit logs, and server performance metrics.

Q: Will RStudio Server Pro’s load balancer run my R job across a cluster?

No. RStudio Server Pro’s load balancer distributes R sessions across the cluster, but each individual R session remains on a single server. To parallelize work across the cores of that server or across the cluster, the R analyst must write or submit parallel code (see "Scaling for HPC" in the table above).
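As an illustration of what "writing parallel code" can mean on a single server, here is a minimal sketch that runs a bootstrap across two worker processes with the base parallel package. The model, the built-in mtcars data, the worker count, and the helper name boot_slope are all assumptions chosen for the example, not part of RStudio Server Pro.

```r
library(parallel)

# One bootstrap replicate: resample rows with replacement and refit the model.
# boot_slope is a hypothetical helper written for this example.
boot_slope <- function(i, data) {
  rows <- sample(nrow(data), replace = TRUE)
  coef(lm(mpg ~ wt, data = data[rows, ]))[["wt"]]
}

# Start a small pool of worker R processes on this server.
cl <- makeCluster(2)
clusterSetRNGStream(cl, iseed = 42)  # reproducible, independent random streams

# The 200 replicates are split between the workers.
slopes <- parSapply(cl, seq_len(200), boot_slope, data = mtcars)
stopCluster(cl)

# Percentile bootstrap interval for the slope.
quantile(slopes, c(0.025, 0.975))
```

The R session that launches the workers still lives on one server; the load balancer plays no part in distributing the replicates.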

A load-balanced RStudio Server Pro cluster is designed to support larger teams of data scientists. The load balancer ensures that a new R session goes to the machine with the most availability, and that features like the admin dashboard and project sharing scale as you add RStudio nodes.

Q: I have an HPC cluster (LSF, SLURM, Torque). Do I need RStudio on each node?

Typically no. RStudio is used by analysts who are running R interactively. If you need to support many R users, it may make sense to install RStudio Server Pro on a number of nodes and load balance between them.

HPC systems are usually designed for batch job submission. In R, this typically means submitting R scripts that each run a small, independent part of a bigger problem (or submitting a single R script many times). Alternatively, you can submit a single R script that includes explicit code to parallelize across multiple cores or across a cluster. Either way, these scripts are usually written and tested interactively and then submitted in batch for the full run. You could install RStudio Server on one of the HPC nodes to help develop, test, and debug these R scripts, but the actual job that requires the cluster is executed in batch. Batch submission requires R, but not RStudio, to be installed on every node.
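The "single R script submitted many times" pattern can be sketched as follows. This is only an illustration: the function fit_one_group, the built-in iris data, and the RESULT_DIR environment variable are hypothetical, and the script assumes a SLURM array job, where the scheduler sets SLURM_ARRAY_TASK_ID for each task.

```r
# Hypothetical batch script, saved for example as fit_group.R and run once
# per task by the scheduler.

# Fit a model to one independent group of the data.
fit_one_group <- function(task_id, data = iris) {
  groups <- split(data, data$Species)
  group <- groups[[task_id]]
  fit <- lm(Sepal.Length ~ Petal.Length, data = group)
  data.frame(
    group = names(groups)[task_id],
    slope = coef(fit)[["Petal.Length"]]
  )
}

# SLURM sets SLURM_ARRAY_TASK_ID for array jobs. Defaulting to 1 lets the
# same script run interactively while it is being developed.
task_id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "1"))
result <- fit_one_group(task_id)

# Each task writes its own file; a later step combines them.
# RESULT_DIR is a hypothetical variable; a real job would point it at
# shared storage rather than falling back to a temporary directory.
out_dir <- Sys.getenv("RESULT_DIR", unset = tempdir())
out_file <- file.path(out_dir, sprintf("result_%03d.rds", task_id))
saveRDS(result, out_file)
```

On a SLURM site, a submission along the lines of sbatch --array=1-3 --wrap="Rscript fit_group.R" would run the script once per group; other schedulers expose a similar task index under a different name.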

Q: I have a Hadoop cluster. Where should I install RStudio?

There are many ways to interact with Hadoop from R. One of the most popular solutions is to use R in combination with Spark. In this workflow, R is an orchestrator. The analyst writes R code, and the R code in turn directs the heavy lifting to a separate computational engine (Spark). As an orchestrator, R communicates extensively with the cluster, and small, aggregated results are often brought back into R for further analysis. For those reasons, it is recommended to run R and RStudio on an edge node of the cluster.

A few solutions that follow this workflow include sparklyr and Microsoft R Server.
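The orchestrator workflow can be sketched with dplyr verbs, which sparklyr translates for Spark. This is a minimal sketch under assumptions: the function count_by_carrier, the table name flights, and the column carrier are hypothetical, and the right master value for spark_connect depends on how your cluster is configured.

```r
# Assumes the dplyr package is installed (plus sparklyr for a Spark table).

# Count rows per carrier where the data lives, then bring back only the
# small aggregated result. `carrier` is a hypothetical column name.
count_by_carrier <- function(remote_tbl) {
  counts <- dplyr::count(remote_tbl, carrier)  # computed by the backend
  dplyr::collect(counts)                       # download the aggregate into R
}

# With a Spark connection, usage might look like this (not run here):
#   sc      <- sparklyr::spark_connect(master = "yarn")
#   flights <- dplyr::tbl(sc, "flights")   # hypothetical table name
#   count_by_carrier(flights)
#   sparklyr::spark_disconnect(sc)

# The same function also works on a local data frame, which is handy
# while developing and testing in RStudio.
local_flights <- data.frame(carrier = c("AA", "AA", "UA"))
carrier_counts <- count_by_carrier(local_flights)
carrier_counts
```

Only the per-carrier counts travel back to the R session, which is why an edge node with good connectivity to the cluster is a natural place to run R.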

Q: I have a data appliance that supports R (Oracle BDA, Teradata, SAP HANA, Microsoft SQL Server). How can I use RStudio?

There are usually two types of integration between these tools and R. (Some of the tools support both types.)

Type 1: The appliance calls R, which returns its results to the appliance.

Many appliances define their own processing step that reaches out to R. For example, an analyst can write a SQL statement that includes a calculated column, where the calculation is an R function call.

RStudio cannot be directly used in this case. However, the R analyst creating the function can develop and test the code in RStudio.

Type 2: R calls the appliance, which returns its results to R.

In this use case, the appliance is treated as a data source or external computation engine for R. For example, an analyst might write a query that returns a subset of the data into R, or push the computation of a supported model into the data warehouse. The integration between R and the appliance is usually provided by a specialized R package.

For Type 2, the RStudio IDE is used. The R package abstracts the details of communicating with and accessing the appliance.
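A Type 2 query that returns a subset of the data into R might look like the sketch below. It uses the DBI package with an in-memory SQLite database as a stand-in for the appliance, which is an assumption made so the example runs anywhere; a real appliance would use its own driver package and connection details, and the table name cars is hypothetical.

```r
library(DBI)

# Stand-in for the appliance: an in-memory SQLite database (assumption).
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "cars", mtcars)

# The database filters and aggregates; only the summary comes back to R.
by_cyl <- dbGetQuery(con, "
  SELECT cyl, COUNT(*) AS n, AVG(mpg) AS avg_mpg
  FROM cars
  WHERE hp > 100
  GROUP BY cyl
  ORDER BY cyl
")

dbDisconnect(con)
by_cyl
```

The analyst works entirely in the RStudio IDE, while the filtering and aggregation run inside the database.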