Troubleshooting Launcher and Slurm in RStudio Server Pro

We recommend reviewing two documents from the Slurm project, as they are very useful when running RStudio Server Pro's Job Launcher service with the Slurm integration:

Question: How do I verify the Slurm cluster functionality?

Answer: To verify that your Slurm cluster is functional and accepting/running jobs, you can perform the pre-flight configuration checks documented in the steps for Configuring RStudio Server Pro with Launcher and Slurm.

Question: How do I verify RStudio Server Pro with Launcher and Slurm?

Answer: Run the following command to test the installation and configuration of RStudio Server Pro with Launcher and Slurm:

sudo rstudio-server stop
sudo rstudio-server verify-installation --verify-user=<USER>
sudo rstudio-server start

Replace <USER> with the username of a user who is set up to run RStudio Server Pro in your installation.

Refer to the Troubleshooting section in the RStudio Server Pro Administration Guide for more information on using the Launcher verification tool.
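
If verification is run by hand and fails partway through, the server can be left stopped. The sketch below wraps the three commands above in a shell function that always restarts the server and returns the verification exit status. The function name verify_launcher is hypothetical and is not part of RStudio Server Pro.

```shell
# verify_launcher is a hypothetical helper, not an RStudio Server Pro command.
# It stops the server, runs the verification, and always restarts the server.
verify_launcher() {
    user="$1"
    if [ -z "$user" ]; then
        echo "usage: verify_launcher <USER>" >&2
        return 2
    fi
    sudo rstudio-server stop || return 1
    # Capture the verification status without aborting on failure.
    rc=0
    sudo rstudio-server verify-installation --verify-user="$user" || rc=$?
    # Restart even when verification failed, so the server is not left stopped.
    sudo rstudio-server start
    return "$rc"
}

# Usage:
# verify_launcher <USER>
```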

Question: Where are the logs stored for RStudio Server Pro and Launcher? 

Answer: The logs for RStudio Server Pro and Launcher can be found at:

  • /var/lib/rstudio-server/monitor/log/rstudio-server.log
  • /var/lib/rstudio-launcher/rstudio-launcher.log
  • /var/lib/rstudio-launcher/Slurm/rstudio-slurm-launcher.log

You can inspect these logs for errors after attempting to launch a session or job on Slurm.
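
As a starting point for that inspection, the following sketch prints every line containing "error" (case-insensitive) from the logs it is given, with the file name and line number. The helper name scan_logs is hypothetical, and matching on the word "error" is an assumption about how problems appear in these logs. Reading the files may require elevated privileges.

```shell
# scan_logs is a hypothetical helper: print lines mentioning "error"
# from each readable log file, prefixed with file name and line number.
scan_logs() {
    for log in "$@"; do
        if [ -r "$log" ]; then
            # Passing /dev/null as a second file makes grep print the file name.
            grep -in 'error' "$log" /dev/null || true
        else
            echo "cannot read: $log" >&2
        fi
    done
    return 0
}

# Usage with the log paths listed above:
# scan_logs /var/lib/rstudio-server/monitor/log/rstudio-server.log \
#           /var/lib/rstudio-launcher/rstudio-launcher.log \
#           /var/lib/rstudio-launcher/Slurm/rstudio-slurm-launcher.log
```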

Question: How does RStudio use Slurm?

Answer: The RStudio Slurm Launcher Plugin uses the Slurm command line tools to control the Slurm cluster. Commands are run as either:

  • the user starting or viewing the job
  • the slurm-service-user (configured in the launcher.slurm.conf)

Note: The slurm-service-user should be a user that has administrative access to the Slurm cluster - they should be able to see job details for all users.

Question: What does the RStudio Launcher Host require?

Answer: The RStudio Launcher Host requires:

  • the same slurm.conf file as the desired Slurm cluster
  • network access to all of the Slurm compute nodes and the control node
  • the Slurm command line tools installed (please see below)
  • file-sharing configured with compute nodes

It is not necessary to have slurmctld (the Slurm Control Daemon) or slurmd (the Slurm Compute Daemon) running on the RStudio Launcher Host. However, slurmctld must be running on the Slurm Control Node, and slurmd must be running on at least one Slurm Compute Node.

Note: If you are using an authentication plugin (an add-on to Slurm that manages user authentication across the cluster), that plugin does have to be installed and running. The Slurm Quick Start Administrator Guide refers to MUNGE, which is the recommended authentication plugin.

To get an idea of Slurm's architecture, please see the diagram in Slurm's Quick Start User Guide.
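
One of the requirements above, having the Slurm command line tools installed, can be checked quickly from a shell on the RStudio Launcher Host. This is a minimal sketch; the helper name missing_tools is hypothetical, and the tool list in the usage line is taken from the commands named in this article.

```shell
# missing_tools is a hypothetical helper: print each named tool
# that cannot be found on the current user's PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage on the RStudio Launcher Host (no output means all tools were found):
# missing_tools sinfo sbatch scontrol squeue sstat
```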

Question: What are some of the common Slurm command line tools mentioned above and where do I run these?

Answer: See below for the common commands. Run these commands from the RStudio Launcher Host machine. Do not run them as root; run them either as the user experiencing issues or as the slurm-service-user. Unless otherwise specified, run them as the user experiencing issues.

  • sinfo - to list queues/partitions for the New Session and Run Script dialogs. Good to check general connectivity.
  • sinfo --format=%R --noheader - to check current queues/partitions
  • sbatch - used to submit jobs
  • scontrol show job - used to view the configuration and state of jobs
  • scontrol show job [job id] - used to view the configuration and state of a specific job
  • squeue - used to get job status updates (note that this command is always run by the slurm-service-user)
  • sstat - used to get resource utilization metrics (note this is always run by the slurm-service-user). This command requires that Job Account Gathering is enabled in slurm.conf
  • tail -f - used to stream job output data
  • For more information on the Slurm command line tools, please view Slurm's cheat sheet.
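
To keep track of which account each check should run under, the rule above can be written down as a small helper. This sketch only prints the sudo command line and does not execute anything. The name slurm_check_cmd and the account names in the usage lines are hypothetical.

```shell
# slurm_check_cmd is a hypothetical helper. It prints, but does not run,
# the sudo command line for a Slurm tool under the account this article
# says to use for it.
slurm_check_cmd() {
    affected_user="$1"
    service_user="$2"
    shift 2
    case "$1" in
        squeue|sstat) account="$service_user" ;;   # always the slurm-service-user
        *)            account="$affected_user" ;;  # the user experiencing issues
    esac
    echo "sudo -u $account $*"
}

# Usage (alice and slurmsvc are placeholder account names):
# slurm_check_cmd alice slurmsvc sinfo --format=%R --noheader
# slurm_check_cmd alice slurmsvc squeue
```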

Question: Do you have general guidance on troubleshooting issues with the Slurm Launcher Plugin?

Answer: Start by looking for errors in the output of sudo rstudio-launcher status. If there are no errors or if they are vague, enable debug logging and check /var/lib/rstudio-launcher/Slurm/rstudio-slurm-launcher.log. This documentation and this FAQ also have useful information.

Question: How do I troubleshoot a version warning?

Answer: You may see version warnings if you are not using the only supported Slurm version (19.05.0). We recommend checking rstudio-slurm-launcher.log for errors parsing Slurm commands.

Question: How do I troubleshoot startup failures? The Slurm Launcher Plugin does not seem to be working.

Answer: 

  • Is the Slurm cluster running? 
    • If no, start the Slurm Cluster and try again. If the Slurm Cluster is still not running, we recommend checking the log files set by SlurmctldLogFile and SlurmdLogFile (both configured in slurm.conf) for errors.
  • Are the Slurm command line tools installed on the RStudio Launcher Host?
  • If the Slurm cluster is running and the Slurm command line tools are installed, is the output of running sinfo from the RStudio Launcher Host correct?
    • It is recommended to double-check that the slurm.conf on the RStudio Launcher Host is the same as the slurm.conf on the desired Slurm Cluster. If it is not, update the RStudio Launcher slurm.conf and then have the user try again.
    • Can the DNS and/or IP Address of the Slurm nodes be resolved? Try running ping <slurm control node hostname> from the RStudio Server. If this fails, we'd suggest updating your /etc/hosts as necessary.
    • If yes and you continue to have problems, we'd recommend contacting RStudio Support.
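
The slurm.conf comparison mentioned above can be done with cmp once a copy of the cluster's file is available on the RStudio Launcher Host. This is a sketch under that assumption; how the copy is obtained is up to you, and the helper name compare_slurm_conf is hypothetical.

```shell
# compare_slurm_conf is a hypothetical helper: report whether two
# slurm.conf copies are byte-for-byte identical.
compare_slurm_conf() {
    for f in "$1" "$2"; do
        if [ ! -r "$f" ]; then
            echo "cannot read: $f" >&2
            return 2
        fi
    done
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
        return 1
    fi
}

# Usage (both paths are placeholders):
# compare_slurm_conf <launcher host slurm.conf> <copy of cluster slurm.conf>
```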

Question: How do I troubleshoot missing queues/partitions?

Answer: If there are missing queues/partitions in any of the job launcher dialogs, the user should check the output of sinfo --format=%R --noheader. This would be run as any user experiencing the problem. If the list here is wrong or not expected, the Slurm configuration should be investigated. If the list is correct, please contact RStudio Support.
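
If you know which queues/partitions you expect to see, the comparison can be scripted. The sketch below reads the sinfo output on standard input and prints any expected partition that is absent. The helper name missing_partitions and the partition names in the usage line are hypothetical.

```shell
# missing_partitions is a hypothetical helper. It reads partition names
# (one per line) on stdin and prints each expected name that is absent.
missing_partitions() {
    # Strip blanks so padded output still matches exactly.
    actual="$(tr -d '[:blank:]')"
    for p in "$@"; do
        printf '%s\n' "$actual" | grep -qxF "$p" || echo "$p"
    done
}

# Usage, run as the user experiencing the problem
# (general and gpu are placeholder partition names):
# sinfo --format=%R --noheader | missing_partitions general gpu
```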

Question: What should I do if I have Job or Session failures?

Answer: 

  • To the Slurm Launcher Plugin, a session is just a job
  • Run scontrol show job, does the job appear?
    • No - the errors should be in the Slurm Launcher Plugin log file
    • Yes - the errors should be in the job error output (see below)

Screen_Shot_2020-03-26_at_1.58.25_PM.png

Question: How do I troubleshoot a job status that is not updating?

Answer: 

  • Not all Slurm job states are reflected as separate RStudio Job Statuses
  • Has the job status actually changed? Run squeue --state=all --Format=jobid:10,name:75,username,state as the slurm-service-user and check the output. If the state has changed in Slurm but RStudio still shows the old status, please contact RStudio Support.
  • The mapping between RStudio Job Status and Slurm Job State is shown below

Screen_Shot_2020-03-26_at_2.04.05_PM.png
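
When the cluster has many jobs, it can help to pull out the state of a single job from the squeue output. This sketch assumes the job ID is the first column and the state is the last column, which holds for the format string used in the usage line. The helper name job_state is hypothetical.

```shell
# job_state is a hypothetical helper. It reads squeue output on stdin and
# prints the last column (the state) of the line whose first column is
# the given job ID.
job_state() {
    awk -v id="$1" '$1 == id { print $NF }'
}

# Usage, run as the slurm-service-user:
# squeue --state=all --noheader --Format=jobid:10,name:75,username,state | job_state <job id>
```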

Question: There is no job output, how do I fix this?

Answer: 

  • Can the job output file be reached? Try running ls -l <StdOut or StdErr path>
  • If yes, what does cat <StdOut or StdErr path> look like?
  • If both of those look normal, we'd recommend contacting RStudio Support.

Screen_Shot_2020-03-26_at_2.10.47_PM.png
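
The StdOut and StdErr paths can be pulled out of the scontrol show job output rather than copied by hand. This is a minimal sketch that assumes each of the two fields appears at the start of its own line, as Key=Value. The helper name job_output_paths is hypothetical.

```shell
# job_output_paths is a hypothetical helper. It reads `scontrol show job`
# output on stdin and prints the values of the StdOut and StdErr fields.
job_output_paths() {
    sed -n -e 's/^[[:space:]]*StdOut=//p' -e 's/^[[:space:]]*StdErr=//p'
}

# Usage: list each output file to confirm it exists and is readable.
# scontrol show job <job id> | job_output_paths | while IFS= read -r f; do ls -l "$f"; done
```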

Question: I can't enter a Session, how do I fix this?

Answer: The steps below assume the session status is shown as idle on the RStudio Server Pro Home Page.

  • Can the session job output be read? Try checking the job details page.
  • Can all Slurm compute nodes be reached by the RStudio Server Host?
  • Is there a firewall preventing a connection over the Session Port (a random port from the ephemeral port range)?
  • If the answers to the above are Yes, Yes, and No, the next step is to diagnose the session issue as you would without the Launcher.

Question: Why am I not seeing any resource metrics?

Answer:

  • Is Slurm's Job Account Gathering feature enabled? If not, please view Slurm's jobacct_gather plugin configuration to get it configured.
  • Is the resource metric data printed when sstat --format=AveCPU,AveVMSize,AveRSS -j <job id> is run as the slurm-service-user for a running job?
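
To check the first point without reading slurm.conf by eye, the sketch below prints the value of the JobAcctGatherType setting from a given slurm.conf. The helper name job_acct_gather_type is hypothetical. An empty result means the setting is absent from that file, in which case Slurm uses its default (jobacct_gather/none, to the best of our knowledge), meaning no metrics are gathered.

```shell
# job_acct_gather_type is a hypothetical helper. It prints the value of the
# last uncommented JobAcctGatherType line in the given slurm.conf.
job_acct_gather_type() {
    # -i because slurm.conf keywords are case-insensitive; lines starting
    # with '#' do not match the pattern and are skipped.
    grep -i '^[[:space:]]*JobAcctGatherType[[:space:]]*=' "$1" \
        | tail -n 1 \
        | sed -e 's/^[^=]*=[[:space:]]*//' -e 's/[[:space:]]*#.*$//' -e 's/[[:space:]]*$//'
}

# Usage (the path is a placeholder):
# job_acct_gather_type <path to slurm.conf>
```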

Question: What do these log entries mean?

Screen_Shot_2020-03-26_at_2.26.01_PM.png
