R scripts under LSF


R is a programming language and software environment for statistical computing and graphics. It is popular among statisticians, partly because it can easily be extended with specialized packages. There are many resources online for learning how to use it. Here we only give some instructions on using R scripts in batch jobs.

R is installed on the HPC cluster and is accessible from all the nodes. As usual, more than one version is installed, to provide compatibility with scripts written for older versions. By default, R starts in interactive mode. This can be useful for testing, for small or short analyses that require user supervision, or for using graphics, for example. To start the latest installed version from a ThinLinc session, you can select it from the menu: “Applications Menu” -> DTU -> Statistics -> R.

Alternatively, open a terminal and type the full path of the program:

/appl/R/bin/R-3.4.1

NOTE

If you need a version that is not the latest one, you can only start it from the command line, specifying the correct path. To get the list of available versions, type the command:
ls /appl/R/bin/R-*
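
If a script depends on a particular version, it can help to check that the executable exists before using it. The following is a minimal sketch, assuming the R-<version> naming shown above. The helper name r_path_for and the variable R_BIN_DIR are hypothetical, introduced only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: build the path of a versioned R executable and
# check that it exists and is executable. R_BIN_DIR defaults to the
# directory mentioned in the text, but can be overridden.
R_BIN_DIR=${R_BIN_DIR:-/appl/R/bin}

r_path_for() {
    candidate="$R_BIN_DIR/R-$1"
    if [ -x "$candidate" ]; then
        printf '%s\n' "$candidate"
    else
        echo "R version $1 not found in $R_BIN_DIR" >&2
        return 1
    fi
}

# Usage: print the path if the version is installed, a hint otherwise.
if R_exe=$(r_path_for 3.4.1); then
    echo "Using $R_exe"
else
    echo "Pick another version from: ls $R_BIN_DIR/R-*"
fi
```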

R in batch mode

Once you have collected all the R instructions in a text file, for example My_script.R, there is more than one way to run it in batch mode, i.e. without opening an interactive R session. The two main options are:

  • R CMD BATCH My_script.R
  • Rscript My_script.R

Both are available on the HPC cluster, so in principle both can be used. In both cases, the R program is called, and the R instructions written in the file My_script.R are executed sequentially.

One big difference is that with Rscript the output of the commands is sent to the stdout stream, while with R CMD BATCH ... nothing is sent to stdout: both an echo of the commands and their output are saved to a file, called My_script.Rout by default.

The first way, R CMD BATCH ..., is the more traditional one.
By explicitly using the path of the R (or Rscript) executable, you can also select a specific R (or Rscript) version. Either option is suitable for use in a batch script to be submitted to the cluster.
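
The difference in where the output ends up can be sketched as follows. This is only an illustration: the file names My_run_1.Rout and My_script.log are placeholders, the helper default_rout is hypothetical, and we assume that R CMD BATCH creates its default output file in the current working directory.

```shell
#!/bin/sh
# Hypothetical helper: name of the file that "R CMD BATCH <script>" writes
# to when no output file is given (script name without .R, plus .Rout).
# Assumption: the file is created in the current working directory.
default_rout() {
    printf '%s.Rout\n' "$(basename "$1" .R)"
}

script=My_script.R
R_exe=${R_exe:-/appl/R/bin/R-3.4.1}   # path taken from the text above

echo "R CMD BATCH $script writes to $(default_rout "$script")"

# Run only if the script and the R executable are actually there.
if [ -f "$script" ] && [ -x "$R_exe" ]; then
    "$R_exe" CMD BATCH "$script"                 # default output file
    "$R_exe" CMD BATCH "$script" My_run_1.Rout   # explicit output file
fi

# Rscript prints to stdout, so the shell decides where the output goes.
# Assumption: an Rscript executable is available on the PATH.
if [ -f "$script" ] && command -v Rscript >/dev/null 2>&1; then
    Rscript "$script" > My_script.log 2>&1
fi
```

Inside a batch job, anything written to stdout without redirection ends up in the file given with the #BSUB -o option.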

R serial batch job

Once you have a working R script that runs successfully from the command line, you have to make sure that the script does not try to display any graphics. When running on the HPC batch system, no graphical features are available, and calling them will most likely cause an error, and the job will be killed. You can of course create graphical objects and save them to files, but not “print them to the screen”. A simple script to submit to the cluster could then be:

#!/bin/sh 
### General options 
### -- specify queue -- 
#BSUB -q hpc
### -- set the job Name -- 
#BSUB -J My_Serial_R
### -- ask for 1 core -- 
#BSUB -n 1 
### -- specify that we need 2GB of memory per core/slot -- 
#BSUB -R "rusage[mem=2GB]"
### -- specify that we want the job to get killed if it exceeds 3 GB per core/slot -- 
#BSUB -M 3GB
### -- set walltime limit: hh:mm -- 
#BSUB -W 24:00 
### -- set the email address -- 
# please uncomment the following line and put in your e-mail address,
# if you want to receive e-mail notifications on a non-default address
##BSUB -u your_email_address
### -- send notification at start -- 
#BSUB -B 
### -- send notification at completion -- 
#BSUB -N 
### -- Specify the output and error file. %J is the job-id -- 
### -- -o and -e mean append, -oo and -eo mean overwrite -- 
#BSUB -o Output_%J.out 
#BSUB -e Error_%J.err

R_exe=/appl/R/bin/R-3.4.1
#export TMPDIR=Path_to_your_scratch_directory
export R_BATCH_OPTIONS="--no-save"
# -- commands you want to execute -- # 
$R_exe CMD BATCH My_serial_script.R

For a description of the meaning of the #BSUB options please refer to this page

A few options are important when running on the cluster.

  • If you have a directory on one of the scratch filesystems, please tell R to use it, with the line
    export TMPDIR=Path_to_your_scratch_directory
    Replace the string Path_to_your_scratch_directory with the absolute path to your scratch directory, of course. In this way most of the temporary files that R creates are written to and read from that filesystem, which is more suitable for frequent I/O, and you do not risk filling up your quota inadvertently. If you do not have a directory on the scratch filesystem and you think you will need one, write to us at support@hpc.dtu.dk.
  • When run in batch mode with R CMD BATCH, R by default saves a copy of the workspace in a .RData file in the directory the command is run from. These files can be quite large. If you do not need a copy of the full workspace, for example because you do not plan to reload it in another session, please tell R not to save it by adding the line:
    export R_BATCH_OPTIONS="--no-save"
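
The two settings above can be combined in the job script with a small safety check. This is a minimal sketch, not a required setup: the helper name use_scratch_tmpdir is hypothetical, and the path is the same placeholder used in the text.

```shell
#!/bin/sh
# Hypothetical helper: point TMPDIR at the given directory only if it
# exists and is writable; otherwise leave TMPDIR as it was.
use_scratch_tmpdir() {
    if [ -d "$1" ] && [ -w "$1" ]; then
        export TMPDIR="$1"
    else
        echo "Scratch directory '$1' not usable, TMPDIR left unchanged" >&2
        return 1
    fi
}

# Replace the placeholder with the absolute path to your scratch directory.
use_scratch_tmpdir Path_to_your_scratch_directory || true

# Do not write a .RData file at the end of an R CMD BATCH run.
export R_BATCH_OPTIONS="--no-save"
```

The check avoids a job that fails late only because the scratch path was mistyped.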

R Parallel batch job

Parallelism has been available in R since version 2.14 through the parallel package, which allows running in parallel on many cores in a single node, or on many nodes. R can also make use of multithreaded libraries (like OpenBLAS) and therefore run on many cores on a single machine, or use other packages that provide different levels (and kinds) of parallelism, as described for example in this page. It is therefore not easy to provide a template that fits all kinds of usage, present and especially future. However, we want to provide some general hints.

  • Going parallel does not guarantee an improvement. Depending on the specific nature of the problem and the package chosen, there can be little or no gain in speed. Where there is a speedup, it may only be meaningful up to a certain number of cores/nodes, beyond which performance gets worse. In some situations, the parallelized problem can also need more memory, even more than the total memory available on the machine.
    Therefore, please do not run your job in parallel if you have not previously tested that it is really worth it. If you need help, please write to us at support@hpc.dtu.dk, and we will try to help.
  • In our experience, some of the parallel packages are not mature enough to be used out of the box in a cluster environment. Two basic “mistakes” are that these packages assume by default that:
    • the user has a machine exclusively for himself/herself, and therefore all the processors/cores available on the machine can be used by R.
    • there are enough resources for nested parallelism: one process can create N child-processes and these can also create N child-processes, and so on.

    Both assumptions are wrong. On our cluster, unless you request a full node for yourself, it is likely that you are sharing the resources with someone else. And even if you are alone, nested parallelism is not necessarily a good option, because using more threads/processes than the number of available physical cores simply makes the run slower.
    Therefore, please find a way to pass the total number of processes to the R script explicitly; do not rely on the auto-discovery mechanism of the package.

That said, a generic template for a parallel R run on a single node could look like the following:

#!/bin/sh 
### General options 
### -- specify queue -- 
#BSUB -q hpc
### -- set the job Name -- 
#BSUB -J My_Parallel_R
### -- ask for 10 cores -- 
#BSUB -n 10 
### -- specify that the cores must be on the same host -- 
#BSUB -R "span[hosts=1]"
### -- specify that we need 2GB of memory per core/slot -- 
#BSUB -R "rusage[mem=2GB]"
### -- specify that we want the job to get killed if it exceeds 3 GB per core/slot -- 
#BSUB -M 3GB
### -- set walltime limit: hh:mm -- 
#BSUB -W 24:00 
### -- set the email address -- 
# please uncomment the following line and put in your e-mail address,
# if you want to receive e-mail notifications on a non-default address
##BSUB -u your_email_address
### -- send notification at start -- 
#BSUB -B 
### -- send notification at completion -- 
#BSUB -N 
### -- Specify the output and error file. %J is the job-id -- 
### -- -o and -e mean append, -oo and -eo mean overwrite -- 
#BSUB -o Output_%J.out 
#BSUB -e Error_%J.err

R_exe=/appl/R/bin/R-3.4.1
#export TMPDIR=Path_to_your_scratch_directory
export R_BATCH_OPTIONS="--no-save"
# -- commands you want to execute -- # 
$R_exe CMD BATCH My_parallel_script.R

The main difference in the script is that we ask for more than one core:

#BSUB -n 10
#BSUB -R "span[hosts=1]"

In this example, the request is for 10 cores on a single node. Then, instead of letting the R package discover how many cores are available, read the environment variable LSB_DJOB_NUMPROC from inside R, for example with

cores <- as.numeric(Sys.getenv('LSB_DJOB_NUMPROC'))

Then use this variable to set the number of processes/threads in your R run. In this way, you are sure that:

  • you do not start more processes than the cores that you have actually requested.
  • you do not have to change the R-script, if you decide to make another run with a different number of cores.
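
The same variable can also be used in the job script to keep multithreaded libraries from starting more threads than requested, which is one way to avoid the nested parallelism described earlier. The following is a sketch under stated assumptions: the helper threads_per_worker and the variable workers are hypothetical, and we assume that the libraries in use honour the standard OMP_NUM_THREADS and OPENBLAS_NUM_THREADS environment variables.

```shell
#!/bin/sh
# Hypothetical helper: threads each worker process may use, computed as
# cores / workers (integer division) and never less than 1.
threads_per_worker() {
    t=$(( $1 / $2 ))
    [ "$t" -ge 1 ] || t=1
    echo "$t"
}

# Cores granted by LSF; fall back to 1 when the variable is unset or
# empty, for example when testing outside a batch job.
cores=${LSB_DJOB_NUMPROC:-1}

# Assumption: the R script starts one worker process per core.
workers=$cores

# Cap the threads started by OpenMP and OpenBLAS inside each worker.
export OMP_NUM_THREADS="$(threads_per_worker "$cores" "$workers")"
export OPENBLAS_NUM_THREADS="$OMP_NUM_THREADS"

echo "cores=$cores workers=$workers threads per worker=$OMP_NUM_THREADS"
```

If the R script uses a multithreaded library without starting worker processes of its own, setting workers to 1 gives all the requested cores to that library.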

NOTE

Running R across multiple nodes is not trivial. A lot depends on the specific package providing the parallelism. Almost all these packages rely on an underlying communication layer, and even if the package installation is successful, communication across nodes may be impossible. For this reason, at the moment R cannot run on more than one node on our cluster. If you need to do this, write to us, and we will look into the specific case.