Batch Jobs under LSF 10


Basic picture

HPC is a multi-user environment. To ensure effective usage of the available resources, some sort of workload management is necessary. This task is accomplished by a Resource Manager (RM), a program (or sometimes a set of programs that work together) that assigns the resources to the users according to the system's current and expected load, the users' needs, and a predefined assignment policy. This means that users do not run their applications directly, but instead “ask” the system to run them, usually by means of a job script. The Resource Manager parses the script and tries to optimize the usage of the available resources, scheduling the execution of the applications (jobs) at different times, and on different nodes/cores in the cluster. There are many Resource Managers available. The one currently installed on the HPC is IBM Spectrum LSF. However, most of the description in this guide is general. You can find some specific information at IBM Spectrum LSF.

The HPC system is a shared resource and therefore there are resource limitations for each user. These are in place so there are enough resources for everybody at nearly all times. These limitations are constantly monitored and adjusted to the current load of the HPC.

The job script

The Resource Manager takes care of assigning the requested resources to the user jobs, so that many jobs can run simultaneously without interfering with each other, and of scheduling the execution of the different requests. The RM needs some user-provided information to be able to do a good job; that is why the user is required to provide a job script (also called a batch file) specifying:

  • the resources requested, and the job constraints;
  • the specific queue to use (more details later);
  • everything that is needed for a working execution environment.

By resources we mean the number of processors/cores/nodes, the amount of memory needed (total and/or per process), possibly some specific features (special hardware, for example), and so on.

Job constraints are typically time constraints, i.e. for how long you are reserving the resources. Notice that there are limits on the maximum number of cores, on the number of jobs, and on the run time.

The concept of a queue is quite intuitive: since not all applications/jobs run immediately, they are ordered in a queue for delayed execution. However, many different queues can be managed by the same scheduler. At DTU, for example, the RM manages different clusters, some of which are open to all users and some only to selected groups, as well as different queues for different kinds of applications (serial vs parallel, short vs long, and others), with different restrictions.

Then you have to provide a functioning execution environment: remember that your application will be run when scheduled by the RM, so the application must be able to run without user intervention (sometimes called unattended execution). This means that the correct specification of the executables must be provided, with all the necessary input files, and a correct specification of the environment (for example libraries, modules…).

Notice that the RM reserves the resources for your job (and runs it when enough are available, avoiding conflicts with other users’ jobs) based on your request. It is therefore important that the resources requested correspond to what your application really needs.

All this information must be written in a text file, following a simple syntax that will be shown later.

Once you have this text file (the job script, let us call it submit.sh), you must submit it by typing the following command in a terminal:

$ bsub < submit.sh

Then you can check the status of your submission by issuing the command

$ bstat

More information about the commands can be found here.
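For example, a typical sequence could be the following (a minimal sketch; <job-id> stands for the numeric job ID that bsub prints, and bjobs and bkill are the standard LSF commands for listing and cancelling your own jobs):

$ bsub < submit.sh
$ bstat
$ bkill <job-id>

The first command submits the job, the second shows the status of your jobs, and the last one removes the job from the queue (or stops it, if it is already running).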

NOTE: at present, there is no way to submit jobs other than from the command line.

Preparing a job script

The first thing when preparing a batch job is to take care that the application can run unattended, without your supervision. So you have to make sure that the executables are in the right place, that all the input files are specified, and that the output gets written where expected. Let us assume that everything is set up properly, and that you would run your application from the command line as follows:

myapplication.x < input.in > output.out

that is, you run the program myapplication.x, reading the input from the file input.in and writing the output to the file output.out, in the directory where you are when you issue the command.

NOTE:

  • An important part of the execution environment is the location of files. You have to take care that the working directory, i.e. the directory where you want your files to be read and written, is correctly specified.
  • It might be necessary to specify the full path of your application, to be sure that everything goes as expected.
  • If your application needs some special environment (e.g. some special libraries), you have to make sure that they are loaded before the execution (see the sketch below).
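For example, the command section of a job script that sets up the execution environment explicitly could look like this (just a sketch: the directory, the module name, and the path to the executable are placeholders for your own setup):

# go to the directory containing the input files (placeholder path)
cd /path/to/my/workdir
# load the software environment the application needs (placeholder module name)
module load some_module
# run the application through its full path
/path/to/myapplication.x < input.in > output.out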

Example: a not so basic job script

A job script for submitting the same program to a queue could look like this:

#!/bin/sh 
### General options 
### -- specify queue -- 
#BSUB -q hpc
### -- set the job Name -- 
#BSUB -J My_Application
### -- ask for number of cores (default: 1) -- 
#BSUB -n 4 
### -- specify that the cores must be on the same host -- 
#BSUB -R "span[hosts=1]"
### -- specify that we need 4GB of memory per core/slot -- 
#BSUB -R "rusage[mem=4GB]"
### -- specify that we want the job to get killed if it exceeds 5 GB per core/slot -- 
#BSUB -M 5GB
### -- set walltime limit: hh:mm -- 
#BSUB -W 24:00 
### -- set the email address -- 
# please uncomment the following line and put in your e-mail address,
# if you want to receive e-mail notifications on a non-default address
##BSUB -u your_email_address
### -- send notification at start -- 
#BSUB -B 
### -- send notification at completion -- 
#BSUB -N 
### -- Specify the output and error file. %J is the job-id -- 
### -- -o and -e mean append, -oo and -eo mean overwrite -- 
#BSUB -o Output_%J.out 
#BSUB -e Output_%J.err 

# here follow the commands you want to execute with input.in as the input file
myapplication.x < input.in > output.out

The file starts with the line

#!/bin/sh

that tells the system to call the sh command line interpreter (shell) for interpreting the subsequent lines.

The lines that start with # are comments in the shell environment, and so the rest of the line is not executed.
Among these, the lines starting with #BSUB are interpreted by the RM as lines that contain options for the resource manager.
After the section with the options for the Resource Manager, you have to insert the command(s) for running your programs.

Some of the most important options for the RM are shown in the script:

#BSUB -J My_Application

Specifies the name of your job (-J flag), which makes it easier to check the status of your job.

#BSUB -q hpc

Specifies the queue you want your job to be run in (-q flag). Notice that different queues have different defaults, and access to specific queues can be restricted to specific groups of users.

#BSUB -W 24:00

Specifies the walltime limit (-W flag): your job will be allowed to run for AT MOST 24:00 (24 hours and 0 minutes).
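The walltime can also be given in minutes only. For example (the values are just illustrations):

### -- 30 minutes --
#BSUB -W 30
### -- 72 hours --
#BSUB -W 72:00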

#BSUB -n 4

Asks to reserve 4 cores (processors). This number is the total number of cores, which could be on one node or spread over more than one node.
It is important to specify how you want the cores to be distributed across nodes. There are three main options (-R “span[XXX=YYY]” syntax and sub-cases):

#BSUB -R "span[hosts=1]"

This means that all the cores must be on one single host. Therefore it is not possible to request more cores than the number of physical cores present on a machine.

#BSUB -R "span[ptile=N]"

This means that the scheduler will reserve the cores in groups of size N, up to the total number of cores requested with the -n flag. Each group of N cores will be assigned to a separate physical machine. Only programs supporting distributed parallelism (e.g. MPI programs) can be run in this way.

#BSUB -R "span[block=N]"

This means that the scheduler will reserve the cores in groups of size N, up to the total number of cores requested with the -n flag (like in the ptile case), but two or more groups of N cores could be assigned to the same physical machine. Only programs supporting distributed parallelism (e.g. MPI programs) should be run in this way.
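For example, with the ptile option the following combination (just an illustration of the syntax) reserves 8 cores in groups of 4, i.e. 4 cores on each of 2 different nodes:

#BSUB -n 8
#BSUB -R "span[ptile=4]"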

#BSUB -R "rusage[mem=4GB]"

(-R “rusage[mem=YYY]” syntax and sub-cases) means that your job will be run on a machine that has at least 4 GB of memory available per core (slot). So in our case, with -n 4 and -R "span[hosts=1]", the job will be dispatched to a machine with at least 16 GB of RAM available.

#BSUB -M 5GB

(-M flag) specifies the per-process memory limit for all the processes of your job. In our case, with -n 4 and -R "span[hosts=1]", the job will be killed when it exceeds 20 GB of RAM. If we had requested 2 nodes with 2 cores each (-n 4 and -R "span[ptile=2]"), the job would be killed when it exceeds 10 GB of RAM on one of the nodes. Both these memory specifications accept KB, MB, GB, TB as units, and in both cases there must be no space between the number and the unit.
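Putting together the numbers of the two-node example mentioned above, a sketch of the corresponding options could be:

### -- 4 cores in total, in groups of 2, i.e. 2 cores on each of 2 nodes --
#BSUB -n 4
#BSUB -R "span[ptile=2]"
### -- reserve 4 GB of memory per core, i.e. 8 GB on each node --
#BSUB -R "rusage[mem=4GB]"
### -- kill the job if it exceeds 5 GB per core, i.e. 10 GB on one of the nodes --
#BSUB -M 5GB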

#BSUB -u your_email_address

Specifies the user email address (-u flag), so that the user can receive notifications.

#BSUB -B

Specifies that an email should be sent to the user address when the job begins, and

#BSUB -N

when the job ends, while

#BSUB -Ne

sends it only if the job fails.

#BSUB -o Output_%J.out 
#BSUB -e Output_%J.err

Specify that the standard output (-o flag) and the standard error (-e flag) streams should be saved to the named files. The %J is expanded at run time to the job-id, so these two files are unique. This is not strictly necessary, though.

NOTES:

  • Remember that UNIX is case sensitive, so uppercase and lowercase letters have different meanings, even in the job script specifications.
  • Avoid using special characters, special Danish letters, and spaces in the names of files, directories, and the job name. Otherwise the script could be interpreted by the scheduler in unexpected ways.
  • The default behaviour in LSF is to append to the output and error files. So if you don’t use unique names (like in the example), subsequent runs will add to the same files. To change the default behaviour to overwrite, use the flags -oo and -eo (see the example after these notes).
  • If you do not specify the error file but only the output one, the standard output and standard error streams are merged and saved into the output file.
  • The filename after the -o, -e flags can be a full path. If it is the name of an existing directory (ending with “/”), then LSF will automatically save the output in that directory, in a file named <jobid>.out.
  • If you don’t specify an output file at all, the content of the standard output and error streams will be appended to the report that is sent after completion (up to a system-predefined maximum size).
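For example, the following variants (the file names and the directory name are just placeholders) either overwrite the files at every run, or let LSF choose the file name inside an existing directory:

### -- overwrite instead of appending --
#BSUB -oo Output_%J.out
#BSUB -eo Output_%J.err
### -- write into an existing directory; LSF names the file <jobid>.out --
#BSUB -o my_output_dir/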

You can now write the commands you use for running your application (with input.in as the input file):

myapplication.x < input.in > output.out

This section can be a full script, or, even better, you could put all the instructions you need in an external script, for example a bash, python, or perl script, and call it from the submission script.

Notice that applications are managed with a modular approach on the HPC (see modules). This means that the software packages you need are probably not already available on the node where your application runs. So you have to take care to explicitly load the modules you need, issuing the correct

module load <list of the modules the program needs>

before the command that executes your job.
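A sketch of the complete command section, assuming the application is provided by a module called some_module (a placeholder name), could then be:

# load the software environment needed by the application (placeholder module name)
module load some_module
# run the application, reading input.in and writing output.out
myapplication.x < input.in > output.out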

Requesting specific CPU models/resources

In some cases it can be important to select machines with specific features, like the CPU type or the instruction set supported by the CPU. In LSF this is possible with select strings in the resource requirement option.
For example, the line

#BSUB -R "select[model == XeonE5_2660v3]"

tells the scheduler to run the job on a machine with the XeonE5_2660v3 CPU model only.

#BSUB -R "select[avx2]"

tells the scheduler to dispatch the job to a machine that supports the avx2 instruction set.
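Select strings can also be combined with the other resource requirements seen above. For example, a line like the following (a sketch of the syntax) asks for a machine that supports avx2, with 4 GB of memory per core and all cores on the same host:

#BSUB -R "select[avx2] rusage[mem=4GB] span[hosts=1]"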

To get a list of the resources/models that can be selected, use the command nodestat followed by the queue name, for example:

nodestat -F hpc

More details here.

Important NOTE

Not all the options mentioned above are strictly necessary, and for some of them there are default values: when an option is missing, its default value is used. However, relying on the defaults is not always a good idea. Actually, it is almost never a good idea. The defaults can differ between installations, or between different parts of the cluster in the same installation, and can also be changed without notification to the user. Therefore it is better to specify at least the most important options.

Some parameters are so important that we decided to enforce them if they are missing. These are:

  • jobname (option -J)
    When missing, it is set to -J NONAME.
  • walltime limit (option -W)
    When missing, it is set to -W 15 (15 minutes).
  • output file name (options -o, -oo)
    When missing, it is set to -o jobname_%J.out.
  • memory limit (option -M)
    When missing, if -R “rusage[mem=XXX]” is present, it is set to -M XXX, otherwise to -M 1024MB.
  • memory requested (option -R “rusage[mem=XXX]”)
    When missing, if -M XXX is present, it is set to -R “rusage[mem=XXX]”, otherwise it is set to -R “rusage[mem=1024MB]”.
  • core distribution across nodes (option -R “span[XXX=YYY]”)
    If missing, when more than one core is requested, it is set to -R “span[hosts=1]”, i.e. single node.

These settings are shown as warnings at the moment of job submission. You are nevertheless encouraged to set your own values, because these defaults are far from ideal in most cases.
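As a reference, a minimal job script header that sets all the enforced parameters explicitly (the values are only placeholders to be adapted to your own job) could look like this:

#!/bin/sh
### -- queue, job name, and walltime limit --
#BSUB -q hpc
#BSUB -J my_job
#BSUB -W 24:00
### -- number of cores, core distribution, and memory per core --
#BSUB -n 4
#BSUB -R "span[hosts=1]"
#BSUB -R "rusage[mem=4GB]"
#BSUB -M 5GB
### -- output file --
#BSUB -o my_job_%J.out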