When using a cluster, submitting the job is not necessarily the last step.
Good job efficiency is in the interest of the individual user, because the results are obtained faster.
But it is also in the interest of all the users of the cluster: if the resources are used efficiently, more resources are available for everyone.
On this page we give some hints on what to do before the job submission, to prepare a jobscript that suits the user's needs, and after the job submission, to make sure that the job is running as expected.
Before the job submission: estimating the resources needed
When writing a jobscript, the user has to decide how many resources to request. The most relevant ones are:
- CPU cores/number of nodes
- GPU
- total amount of RAM
- walltime
In an ideal world the user should know precisely what the program to run is capable of, and what resources it needs. But in reality, sometimes it is a matter of testing.
CPU cores/number of nodes
Is the program serial (i.e. it is not parallel)?
Then select 1 core only.
#BSUB -n 1
#BSUB -R "span[hosts=1]"
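For reference, a complete minimal jobscript for a serial program could look like the following sketch; the queue name, memory request, walltime and program name are just placeholders to be adapted (memory and walltime requests are discussed further down this page):
#!/bin/sh
#BSUB -q hpc
#BSUB -J serial_job
#BSUB -n 1
#BSUB -R "span[hosts=1]"
#BSUB -R "rusage[mem=4GB]"
#BSUB -W 2:00
#BSUB -o serial_job_%J.out
#BSUB -e serial_job_%J.err

# run the (serial) program
./my_program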
Is the program parallel?
Well, it is up to the user to determine a suitable number of CPU cores. Remember that not all programs scale well, so there can be a sweet spot: a number of cores that gives the best compromise between the amount of resources and the execution speed. Have a look here, if in doubt.
Broadly speaking, parallelism can come in different flavours:
- Shared memory parallelism: programs can start different threads that communicate through the system memory (RAM), so they can only run on a single node.
- Distributed memory parallelism: programs can start different processes that communicate via the network (using Ethernet and/or Infiniband), so they can run on multiple nodes.
For shared memory programs:
select a number of cores that does not exceed the maximum number of cores on a single machine. And in the jobscript use the syntax:
#BSUB -n N
#BSUB -R "span[hosts=1]"
where N stands for the number of CPU cores you want to use.
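As an illustration, a jobscript for a shared memory (e.g. OpenMP) program using 8 cores on a single node could look like this sketch; it assumes the program reads the OMP_NUM_THREADS variable, and the queue name, memory and walltime are placeholders:
#!/bin/sh
#BSUB -q hpc
#BSUB -J omp_job
#BSUB -n 8
#BSUB -R "span[hosts=1]"
#BSUB -R "rusage[mem=2GB]"
#BSUB -W 4:00
#BSUB -o omp_job_%J.out

# start as many threads as the number of cores allocated by LSF
export OMP_NUM_THREADS=$LSB_DJOB_NUMPROC

./my_openmp_program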
For distributed memory programs:
select the total number of CPU cores you want to use, and the number of cores per node/machine with the syntax:
#BSUB -n N
#BSUB -R "span[ptile=M]"
where N is the total number of cores requested, with M cores reserved on each machine. The number of machines needed will be determined automatically by the scheduler. Clearly, only M < N is meaningful.
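As a sketch, a jobscript for an MPI program using 32 cores spread over nodes with 16 cores each could look like the following; the queue, module and program names are placeholders, and it assumes an MPI implementation providing mpirun is available:
#!/bin/sh
#BSUB -q hpc
#BSUB -J mpi_job
#BSUB -n 32
#BSUB -R "span[ptile=16]"
#BSUB -R "rusage[mem=2GB]"
#BSUB -W 12:00
#BSUB -o mpi_job_%J.out

# load an MPI implementation (the module name is site specific)
module load mpi

# mpirun picks up the list of allocated hosts from LSF
mpirun ./my_mpi_program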
GPU
If the program requires GPUs, the job must reserve one or more GPUs, and it must be submitted to a queue with GPUs.
The syntax for reserving GPUs is:
#BSUB -gpu "num=1:mode=exclusive_process"
Always request the GPU with the clause mode=exclusive_process. Otherwise the GPU is shared with other users, and if it runs out of memory, the jobs of all the users sharing it will be killed.
To request more than one GPU, just change the num=1 part. You can check how many GPUs a node has with the command nodestat -g queue_name.
Do not request more than one GPU unless you are sure that your program can actually make use of multiple GPUs.
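Putting it together, a jobscript for a single-GPU program could look like this sketch; the queue name (here gpuv100), the module and the program name are placeholders to be adapted to the GPU queues actually available on the cluster:
#!/bin/sh
#BSUB -q gpuv100
#BSUB -J gpu_job
#BSUB -n 4
#BSUB -R "span[hosts=1]"
#BSUB -R "rusage[mem=8GB]"
#BSUB -gpu "num=1:mode=exclusive_process"
#BSUB -W 8:00
#BSUB -o gpu_job_%J.out

# load the toolkit needed by the program (the module name is site specific)
module load cuda

./my_gpu_program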
Total amount of RAM
It is important to specify the amount of RAM needed by the job; otherwise the job is assigned a default value (2 GB per core).
Start by overestimating the need by a reasonable amount, and then adjust the value for the next jobs based on the actual usage. For example, if the estimate is 32 GB, and you are requesting 1 core, use the lines:
#BSUB -n 1
#BSUB -R "rusage[mem=32GB]"
If instead you are requesting 8 cores, remember that the memory request is per core, so it should be:
#BSUB -n 8
#BSUB -R "rusage[mem=4GB]"
One can always check how much memory is available on the nodes of a specific queue with the command nodestat -F queue_name.
Walltime
Different queues have different maximum walltimes. Try to estimate the walltime needed, with a reasonable margin. Asking for the maximum walltime even when it is not necessary can lead to a longer waiting time in the queue.
Remember that the actual runtime of the same job can differ if it runs on nodes with different CPU models.
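For reference, the walltime is requested with the -W option of bsub, in the form hours:minutes; for example, to ask for 24 hours:
#BSUB -W 24:00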
When the job is running
We collected the commands to check the status of a job here. In this section we want to provide some hints on how to use them to investigate the job performance. The critical metrics that need to be checked are:
- CPU core utilization: are all the cores used all of the time?
- Memory utilization: is the amount of memory used close to the amount of memory reserved?
- GPU utilization: are the GPUs used efficiently?
A poor utilization of CPUs, GPUs and memory is a waste of resources that affects all the users.
We provide a few command line tools to investigate the job efficiency; here we give some hints on how to use them.
CPU core utilization
To monitor the CPU utilization one can use the command:
bstat -C job-id
where job-id is the id of the specific job.
Note: You have to wait at least a few minutes after the job started to have statistically significant measurements.
A good output could be:
JOBID    USER      QUEUE   JOB_NAME   NALLOC   ELAPSED   EFFIC
112233   s123456   hpc     job_123    20       3:47:38   98.34
This shows that the job named job_123, requesting 20 cores, has an average efficiency of 98.34% after almost 4 hours. This means that the 20 cores are used at almost 100% on average.
A not-so-good output could be:
JOBID    USER      QUEUE   JOB_NAME   NALLOC   ELAPSED    EFFIC
112234   s123456   hpc     job_124    24       24:28:56   49.60
Here the job is requesting 24 cores, but it is only using them at 49.60% efficiency.
This can be due to the fact that the user requested 24 cores but instructed the program to only use 12. Or it can be that the program is actually using 24 cores, but they are only used, on average, at 50%, or 50% of the time. In both cases, this job needs attention.
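One way to avoid the first kind of mismatch is to let the jobscript pass the allocated core count to the program instead of hard-coding a number. This is only a sketch: the --threads option is hypothetical, while LSB_DJOB_NUMPROC is the variable that LSF sets to the number of cores allocated to the job.
# pass the number of cores allocated by LSF to the program,
# instead of a hard-coded value (--threads is a hypothetical option)
./my_program --threads $LSB_DJOB_NUMPROC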
It is a good idea to do some extra tests in an interactive session, where users have direct access to debugging tools.
Note: a job with 100% efficiency can potentially hide an oversubscription problem, where the program is running a number of processes larger than the number of cores reserved. But this is more difficult to debug on the cluster.
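For instance, while the program runs in an interactive session, one can watch the per-core load and count the threads of a process with standard Linux tools; this minimal sketch only assumes that top and ps are available on the node:
# live view of the processes of the current user;
# press '1' inside top to see the load of each individual core
top -u $USER

# count the threads of a specific process (replace PID with the actual process id)
ps -o nlwp= -p PID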
Memory utilization
Ideally a job should request enough memory for the execution, but not much more than that amount. Otherwise, the scheduler assumes that the job will need all of the requested memory, and will not schedule other jobs on the same machine. If the reserved memory is not actually used, the CPU cores on that machine that are not reserved by the job will simply sit idle, and other users will have to wait.
For this reason, there is also a total memory limit enforced, so that a single user cannot use more than a maximum amount of RAM simultaneously in a cluster queue. So requesting memory and not using it is always a bad idea.
To monitor the memory utilization of a job while it is running, use the command:
bstat -M job-id
where job-id is the id of the specific job.
A bad job looks like this:
JOBID    USER      QUEUE   JOB_NAME   NALLOC   MEM      MAX      AVG      LIM
112235   s123456   hpc     job_bad    8        804.0M   804.0M   804.0M   160.0G
Here one can see that the job has a peak memory usage of only 804 MB, but it requested 160 GB, almost 200 times more.
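In a case like this, the next submission of the same job should request an amount close to the observed peak plus a reasonable safety margin, for example something like:
#BSUB -R "rusage[mem=2GB]"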
GPU utilization
Debugging the GPU utilization in a batch job is not trivial. The best recommendation is to make some test runs in interactive sessions and check:
- that the GPU is actually used;
- if multiple GPUs are requested, that all the GPUs are used (not all programs support multiple GPUs, and even when they do, using more than one is not automatic);
- the efficiency of the GPU usage.
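During such an interactive test, assuming the node has NVIDIA GPUs and the nvidia-smi tool is available, the GPU utilization and GPU memory usage can be watched while the program runs with, for example:
# refresh the GPU utilization and memory usage every 2 seconds
watch -n 2 nvidia-smi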
After the job has finished
After the job terminates, the scheduler produces a report with some important pieces of information regarding the job. This can be used to evaluate if the original assumptions about the job requirements were adequate or not, and adjust them for future jobs.
Here is an example of a snippet from the last part of the report:
Resource usage summary:

    CPU time :                                   24524.00 sec.
    Max Memory :                                 2265 MB
    Average Memory :                             2145.12 MB
    Total Requested Memory :                     131072.00 MB
    Delta Memory :                               128807.00 MB
    Max Swap :                                   1 MB
    Max Processes :                              21
    Max Threads :                                57
    Run time :                                   3163 sec.
    Turnaround time :                            3100 sec.

Read file for stdout output of this job.
Read file for stderr output of this job.
The most relevant entries are those with information about the efficiency, the memory usage, and hints at oversubscription.
Efficiency
The ratio CPU time / Run time gives the average number of cores that have been used during the job. This number should be as close as possible to the number of cores requested.
In the current example, it is 24524/3163 ≈ 7.8, which is fairly close to the 8 cores requested.
If the ratio is significantly smaller than the number of requested cores, then there are most likely scaling issues.
Memory usage
The memory usage is clearly reported. The entry Delta Memory shows how much excess memory has been requested: it is the difference between the Total Requested Memory and the Max Memory. This value should ideally be minimized. The job in the example is a bad one, requesting 131072 MB and only using 2265 MB.
Oversubscription
Oversubscribing happens when the program starts more processes/threads than the number of physical CPU cores requested.
This is quite normal, and it is not a problem as long as the number of processes/threads that are actually doing computations is smaller than or equal to the number of physical cores.
But when a program, for example, requests 1 single CPU core but starts 24 computational threads, then all those threads will have to take turns running on the same core, slowing down each other significantly.
Unfortunately, this is not visible in the efficiency, because the efficiency will always be 100%.
LSF reports Max Processes and Max Threads, and those can be used to get a hint about potential oversubscription.
If the number of processes/threads is very large compared to the requested cores, it could be a good idea to have a closer look at the program. But it is only a hint. Complex programs like MATLAB, COMSOL, ANSYS start a lot of auxiliary processes/threads.
On the other hand, programs and libraries that are known to be prone to oversubscription include, among others, julia, python's multiprocessing library, some R libraries, the cplex library, gams, and gurobi.
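A common way to keep such libraries under control, assuming they honour the standard threading environment variables, is to set the relevant variables explicitly in the jobscript before starting the program, for example:
# limit the threads started by common threaded libraries
# to the number of cores allocated by LSF
export OMP_NUM_THREADS=$LSB_DJOB_NUMPROC
export MKL_NUM_THREADS=$LSB_DJOB_NUMPROC
export OPENBLAS_NUM_THREADS=$LSB_DJOB_NUMPROC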