When using a cluster, submitting the job is not necessarily the last step.
Good job efficiency is in the interest of each individual user, so that the results can be obtained faster.
But it is also in the interest of all the users of the cluster: if the resources are used efficiently, then more resources are available for everyone.
In this page we give some hints on what to do before job submission, to prepare a jobscript that is suitable for the user's needs, and after job submission, to make sure that the job is running as expected.
Before the job submission: estimating the resources needed
When writing a jobscript, the user has to decide how many resources to request. The most relevant are:
- CPU cores/number of nodes
- total amount of RAM
In an ideal world the user would know precisely what the program is capable of and what resources it needs. In reality, it is often a matter of testing.
CPU cores/number of nodes
Is the program serial (i.e. it is not parallel)?
Then select 1 core only.
#BSUB -n 1
#BSUB -R "span[hosts=1]"
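Putting this together, a minimal serial jobscript could look like the following sketch (queue name, walltime, and program name are placeholders to adapt to your cluster):

```shell
#!/bin/sh
### Minimal serial jobscript (sketch; queue name and walltime are examples)
#BSUB -q hpc
#BSUB -J serial_job
#BSUB -n 1
#BSUB -R "span[hosts=1]"
#BSUB -W 1:00
#BSUB -o serial_job_%J.out
#BSUB -e serial_job_%J.err

./my_serial_program   # hypothetical program name
```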
Is the program parallel?
Well, it is up to the user to determine a suitable number of CPU cores. Remember that not all programs scale well, so there can be a sweet spot: a number of cores that gives the best compromise between the amount of resources and the execution speed. Have a look here, if in doubt.
Broadly speaking, parallelism can come in different flavours:
- Shared memory parallelism: programs can start different threads that communicate through the system memory (RAM), so they can only run on a single node.
- Distributed memory parallelism: programs can start different processes that communicate via the network (using Ethernet and/or InfiniBand), so they can run on multiple nodes.
For shared memory programs:
select a number of cores that does not exceed the number of cores of a single machine, and use the following syntax in the jobscript:
#BSUB -n N
#BSUB -R "span[hosts=1]"
N stands for the number of CPU cores you want to use.
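As a sketch, a shared-memory jobscript requesting 8 cores on one node might look like this (queue, walltime, and program name are placeholders; OMP_NUM_THREADS is the OpenMP convention for the thread count, and LSB_DJOB_NUMPROC is set by LSF to the number of allocated cores):

```shell
#!/bin/sh
### Shared-memory job: 8 threads on a single node (sketch)
#BSUB -q hpc
#BSUB -J omp_job
#BSUB -n 8
#BSUB -R "span[hosts=1]"
#BSUB -W 4:00

# Tell the program how many threads it may start (OpenMP convention).
export OMP_NUM_THREADS=$LSB_DJOB_NUMPROC
./my_threaded_program   # hypothetical program name
```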
For distributed memory programs:
select the total number of CPU cores you want to use, and the number of cores per node/machine with the syntax:
#BSUB -n N
#BSUB -R "span[ptile=M]"
N is the total number of cores requested, with M cores reserved on each machine. The number of machines needed will be automatically determined by the scheduler. Clearly, only M < N is meaningful.
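For example, a distributed-memory jobscript requesting 32 cores spread as 16 per node (so 2 nodes) could be sketched as follows (queue, walltime, and program names are placeholders; it assumes an MPI implementation integrated with the scheduler):

```shell
#!/bin/sh
### Distributed-memory job: 32 MPI ranks, 16 per node -> 2 nodes (sketch)
#BSUB -q hpc
#BSUB -J mpi_job
#BSUB -n 32
#BSUB -R "span[ptile=16]"
#BSUB -W 12:00

# mpirun typically picks up the LSF allocation automatically
# (assumption: an MPI module integrated with the scheduler is loaded).
mpirun ./my_mpi_program   # hypothetical program name
```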
If the program requires GPUs, the job must reserve one or more GPUs, and it must be submitted to a queue with GPUs.
The syntax for reserving GPUs is:
#BSUB -gpu "num=1:mode=exclusive_process"
Always request the GPU with the clause mode=exclusive_process. Otherwise the GPU is shared with other users, and if it runs out of memory, all jobs of all users on that GPU will be killed.
To request more than one GPU, just change the number in num=1.
You can check how many GPUs a node has with the command
nodestat -g queue_name
Do not request more than 1 GPU if you are not sure that your program can actually make use of them.
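A minimal single-GPU jobscript could be sketched like this (the queue name is a placeholder for one of the cluster's GPU queues, and the program name is hypothetical):

```shell
#!/bin/sh
### Single-GPU job (sketch; queue name is an example GPU queue)
#BSUB -q gpuv100
#BSUB -J gpu_job
#BSUB -n 4
#BSUB -R "span[hosts=1]"
#BSUB -gpu "num=1:mode=exclusive_process"
#BSUB -W 4:00

./my_gpu_program   # hypothetical program name
```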
Total amount of RAM
It is necessary to specify the amount of RAM for the job; otherwise the job is assigned a default value (2 GB per core).
Start by overestimating the need by a reasonable amount, and then adjust the value for the next jobs based on the actual usage. For example, if the estimate is 32 GB, and you are requesting 1 core, use the lines:
#BSUB -n 1
#BSUB -R "rusage[mem=32GB]"
If instead you are requesting 8 cores, remember that the memory requirement is per core, so it should be:
#BSUB -n 8
#BSUB -R "rusage[mem=4GB]"
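The per-core value is just the total requirement divided by the number of cores; a small shell sketch of the arithmetic:

```shell
# Sketch: derive the per-core value from the total requirement
# (assumption, per the text above: rusage[mem] is interpreted per core).
TOTAL_GB=32   # total memory the job needs
NCORES=8      # cores requested with #BSUB -n
PER_CORE_GB=$(awk -v t="$TOTAL_GB" -v n="$NCORES" 'BEGIN { printf "%d", t/n }')
echo "Request rusage[mem=${PER_CORE_GB}GB] with -n ${NCORES}"
```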
One can always check how much memory is available on the nodes of a specific queue with the command
nodestat -F queue_name
Walltime
Different queues have different maximum walltimes. Try to estimate the walltime needed, with a reasonable margin. Asking for the maximum walltime when it is not necessary can lead to longer waiting times.
Remember that the actual runtime of the same job can differ if it runs on nodes with different CPU models.
When the job is running
We collected the commands to check the status of a job here. In this section we give some hints on how to use them to investigate the job performance. The critical metrics that need to be checked are:
- CPU core utilization: are all the cores used all of the time?
- Memory utilization: is the amount of memory used close to the amount of memory reserved?
- GPU utilization: are the GPUs used efficiently?
Poor utilization of the CPUs, GPUs, and memory is a waste of resources that affects all users.
We provide a few command-line tools to investigate the job efficiency; here we give some hints on how to use them.
CPU cores utilization
To monitor the CPU utilization one can use the command:
bstat -C job-id
where job-id is the id of the specific job.
Note: You have to wait at least a few minutes after the job started to have statistically significant measurements.
A good output could be
JOBID   USER     QUEUE  JOB_NAME  NALLOC  ELAPSED  EFFIC
112233  s123456  hpc    job_123   20      3:47:38  98.34
This shows that the job named job_123, which requested 20 cores, has an average efficiency of 98.34% after almost 4 hours. This means that the 20 cores are used almost at 100% on average.
A not-so-good output could be:
JOBID   USER     QUEUE  JOB_NAME  NALLOC  ELAPSED   EFFIC
112234  s123456  hpc    job_124   24      24:28:56  49.60
Here the job requested 24 cores, but is only using them at about 50% (EFFIC is 49.60). This can be because the user requested 24 cores but instructed the program to use only 12. Or the program may actually be using all 24 cores, but only at 50% on average, or 50% of the time. In either case, this job needs attention.
It is a good idea to do some extra tests in an interactive session, where users have direct access to debugging tools.
Note: a job with 100% efficiency can potentially hide an oversubscription problem, where the program runs more processes than the number of cores reserved. This is more difficult to debug on the cluster.
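In an interactive session, standard Linux tools already give a first overview; for example, ps can show the CPU usage and the number of threads (NLWP) of a process:

```shell
# Show CPU usage and thread count (NLWP) per process; here applied to the
# current shell as a harmless example. In a real session, replace $$ with
# the PID of your program (or list all your processes with:
#   ps -u $USER -o pid,nlwp,pcpu,comm).
ps -o pid,nlwp,pcpu,comm -p $$
```

A process with many more busy threads than the cores it was given is a hint of oversubscription.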
Memory utilization
Ideally a job should request enough memory for the execution, but not much more than that. The scheduler assumes that the job will need all of the requested memory, and will not schedule other jobs on the same machine. If the reserved memory is not used, the CPU cores on that machine that are not reserved by the same job will simply be wasted, and other users will have to wait.
For this reason, there is also a total memory limit enforced, so that a single user cannot use more than a maximum amount of RAM simultaneously in a cluster queue. Requesting memory and not using it is therefore always a bad idea.
To monitor a job's memory utilization while the job is running, use the command:
bstat -M job-id
where job-id is the id of the specific job.
A bad job looks like this:
JOBID   USER     QUEUE  JOB_NAME  NALLOC  MEM     MAX     AVG     LIM
112235  s123456  hpc    job_bad   8       804.0M  804.0M  804.0M  160.0G
Here one can see that the job has a peak memory usage of only 804 MB, but requested 160 GB, almost 200 times more.
GPU utilization
Debugging the GPU utilization in a batch job is not trivial. The best recommendation is to make some test runs in interactive sessions and check:
- that the GPU is actually used;
- if multiple GPUs are requested, that all the GPUs are used (not all programs support multiple GPUs, and when they do, it is not automatic);
- the efficiency of the GPU usage.
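On nodes with NVIDIA GPUs, a quick interactive check can be done with nvidia-smi, run on the GPU node while the program is running:

```shell
# One-off snapshot: utilization and memory usage per GPU
# (requires an NVIDIA GPU node, e.g. inside an interactive GPU job).
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv

# Refresh the full nvidia-smi view every 2 seconds (Ctrl-C to stop):
# watch -n 2 nvidia-smi
```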
After the job has finished
After the job terminates, the scheduler produces a report with some important pieces of information regarding the job. This can be used to evaluate if the original assumptions about the job requirements were adequate or not, and adjust them for future jobs.
Here is an example of a snippet from the last part of the report:
Resource usage summary:

    CPU time :                                   24524.00 sec.
    Max Memory :                                 2265 MB
    Average Memory :                             2145.12 MB
    Total Requested Memory :                     131072.00 MB
    Delta Memory :                               128807.00 MB
    Max Swap :                                   1 MB
    Max Processes :                              21
    Max Threads :                                57
    Run time :                                   3163 sec.
    Turnaround time :                            3100 sec.

Read file for stdout output of this job.
Read file for stderr output of this job.
The most relevant entries concern the efficiency, the memory usage, and hints at oversubscription.
The ratio CPU time / Run time gives the average number of cores that have been used during the job. This number should be as close as possible to the number of cores requested. In the current example it is 24524/3163 ≈ 7.8, which is fairly close to the 8 cores requested.
If the ratio is significantly smaller than the number of requested cores, then there are most likely scaling issues.
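The ratio can be computed directly from the report values, for example with awk:

```shell
# Average number of cores actually used, from the report values above.
CPU_TIME=24524   # "CPU time" in seconds
RUN_TIME=3163    # "Run time" in seconds
awk -v c="$CPU_TIME" -v r="$RUN_TIME" \
    'BEGIN { printf "%.1f cores used on average\n", c/r }'   # prints 7.8
```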
The memory usage is clearly reported. The entry Delta Memory shows how much excess memory has been requested: the difference between the Total Requested Memory and the Max Memory. This should ideally be minimized. The job in the example is a bad one, requesting 131072 MB and only using 2265 MB at peak.
Oversubscription
Oversubscription happens when the program starts more processes/threads than the number of physical CPU cores requested.
This is quite normal, and it is not a problem as long as the number of processes/threads that are actually doing computations is smaller than or equal to the number of physical cores.
But when a program, for example, requests a single CPU core but starts 24 computational threads, all those threads have to take turns running on the same core, slowing each other down significantly.
Unfortunately, this is not visible in the efficiency, because the efficiency will still be 100%. The report, however, contains Max Processes and Max Threads, and those can be used to get a hint about potential oversubscription.
If the number of processes/threads is very large compared to the number of requested cores, it could be a good idea to have a closer look at the program. But it is only a hint: complex programs like MATLAB, COMSOL, and ANSYS start a lot of auxiliary processes/threads.
On the other hand, programs/libraries that are known to be prone to oversubscription include, among others, Julia, Python's multiprocessing library, some R libraries, the CPLEX library, GAMS, and Gurobi.
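A common way to keep such programs within the allocation is to set the standard threading environment variables to the number of reserved cores. A sketch (LSB_DJOB_NUMPROC is set by LSF inside a job; the fallback to 1 is an assumption for running outside a job):

```shell
# Cap common threading backends to the LSF allocation.
NCORES=${LSB_DJOB_NUMPROC:-1}        # cores reserved by the job, or 1 outside a job
export OMP_NUM_THREADS=$NCORES       # OpenMP programs and many numerical libraries
export MKL_NUM_THREADS=$NCORES       # Intel MKL
export OPENBLAS_NUM_THREADS=$NCORES  # OpenBLAS
echo "Thread limits set to $NCORES"
```

Place these lines in the jobscript before starting the program, so that its libraries see the limits when they initialize.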