On this page we provide some best practices for jobs using GPUs, such as requesting the right resources, monitoring your jobs, etc.
Requesting CPU resources in a GPU job
An application using GPUs always starts on the host CPU, and during execution the CPU is “in charge” of launching the instructions (aka “kernels”) on the GPU and of moving data to and from the GPU. Many applications do this multi-threaded on the CPU, so it is important to assign a suitable number of CPU cores to your GPU job.
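If your application picks up its thread count from the environment (this is an assumption about your code, e.g. OpenMP-style threading), you can tie it to the number of cores requested for the job; a minimal sketch for the body of an LSF job script:

### LSB_DJOB_NUMPROC is set by LSF to the number of CPU cores allocated to the job
export OMP_NUM_THREADS=$LSB_DJOB_NUMPROC
### './my_gpu_application' is a placeholder for your own program
./my_gpu_application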
How many CPU cores per GPU?
The answer to this question depends on many parameters, and there is no ‘unique’ solution.
Let us first have a look at the maximum number of CPU cores per GPU. This number is implicitly given by the hardware of our GPU servers: the number of GPUs per server, and the number of CPU cores in the server. Most of our GPU servers have two GPUs and two multi-core CPUs, e.g. two Tesla V100 GPUs and two Xeon Gold 6242 CPUs (with 16 cores each). To be able to run two jobs on such a node, each using one GPU, each job should – as a rule of thumb – not request more than the number of cores of a single CPU, i.e. 16 cores in the example above.
For some of our servers we have more GPUs than CPUs, e.g. 4 GPUs and two CPUs; there, the maximum number of cores per GPU is half the cores of a single CPU (or a quarter of the total number of cores in the server).
Now, let us have a look at the minimum number of cores. This choice is somewhat arbitrary, based on experience and some other constraints in our cluster setup. The answer here is: 4! This is a ‘hard’ limit, and if you submit a GPU job that requests fewer than 4 cores/GPU, the job will be rejected. In some simple cases, the submit command will assign 4 cores/GPU to your job and submit it to the queue anyway, but it will also ask you to fix your script for future submissions!
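To illustrate how the rule scales with the number of GPUs, here is a sketch of the relevant job script lines (explained in more detail further down) for a hypothetical two-GPU job:

### sketch only: two GPUs on one node, so at least 2 x 4 = 8 CPU cores
#BSUB -gpu "num=2:mode=exclusive_process"
#BSUB -n 8
#BSUB -R "span[hosts=1]"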
Where do I get all the needed information from?
As we have seen above, you will sometimes need to look up the hardware configuration of our GPU servers, to be able to decide which resources to request. Here our nodestat command is the tool of choice. Let’s have a look at our Tesla V100 queue (‘gpuv100’):
$ nodestat -G gpuv100
Node         State     Procs   Load    GPUs   GPU model
n-62-11-13   Running   30:32   2.02    0:2    TeslaV100_PCIE_32GB
n-62-11-14   Idle      32:32   0.59    2:2    TeslaV100_PCIE_32GB
n-62-11-15   Idle      32:32   0.00    2:2    TeslaV100_PCIE_32GB
n-62-11-16   Running   30:32   2.00    0:2    TeslaV100_PCIE_32GB
n-62-20-10   Busy      0:32    10.42   0:4    TeslaV100_SXM2_32GB
n-62-20-11   Busy      32:32   0.00    0:4    TeslaV100_SXM2_32GB
n-62-20-12   Idle      32:32   0.00    4:4    TeslaV100_SXM2_32GB
n-62-20-13   Running   16:24   2.89    0:2    TeslaV100_PCIE_32GB
n-62-20-14   Idle      24:24   0.01    2:2    TeslaV100_PCIE_32GB
n-62-20-15   Idle      24:24   0.10    2:2    TeslaV100_PCIE_32GB
n-62-20-16   Idle      24:24   0.00    2:2    TeslaV100_PCIE_32GB
n-62-20-2    Running   16:24   3.46    0:2    TeslaV100_PCIE_16GB
n-62-20-3    Running   22:24   2.08    0:2    TeslaV100_PCIE_16GB
n-62-20-4    Running   19:24   1.98    0:2    TeslaV100_PCIE_16GB
n-62-20-5    Running   16:24   4.37    0:2    TeslaV100_PCIE_16GB
n-62-20-6    Running   16:24   2.38    0:2    TeslaV100_PCIE_16GB
The interesting columns in the output above are ‘Procs’, ‘GPUs’ and ‘GPU model’. The number of ‘available:installed’ CPU cores is given in the ‘Procs’ column, and the same holds for ‘available:installed’ GPUs in the ‘GPUs’ column. The last column shows the GPU model installed in the server. In the output above we can see that there are 3 different models: V100 with 16GB and 32GB GPU memory, connected via PCIe, and three nodes with V100 32GB that are connected via NVLink (fast GPU-to-GPU communication, here called SXM2).
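If you just want to spot nodes with free GPUs, you can also filter the output yourself; a small sketch, assuming the column layout shown above (column 5 holds the ‘available:installed’ GPU count):

### list only the gpuv100 nodes that currently have at least one free GPU
nodestat -G gpuv100 | awk 'NR > 1 && $5 !~ /^0:/'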
If you want to go for a single V100 32GB GPU for your job, using "select[gpu32gb]" in your script, then the recommended maximum number of cores would be either 16 (limiting your job to 4 possible nodes) or 12. The latter gives the scheduler more nodes to choose from when dispatching your job, as there are also 4 nodes with 24 CPU cores (12 per CPU).
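Translated into job script directives, such a request could look like the following sketch (here with 12 cores, for the reason given above):

### sketch: one 32GB V100, 12 CPU cores, all on one node
#BSUB -q gpuv100
#BSUB -gpu "num=1:mode=exclusive_process"
#BSUB -R "select[gpu32gb]"
#BSUB -n 12
#BSUB -R "span[hosts=1]"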
For the machines with the SXM2 model, there are 4 GPUs and 32 CPU cores. If you go for this model, the recommended maximum is 8 cores per GPU, to allow for the best utilization of these nodes.
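For the SXM2 nodes the same pattern applies, only with a different selection string and core count. A short sketch, in which the resource name sxm2 is an assumption that you should verify against the queue documentation:

### sketch: one V100 SXM2 GPU with 8 CPU cores (the resource name 'sxm2' is an assumption)
#BSUB -R "select[sxm2]"
#BSUB -n 8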
How many CPU cores should I request now?
As mentioned above, there is no ‘easy answer’ – but we have already made a choice for you: the minimum number has to be 4 cores/GPU. If you have no experience or don’t know whether this is enough, stick to this choice. This also gives the scheduler the biggest flexibility to find a slot for your GPU job(s).
If you are more experienced and/or know that you have specific requirements, maybe in connection with a specific GPU model, then go for a larger number of cores/GPU – but keep the recommendations from above in mind! As with every other multi-threaded program that runs on a multi-core CPU, there are effects that can slow you down if you assign too many cores to the job – see e.g. our page on “Scaling or not?” for this topic!
Parameters for a simple GPU job
### GPU and CPU related parameters for a simple GPU job script
### select a GPU queue
#BSUB -q GPUqueuename
### request the number of GPUs
#BSUB -gpu "num=1:mode=exclusive_process"
### request the number of CPU cores (at least 4x the number of GPUs)
#BSUB -n 4
### we want to have this on a single node
#BSUB -R "span[hosts=1]"
### we need to request CPU memory, too (note: this is per CPU core)
#BSUB -R "rusage[mem=8GB]"
The parameters above are a good start for a simple, single-GPU job script (the other parameters for batch jobs can be found on the general page about batch jobs, or on the GPU-specific pages).
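Put together with the most common batch parameters, a complete single-GPU job script could look like the sketch below; job name, wall-clock time, output file names, modules and the application name are placeholders that you will need to adapt:

#!/bin/sh
### sketch of a complete single-GPU job script; adapt queue, times, memory and names to your needs
#BSUB -q gpuv100
#BSUB -J my_gpu_job
#BSUB -gpu "num=1:mode=exclusive_process"
#BSUB -n 4
#BSUB -R "span[hosts=1]"
#BSUB -R "rusage[mem=8GB]"
### wall-clock time limit (hh:mm)
#BSUB -W 4:00
### output and error files (%J is replaced by the job ID)
#BSUB -o gpu_%J.out
#BSUB -e gpu_%J.err

### load the software your application needs (placeholder)
# module load cuda

### run the application (placeholder name)
./my_gpu_application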
How do I monitor my GPU jobs?
While CPU-only jobs can be monitored with the bstat command, which gives information about the whole lifetime of the job, GPU monitoring is currently only possible in ‘live’ mode, using the bnvtop command. This opens a top-like tool (based on nvtop), and you need to call it with the job ID of the GPU job you want to monitor:
bnvtop JOBID
The output will show the GPU usage in a graphical way, with both the memory and the compute usage of your job, plotted over a period of time.
- If both the memory and the compute usage are zero, then your job is not using the GPU at all, and you might need to check your code and/or job settings.
- If there is a memory footprint, but no compute usage, then please check whether your code actually makes use of the GPU.
- If there are long pauses between the compute activities, then you might need to optimize the code that is running on the CPU, to feed the GPU in a better way.
Please do not run bnvtop for a long time – it is meant as a tool to monitor the GPU for a short period of time (minutes), not longer. Thanks!
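A typical short monitoring session could then look like this sketch; the job ID is a placeholder taken from the output of bstat:

bstat                 # shows your running jobs and their job IDs
bnvtop 12345678       # replace with the job ID of your GPU job; quit again after a few minutes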