Job workflow, or how to set up and monitor a job


When using a cluster, submitting the job is not necessarily the last step.
Running jobs efficiently is in the interest of every user, because the results are obtained faster.
It is also in the interest of all the other users of the cluster: if the resources are used efficiently, more resources are available for everyone.

On this page we give some hints on what to do before the job submission, to prepare a jobscript that suits the user’s needs, and after the job submission, to make sure that the job is running as expected.

Before the job submission: estimating the resources needed

When writing a jobscript, the user has to decide how many resources to request. The most relevant are:

  • CPU cores/number of nodes
  • GPU 
  • total amount of RAM 
  • walltime

In an ideal world the user would know precisely what the program to run is capable of, and what resources it needs. In reality, finding this out is sometimes a matter of testing.

CPU cores/number of nodes

GPU

Total amount of RAM

Walltime

When the job is running

We collected the commands to check the status of a job here. In this section we want to provide some hints on how to use them to investigate the job performance. Three critical metrics that need to be checked are:

  • CPU core utilization: are all the cores used all of the time?
  • Memory utilization: is the amount of memory used close to the amount of memory reserved?
  • GPU utilization: are the GPUs used efficiently?

Poor utilization of the CPUs, GPUs, or memory wastes resources, and this affects all the users.

We provide a few command line tools to investigate the job efficiency. Here we give some hints on how to use them.

CPU core utilization

Memory utilization

GPU utilization

After the job has finished

After the job terminates, the scheduler produces a report with some important pieces of information regarding the job. This can be used to evaluate whether the original assumptions about the job requirements were adequate, and to adjust them for future jobs.

Here is an example of a snippet from the last part of the report:

Resource usage summary:

    CPU time :                                   24524.00 sec.
    Max Memory :                                 2265 MB
    Average Memory :                             2145.12 MB
    Total Requested Memory :                     131072.00 MB
    Delta Memory :                               128807.00 MB
    Max Swap :                                   1 MB
    Max Processes :                              21
    Max Threads :                                57
    Run time :                                   3163 sec.
    Turnaround time :                            3100 sec.

Read file  for stdout output of this job.
Read file  for stderr output of this job.

The report contains the information needed to assess the efficiency and the memory usage of the job, as well as hints at oversubscription.

Efficiency
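
As a rough illustration of how the efficiency can be estimated from the report, the sketch below reads the "CPU time" and "Run time" lines and computes the average number of cores kept busy. The report excerpt does not state how many cores were requested, so REQUESTED_CORES is an assumed value used only for illustration, and the helper names are hypothetical.

```python
import re

# Two lines copied from the report excerpt; in practice the text would be
# read from the job's output file.
REPORT = """\
    CPU time :                                   24524.00 sec.
    Run time :                                   3163 sec.
"""


def read_seconds(report: str, label: str) -> float:
    """Return the number of seconds reported on the line starting with `label`."""
    pattern = rf"^\s*{re.escape(label)}\s*:\s*([\d.]+)\s*sec"
    match = re.search(pattern, report, re.MULTILINE)
    if match is None:
        raise ValueError(f"{label!r} not found in report")
    return float(match.group(1))


def cpu_efficiency(cpu_time_s: float, run_time_s: float, cores: int) -> float:
    """Fraction of the reserved core time that was spent computing."""
    return cpu_time_s / (run_time_s * cores)


cpu_time = read_seconds(REPORT, "CPU time")
run_time = read_seconds(REPORT, "Run time")

# Average number of cores that were busy over the whole run.
busy_cores = cpu_time / run_time
print(f"average busy cores: {busy_cores:.2f}")

# Assumption for illustration only: the excerpt does not show the core count.
REQUESTED_CORES = 8
efficiency = cpu_efficiency(cpu_time, run_time, REQUESTED_CORES)
print(f"efficiency: {efficiency:.1%}")
```

A value close to 100% suggests that the requested cores were used most of the time. A much lower value suggests that some cores were idle, and that a smaller request may be enough for future jobs.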

Memory usage
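
The memory figures in the report can be compared directly. The sketch below uses the values from the example report to compute which fraction of the requested memory was used. The function names and the 20% safety margin are assumptions chosen for illustration, not a site recommendation.

```python
def memory_usage_fraction(max_memory_mb: float, requested_memory_mb: float) -> float:
    """Fraction of the requested memory that the job used at its peak."""
    return max_memory_mb / requested_memory_mb


def suggested_request_mb(max_memory_mb: float, safety_margin: float = 0.2) -> float:
    """Peak memory plus a safety margin (the 20% default is an assumption)."""
    return max_memory_mb * (1.0 + safety_margin)


# Values taken from the example report.
MAX_MEMORY_MB = 2265.0
REQUESTED_MEMORY_MB = 131072.0

fraction = memory_usage_fraction(MAX_MEMORY_MB, REQUESTED_MEMORY_MB)
unused_mb = REQUESTED_MEMORY_MB - MAX_MEMORY_MB  # the report's "Delta Memory"

print(f"used {fraction:.1%} of the requested memory")
print(f"unused: {unused_mb:.0f} MB")
print(f"possible request for a similar job: {suggested_request_mb(MAX_MEMORY_MB):.0f} MB")
```

In this example the job used under 2% of the memory it reserved, so a much smaller request would leave more memory available to other jobs.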

Oversubscription
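
One possible way to look for a hint of oversubscription is to compare the "Max Threads" value in the report with the number of cores requested. The sketch below does this under an assumed core count, since the report excerpt does not state it. The function names are hypothetical, and the result is only a hint: some threads may be idle helper threads that do not compete for cores.

```python
def threads_per_core(max_threads: int, requested_cores: int) -> float:
    """Peak number of threads per requested core."""
    return max_threads / requested_cores


def may_be_oversubscribed(max_threads: int, requested_cores: int) -> bool:
    """True when the job ran more threads than the cores it requested."""
    return max_threads > requested_cores


# Value taken from the example report.
MAX_THREADS = 57

# Assumption for illustration only: the excerpt does not show the core count.
REQUESTED_CORES = 8

ratio = threads_per_core(MAX_THREADS, REQUESTED_CORES)
print(f"threads per requested core: {ratio:.2f}")

if may_be_oversubscribed(MAX_THREADS, REQUESTED_CORES):
    print("More threads than cores: check how many threads the program starts.")
```

If the check fires, it is worth verifying how many threads and processes the program is configured to start, and whether that number matches the cores requested in the jobscript.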