Managing jobs under LSF 10


Submitting jobs and checking their status

The most important command is the one for submitting your job script to the queue:

bsub < your_job_script
Job <44951> is submitted to default queue .

The integer between the angle brackets is the job-id.
bsub also accepts many options on the command line, so in principle you could skip the script and run a single long command with all the options you need. This is possible, but not recommended.
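
Since most of the commands described below take a job-id, it can be convenient to capture it at submission time. The following is a minimal sketch, assuming the submission message has the form shown above. The variable names are illustrative, and the message is stored in a string so that the snippet runs without a scheduler.

```shell
# Submission message copied from the example above. In a real session it
# would be captured with: submit_msg=$(bsub < your_job_script)
submit_msg='Job <44951> is submitted to default queue .'

# Print only the digits found between the first pair of angle brackets.
job_id=$(printf '%s\n' "$submit_msg" | sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p')

echo "job-id: $job_id"
```

The captured value can then be passed to commands such as bstat or bkill.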

To see the available options, read the man page:

man bsub

Some of the options are implementation dependent, so not all of them may be available in the DTU/DCC installation.

Once you have submitted the job, you can track it using its job-id, which is also shown in the output of the bstat command:

bstat
JOBID      USER    QUEUE      JOB_NAME   NALLOC STAT  START_TIME       ELAPSED
49075      s012345  hpc        O-dir          1 RUN   Aug 22 13:53     0:08:23
49076      s012345  hpc        1-dir          4 RUN   Aug 22 13:53     0:14:20

You can see all your running jobs with their job-id (JOBID column), the user id (USER), the queue name (QUEUE), the job name (JOB_NAME), the number of slots (usually cores) requested (NALLOC), the status (STAT), the start time (START_TIME), and the elapsed time (ELAPSED). Some of the common status labels are:

PEND job is queued, waiting to be scheduled
RUN job is running
DONE job is completed after having run
EXIT job exited (after being killed, for example)
SSUSP job is suspended
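
When many jobs are listed, a per-status count gives a quick overview. This is a sketch only: it assumes that STAT is the sixth whitespace-separated column, as in the example above, and that job names contain no spaces. The variable names are illustrative.

```shell
# bstat output copied from the example above. In a real session use:
#   bstat_output=$(bstat)
bstat_output='JOBID      USER    QUEUE      JOB_NAME   NALLOC STAT  START_TIME       ELAPSED
49075      s012345  hpc        O-dir          1 RUN   Aug 22 13:53     0:08:23
49076      s012345  hpc        1-dir          4 RUN   Aug 22 13:53     0:14:20'

# Skip the header line, then count how often each value of column 6 (STAT) occurs.
status_counts=$(printf '%s\n' "$bstat_output" |
    awk 'NR > 1 { count[$6]++ } END { for (s in count) print s, count[s] }')

echo "$status_counts"
```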

bstat is a command that we provide in addition to the scheduler's native commands, to show the most relevant information in compact form.
You can list the recognized options, each with a short explanation, with

bstat -h

Usage: bstat [-h] [-v] [-C | -M] [-u uname] [-q qname] [jobID [jobId ...]]

    -h       : show this message
    -v       : show version history
    -C       : show CPU usage (running jobs, only)
    -M       : show memory usage (running jobs, only)
    -u uname : show information for user "uname" (-u all for all users)
    -q qname : show information for queue "qname"
    jobID    : jobID(s)

    Note: options [-C] and [-M] are mutually exclusive

To check the status of a specific job, use bstat followed by the job-id, for example:

bstat 45678
JOBID      USER    QUEUE      JOB_NAME   NALLOC STAT  START_TIME      ELAPSED
45678      s012345  hpc        Test_bis       1 RUN   Aug 19 12:34    0:12:05

Two useful options for bstat are -C and -M.
With -C you get some extra information on the efficiency of the run:

bstat -C
JOBID      USER    QUEUE      JOB_NAME   NALLOC     ELAPSED     EFFIC
44436      s012345 hpc        Test_2         24    64:24:03     99.78

The new column (EFFIC) shows the efficiency of your jobs as a percentage. The ideal value is 100%. Low values indicate that there may be a problem, either with your program or with the job script.
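
To spot inefficient jobs quickly, the output can be filtered on the EFFIC column. The sketch below assumes that EFFIC is the last column. The threshold of 50 is an arbitrary illustrative value, and the second data row is invented for the example; only the first row comes from the output above.

```shell
# First data row copied from the example above; the second row is invented
# purely for illustration. In a real session use: bstat -C
bstat_c_output='JOBID      USER    QUEUE      JOB_NAME   NALLOC     ELAPSED     EFFIC
44436      s012345 hpc        Test_2         24    64:24:03     99.78
44437      s012345 hpc        Test_3          8     1:10:00     12.50'

# Hypothetical threshold (percent) below which a job is reported.
threshold=50

# EFFIC is the last column; print job-id and efficiency for jobs below the threshold.
low_eff_jobs=$(printf '%s\n' "$bstat_c_output" |
    awk -v t="$threshold" 'NR > 1 && $NF + 0 < t + 0 { print $1, $NF }')

echo "$low_eff_jobs"
```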

With -M you get some extra information about the memory usage of your job:

bstat -M
JOBID      USER    QUEUE      JOB_NAME   NALLOC      MEM     MAX     AVG     LIM
44976      s012345 hpc        Test_2         24    18.6G   22.3G   20.1G   24.0G

This shows the memory usage of the job: the current memory usage (MEM), the maximum memory used so far (MAX), the average (AVG), and the limit above which the job will be killed (LIM).

    The value in the LIM column refers to the total amount of memory that the job may use. However, this limit is enforced on a per-host basis. For example, assume that the limit is 100 GB.
  • If your job is running on a single node, it will be killed if the total amount of memory used by all the job processes exceeds 100 GB.
  • If your job spans 2 nodes (with the same number of cores on each node), then the job will be killed when the memory used on one node exceeds 50 GB.
    More complex situations are possible, however.
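
The arithmetic behind the two cases above can be sketched as follows. This covers only the simple situation described in the text, with the same number of cores on every node; the function name per_host_limit is hypothetical.

```shell
# Per-host memory limit for a job spread evenly over several nodes.
# Assumption: the same number of cores on every node, as in the example above.
per_host_limit() {
    # $1 = total limit in GB (LIM column), $2 = number of nodes
    awk -v total="$1" -v nodes="$2" 'BEGIN { printf "%.1f\n", total / nodes }'
}

per_host_limit 100 1   # single node
per_host_limit 100 2   # two nodes with equal core counts
```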

It is sometimes necessary to remove a job from a queue. This can be done at any stage, i.e. while the job is still waiting to run (state PEND) or while it is running (state RUN). Just get the job-id with bjobs, and type

bkill <your-job-id>
  • Sometimes you need to send a specific signal to your program, for example to enable a clean shutdown. In that case, you can specify the signal (for example SIGTERM) with the -s option: bkill -s SIGTERM <your-job-id>
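
A clean shutdown only works if the program or job script actually handles the signal. The following is a minimal sketch of a handler in a shell script. It runs in a child shell that sends SIGTERM to itself, which stands in for the signal that bkill -s would deliver; the messages are illustrative.

```shell
# Run a small script in a child shell and collect what it prints.
result=$(sh <<'EOF'
# On SIGTERM, report and exit; a real job would save its state here.
trap 'echo "SIGTERM received, shutting down cleanly"; exit 0' TERM

# Stand-in for the signal sent by bkill: the child shell signals itself.
kill -TERM $$

# The handler exits the shell, so this line is not expected to run.
echo "not reached"
EOF
)

echo "$result"
```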

Some examples of the use of the native LSF commands to monitor jobs can be found here.

Additional useful commands

classstat:

shows the current status of the cluster(s): the total number of cores, the cores used and available, and the number of pending jobs waiting to be run. In addition, it shows the number of jobs migrated to and from other queues.

classstat hpc
queue                total  used avail  pend j-mig j-abs
--------------------------------------------------------
hpc                   1376   478   898   216     0     0
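
The share of the queue that is currently in use can be derived from the used and total columns. This is a sketch, assuming the column layout shown above (queue, total, used, ...) and that both columns count the same resource.

```shell
# classstat output copied from the example above. In a real session use:
#   classstat_output=$(classstat hpc)
classstat_output='queue                total  used avail  pend j-mig j-abs
--------------------------------------------------------
hpc                   1376   478   898   216   0     0'

# Skip the header and the separator line; column 2 is "total", column 3 is "used".
utilization=$(printf '%s\n' "$classstat_output" |
    awk 'NR > 2 { printf "%s %.1f%%\n", $1, 100 * $3 / $2 }')

echo "$utilization"
```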

nodestat:

shows the current status of the nodes in the clusters: the state, the number of cores available (free:total), and the current load. For example, for the hpc queue:

nodestat hpc
Node                   State   Procs    Load
n-62-13-9               Idle    8:8     0.30
n-62-21-10              Idle   24:24    0.00
n-62-21-100             Busy    0:24   24.00
n-62-21-101             Busy    0:24   24.00
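
The Procs column can be summed to get the number of free cores over the listed nodes. This is a sketch, assuming the free:total format shown above; the variable names are illustrative.

```shell
# nodestat output copied from the example above. In a real session use:
#   nodestat_output=$(nodestat hpc)
nodestat_output='Node                   State   Procs    Load
n-62-13-9               Idle    8:8     0.30
n-62-21-10              Idle   24:24    0.00
n-62-21-100             Busy    0:24   24.00
n-62-21-101             Busy    0:24   24.00'

# Column 3 (Procs) has the form free:total; add up the "free" part of every node.
free_cores=$(printf '%s\n' "$nodestat_output" |
    awk 'NR > 1 { split($3, procs, ":"); free += procs[1] } END { print free + 0 }')

echo "free cores: $free_cores"
```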

nodestat can also be used to see the machine “model” (CPU architecture), the memory installed on the machine, and other features, such as the supported instruction sets (option -F), which can be used in job scripts to select specific features:

nodestat -F hpc
Node               State   Procs    Load  Model             Memory  Feature
n-62-13-9           Idle    8:8     0.00  XeonX5550          24 GB  ()
n-62-21-10          Idle   24:24    0.00  XeonE5_2650v4     256 GB  (avx avx2)
n-62-21-100         Busy    0:24   24.06  XeonE5_2650v4     256 GB  (avx avx2)
n-62-21-101         Busy    0:24   24.00  XeonE5_2650v4     256 GB  (avx avx2)
...
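
To find idle nodes that offer a given feature, the Feature column can be filtered. This is a sketch, assuming that the feature list is the parenthesised text at the end of each line, as in the output above; the variable names are illustrative.

```shell
# nodestat -F output copied from the example above. In a real session use:
#   nodestat_f_output=$(nodestat -F hpc)
nodestat_f_output='Node               State   Procs    Load  Model             Memory  Feature
n-62-13-9           Idle    8:8     0.00  XeonX5550          24 GB  ()
n-62-21-10          Idle   24:24    0.00  XeonE5_2650v4     256 GB  (avx avx2)
n-62-21-100         Busy    0:24   24.06  XeonE5_2650v4     256 GB  (avx avx2)
n-62-21-101         Busy    0:24   24.00  XeonE5_2650v4     256 GB  (avx avx2)'

# Feature to look for.
wanted=avx2

# Print the names of idle nodes whose feature list contains the wanted feature.
idle_nodes=$(printf '%s\n' "$nodestat_f_output" |
    awk -v f="$wanted" 'NR > 1 && $2 == "Idle" {
        # Everything between the parentheses is the feature list.
        feats = $0
        sub(/^[^(]*\(/, "", feats); sub(/\).*$/, "", feats)
        n = split(feats, list, " ")
        for (i = 1; i <= n; i++) if (list[i] == f) print $1
    }')

echo "$idle_nodes"
```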

The command also shows details on the GPUs, if present on the machine (options -g and -G).

To get help:

nodestat -h

showstart:

It shows the estimated start time for pending jobs. If invoked without arguments, it provides the estimated start time for all of the user’s pending jobs.

showstart 758888
JOBID    USER     QUEUE    SUBMIT_TIME   ESTIMATED_SIM_START_T JOB_NAME       
758888   s012345  hpc      Jan 26 10:56  Jan 28 22:00:01 2018  Exp_2

Add a valid job-id to the command to see the estimate for the specific job only. Additional options can be found with

showstart -h

bhist:

displays historical information about jobs. It can be used to check on jobs even after they have finished, and it gives all the information about the job.

bhist -a

gives a compact list of all the user’s jobs.

bhist -t -T .-2,

gives info on the jobs that ran in the last 2 days.

bhist -l <jobid>
Job <jobid>, Job Name <my_job>,...
Thu Jul 30 16:05:18: Submitted from host , to Queue , CWD <$HOME/TEST_LSF/>, 
                     Output File (overwrite) , Error File (overwrite) , 8 Task(s), Requested Resources ;

 RUNLIMIT                
 150.0 min of hpclogin2
Thu Jul 30 16:05:18: Dispatched 8 Task(s) on Host(s) <4*n-62-13-1>...;
...
Thu Jul 30 16:15:58: Done successfully. The CPU time used is 5289.9 seconds;
                     HOST: n-62-13-1; CPU_TIME: 2706 seconds
                     HOST: n-62-13-2; CPU_TIME: 2540 seconds
Thu Jul 30 16:15:59: Post job process done successfully;

MEMORY USAGE:
MAX MEM: 174 Mbytes;  AVG MEM: 159 Mbytes
... 

gives info on the specified job.
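
The long listing is verbose, so it can help to pull out only a few values. The sketch below extracts the peak memory and the CPU time per host. It assumes the line formats shown in the listing above and uses a few lines copied from it; the variable names are illustrative.

```shell
# A few lines copied from the bhist -l listing above. In a real session use:
#   bhist_output=$(bhist -l <jobid>)
bhist_output='Thu Jul 30 16:15:58: Done successfully. The CPU time used is 5289.9 seconds;
                     HOST: n-62-13-1; CPU_TIME: 2706 seconds
                     HOST: n-62-13-2; CPU_TIME: 2540 seconds
MEMORY USAGE:
MAX MEM: 174 Mbytes;  AVG MEM: 159 Mbytes'

# Peak memory in Mbytes, taken from the "MAX MEM:" line.
max_mem=$(printf '%s\n' "$bhist_output" |
    sed -n 's/^MAX MEM: \([0-9][0-9]*\) Mbytes.*/\1/p')

# Host name and CPU time in seconds, one host per line.
host_cpu=$(printf '%s\n' "$bhist_output" |
    awk '$1 == "HOST:" { sub(/;$/, "", $2); print $2, $4 }')

echo "max memory: $max_mem Mbytes"
echo "$host_cpu"
```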

bpeek:

displays the standard output and standard error output produced by unfinished jobs, up to the time that it is invoked.

bpeek <jobid>
<< output from stdout >>
.......
<< output from stderr >>
.....

bacct:

displays a summary of accounting statistics for all finished jobs (with a DONE or EXIT status) submitted by the user. It is most useful for getting a report on a specific job:

bacct <jobid>
Accounting information about jobs that are: 
  - submitted by all users.
  - accounted on all projects.
  - completed normally or exited
  - executed on all hosts.
  - submitted to all queues.
  - accounted on all service classes.
------------------------------------------------------------------------------

SUMMARY:      ( time unit: second ) 
 Total number of done jobs:       0      Total number of exited jobs:     1
 Total CPU time consumed:   68852.8      Average CPU time consumed: 68852.8
 Maximum CPU time of a job: 68852.8      Minimum CPU time of a job: 68852.8
 Total wait time in queues:     1.0
 Average wait time in queue:    1.0
 Maximum wait time in queue:    1.0      Minimum wait time in queue:    1.0
 Average turnaround time:     66182 (seconds/job)
 Maximum turnaround time:     66182      Minimum turnaround time:     66182
 Average hog factor of a job:  1.04 ( cpu time / turnaround time )
 Maximum hog factor of a job:  1.04      Minimum hog factor of a job:  1.04
 Average expansion factor of a job:  1.00 ( turnaround time / run time )
 Maximum expansion factor of a job:  1.00
 Minimum expansion factor of a job:  1.00
 Total Run time consumed:     66181      Average Run time consumed:   66181
 Maximum Run time of a job:   66181      Minimum Run time of a job:   66181
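
The derived factors in the summary can be checked by hand from the definitions that bacct prints next to them (hog factor = cpu time / turnaround time, expansion factor = turnaround time / run time). A small sketch using the numbers from the report above; the variable names are illustrative.

```shell
# Values copied from the bacct summary above (all in seconds).
cpu_time=68852.8      # Total CPU time consumed
turnaround=66182      # Average turnaround time
run_time=66181        # Total Run time consumed

# hog factor = cpu time / turnaround time
hog_factor=$(awk -v c="$cpu_time" -v t="$turnaround" 'BEGIN { printf "%.2f", c / t }')

# expansion factor = turnaround time / run time
expansion_factor=$(awk -v t="$turnaround" -v r="$run_time" 'BEGIN { printf "%.2f", t / r }')

echo "hog factor: $hog_factor"
echo "expansion factor: $expansion_factor"
```

Both results agree with the values in the report (1.04 and 1.00).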