Monitoring jobs in LSF 10: advanced


The native LSF command for getting information on jobs that are running, pending, or recently finished (within the last hour) is bjobs. When called without any option, it prints a short summary of all the user’s running and pending jobs.

bjobs
JOBID     USER    QUEUE      JOB_NAME   SLOTS STAT  START_TIME   TIME_LEFT
45755    s012345   hpc        large_Set      4 RUN   Aug 21 11:00 00:14:41 L 
45756    s012345   hpc        *3_LSF10_1     2 RUN   Aug 21 11:00 24:00:00 L 

For each of your jobs, the output shows the job ID (JOBID column), the username (USER column), the queue the job was submitted to (QUEUE column), the job name (JOB_NAME column), the number of slots used (SLOTS column, for running jobs), the job status (STAT column), the start time (START_TIME column) and the estimated time left until completion (TIME_LEFT column).
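As a small illustration of working with these columns, the sketch below prints the job ID and the time left for every running job. It assumes the column layout shown above (STAT as the sixth whitespace-separated field, and a start time made of three fields), so it would not work for job names containing spaces or for a differently configured bjobs output. The function name running_time_left is made up for this example.

```shell
# Print "JOBID TIME_LEFT" for each running job in bjobs output read from stdin.
# Assumption: columns are laid out as in the listing above, so STAT is field 6
# and TIME_LEFT is field 10 (START_TIME occupies fields 7 to 9).
running_time_left() {
    awk 'NR > 1 && $6 == "RUN" { print $1, $10 }'
}
```

It could be used as: bjobs | running_time_left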
Some of the common status labels are:

PEND job is queued, waiting to be scheduled
RUN job is running
DONE job is completed after having run
EXIT job exited with a non-zero exit status (after being killed, for example)
SSUSP job is suspended by the system
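The status labels can also be used to summarise a longer job list. The following is a minimal sketch, again assuming that STAT is the sixth column of the short bjobs output; the function names describe_status and count_by_status are hypothetical helpers, not part of LSF.

```shell
# Translate one of the status labels listed above into a short description.
describe_status() {
    case "$1" in
        PEND)  echo "queued, waiting to be scheduled" ;;
        RUN)   echo "running" ;;
        DONE)  echo "completed after having run" ;;
        EXIT)  echo "exited (for example after being killed)" ;;
        SSUSP) echo "suspended" ;;
        *)     echo "status not covered by this list" ;;
    esac
}

# Count jobs per status in bjobs output read from stdin.
# Assumption: STAT is the 6th whitespace-separated field; line 1 is the header.
count_by_status() {
    awk 'NR > 1 { n[$6]++ } END { for (s in n) print s, n[s] }' | sort
}
```

For example, bjobs | count_by_status prints one line per status with the number of jobs in that state.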

bjobs can be used to extract a lot more information about your job. For a list of the possible options, type

man bjobs

To get more information about a specific job, use the bjobs command with the job ID, optionally with the -l option if you want verbose output:

bjobs -l <your-job-id>
bjobs -l 45800

Job <45800>, Job Name <Com_wrench_HYBRID>, User <s012345>, Project <default>, Se
                          rvice Class <sc_hpc1>, Mail <s012345@student.dtu.dk>,
                          Status <RUN>, Queue <hpc>, Command <#!/bin/sh;# embedd
                          ed options to bsub -  #BSUB;# -- job name ---;#BSUB -J
                           Com_wrench_HYBRID;# -- email me at the beginning
                           (b) and end (e) of the execution --;#BSUB -B -N;# --
                           Select queue --;#BSUB -q hpc;# -- My email address -
                          -;#BSUB -u andbor@dtu.dk;# -- estimated wall clock ti
                          me (execution time) --;#BSUB -W  4:00;##BSUB -env "PA
                          TH";# -- parallel environment requests --;#BSUB -n 2;
                          #BSUB -M 5000;#BSUB -R "affinity[core(8)]";### -- Spe
                          cify the output and error file. %J is the job-id -- ;
                          ### -- -o and -e mean append, -oo and -eo mean overwr
                          ite -- ;#BSUB -o Output_%J.out ;#BSUB -e Error_%J.err
                           ; # -- end of LSF options --;  I_MPI_HYDRA_BOOTSTRAP
                          =lsf;I_MPI_DEBUG=0; export I_MPI_HYDRA_BOOTSTRAP I_MP
                          I_DEBUG ;  Num=$OMP_NUM_THREADS; Nodes=$LSB_MAX_NUM_P
                          ROCESSORS;  comsol52 -nn $Nodes -np $Num batch -input
                          file wrench.mph -outputfile wrench_out.mph -tmpdir tm
                          p/ -recoverydir tmp>, Share group charged </userlongs
                          erial>
Mon Aug 21 17:30:17 2017: Submitted from host <hpclogin2>, CWD <$HOME/LSF10/TES
                          TS_Applic_7.3/COMSOL/HYBRID>, Output File <Output_458
                          00.out>, Error File <Error_45800.err>, Notify when jo
                          b begins/ends, 2 Task(s), Requested Resources <affini
                          ty[core(8)] rusage[mem=5000] span[hosts=1]>;

 RUNLIMIT                
 240.0 min of n-62-21-105

 MEMLIMIT
    4.8 G 
Mon Aug 21 17:30:17 2017: Started 2 Task(s) on Host(s) <2*n-62-21-105>, Allocat
                          ed 16 Slot(s) on Host(s) <16*n-62-21-105>, Execution 
                          Home </zhome/xx/x/xxxxx>, Execution CWD </zhome/xx/x/
                          xxxxx/LSF10/TESTS_Applic_7.3/COMSOL/HYBRID>;
Mon Aug 21 17:32:03 2017: Resource usage collected.
                          The CPU time used is 279 seconds.
                          MEM: 1.8 Gbytes;  SWAP: 0 Mbytes;  NTHREAD: 62
                          PGID: 15708;  PIDs: 15708 15714 15718 15855 15877 
                          PGID: 15880;  PIDs: 15880 
                          PGID: 15885;  PIDs: 15885 
                          PGID: 15886;  PIDs: 15886 


 MEMORY USAGE:
 MAX MEM: 1.8 Gbytes;  AVG MEM: 1 Gbytes

 PENDING TIME DETAILS:
 Eligible pending time (seconds):       0
 Ineligible pending time (seconds):     0

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  

 RESOURCE REQUIREMENT DETAILS:
 Combined: select[(model = XeonX5550 ||model = XeonE5_2680v2 ||model = XeonE5_2
                          660v3 ||model = XeonE5_2650v4 ) && (type == any)] ord
                          er[-slots:-maxslots] rusage[mem=5000.00] span[hosts=1
                          ] same[type:model] cu[pref=config:maxcus=1:type=rack]
                           affinity[core(8)*1]
 Effective: select[((model = XeonX5550 ||model = XeonE5_2680v2 ||model = XeonE5
                          _2660v3 ||model = XeonE5_2650v4 ) && (type == any))] 
                          order[-slots:-maxslots] rusage[mem=5000.00] span[host
                          s=1] same[type:model] cu[type=rack:maxcus=1:pref=conf
                          ig] affinity[core(8)*1]

The output shows your original job script (in a rather messy, line-wrapped form) together with a lot of other information, such as:
the walltime limit of your job (RUNLIMIT), and the submission and start timestamps, from which you can tell how much time has passed;
the number of processors you actually asked for (Task(s));
the memory used (MEMORY USAGE).
This information is useful for checking whether your job script was correct.
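Because the verbose output is long, it can help to pull out only the few values mentioned above. The sketch below is one possible way to do that, assuming the section labels and layout of the listing shown here (the value of RUNLIMIT and MEMLIMIT on the line after the label, and a MAX MEM entry in the MEMORY USAGE section). The function name job_limits_summary is invented for this example, and other LSF configurations may format these sections differently.

```shell
# Read "bjobs -l" output from stdin and print the run limit, the memory limit
# and the peak memory use. Assumes the layout of the sample listing above.
job_limits_summary() {
    awk '
        # The values for these two labels are printed on the following line.
        /^ *RUNLIMIT/ { want = "runlimit"; next }
        /^ *MEMLIMIT/ { want = "memlimit"; next }
        want != ""    { print want ": " $1 " " $2; want = "" }
        # Keep only the text between "MAX MEM:" and the next semicolon.
        /MAX MEM:/    { sub(/.*MAX MEM: */, ""); sub(/;.*/, ""); print "max_mem: " $0 }
    '
}
```

It could be used as: bjobs -l 45800 | job_limits_summary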

Note: you can also get information on a job that is still waiting to be dispatched. In that case the output contains a section giving the reason for the pending status:

PENDING REASONS:
 User has reached the per-user job slot limit of the queue;
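To show only this section of the verbose output, a filter along the following lines might be used. It is a sketch that assumes the PENDING REASONS section ends at the first blank line, as in typical bjobs -l listings; the function name pending_reasons is hypothetical.

```shell
# Read "bjobs -l" output from stdin and print only the pending reasons.
# Assumption: the section starts at "PENDING REASONS:" and ends at a blank line.
pending_reasons() {
    awk '
        /^ *PENDING REASONS:/ { in_section = 1; next }
        in_section && NF == 0 { in_section = 0 }
        in_section            { sub(/^ +/, ""); print }
    '
}
```

It could be used as: bjobs -l <your-job-id> | pending_reasons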