Submitting and checking job status
The most important command is the one for submitting your job script to the queue:
bsub < your_job_script
Job <44951> is submitted to default queue <hpc>.
The integer number between the brackets is the job-id.
bsub also accepts many options, so in principle you could avoid using a script and run a single long command with all the options you need. This is possible, but not recommended.
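Instead, the options are usually given as #BSUB lines at the top of the job script, which bsub reads when the script is submitted with "bsub < your_job_script". A minimal sketch is shown below; all names and resource amounts are placeholders, and the #BSUB options you actually need depend on your application:

#!/bin/sh
### General options (placeholder values)
#BSUB -q hpc                  # queue name
#BSUB -J my_job               # job name
#BSUB -n 4                    # number of cores (slots)
#BSUB -W 1:00                 # walltime limit, hh:mm
#BSUB -R "rusage[mem=2GB]"    # memory request (units/semantics depend on the installation)
#BSUB -o my_job_%J.out        # standard output file (%J expands to the job-id)
#BSUB -e my_job_%J.err        # standard error file

# commands to run (placeholder program name)
./my_program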
For checking the available options, just read the man pages:
man bsub
Some of the options are implementation dependent, and therefore may not all be available in the DTU/DCC installation.
Once you submit the job, you can still track it. The job is given a job-id, which is shown in the output of the bstat command:
bstat
JOBID    USER      QUEUE   JOB_NAME   NALLOC  STAT   START_TIME     ELAPSED
49075    s012345   hpc     O-dir           1  RUN    Aug 22 13:53   0:08:23
49076    s012345   hpc     1-dir           4  RUN    Aug 22 13:53   0:14:20
You can see all your running jobs with their job-id (JOBID column), the user id (USER), the queue name (QUEUE), the job name (JOB_NAME), the number of slots (usually cores) allocated (NALLOC), the status (STAT), the start time (START_TIME), and the elapsed time (ELAPSED). Some of the common status labels are:
PEND job is queued, waiting to be scheduled
RUN job is running
DONE job is completed after having run
EXIT job exited (after being killed, for example)
SSUSP job is suspended
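If a job stays in the PEND state longer than expected, the native LSF command bjobs can display the pending reason, for example:

bjobs -p <your-job-id>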
bstat is a command that we provide in addition to the scheduler's native commands, in order to get the most relevant information in compact form.
You can see the recognized options, with a short explanation, by running:
bstat -h
Usage: bstat [-h] [-v] [-C | -M] [-u uname] [-q qname] [jobID [jobId ...]]
    -h       : show this message
    -v       : show version history
    -C       : show CPU usage (running jobs, only)
    -M       : show memory usage (running jobs, only)
    -u uname : show information for user "uname" (-u all for all users)
    -q qname : show information for queue "qname"
    jobID    : jobID(s)
Note: options [-C] and [-M] are mutually exclusive
To check the status of a specific job, use bstat followed by the job-id, for example:
bstat 45678
JOBID    USER      QUEUE   JOB_NAME   NALLOC  STAT   START_TIME     ELAPSED
45678    s012345   hpc     Test_bis        1  RUN    Aug 19 12:34   0:12:05
Two useful options for bstat are -C and -M.

With -C you get some extra information on the efficiency of the run:
bstat -C
JOBID    USER      QUEUE   JOB_NAME   NALLOC  ELAPSED    EFFIC
44436    s012345   hpc     Test_2         24  64:24:03   99.78
The new column (EFFIC) shows the efficiency of your jobs. The ideal value is 100%. Low values mean that there could be significant issues, either with your program or with the job script.
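A typical (hypothetical) example of such an issue is a mismatch between the cores requested and the cores actually used; the program name below is a placeholder:

#BSUB -n 24              # 24 cores allocated to the job...
# ...
./my_serial_program      # ...but a serial program keeps only one of them busy,
                         # so EFFIC would stay around 1/24, i.e. roughly 4%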
With -M you get some extra information about the memory usage of your job:
bstat -M
JOBID    USER      QUEUE   JOB_NAME   NALLOC  MEM     MAX     AVG     LIM
44976    s012345   hpc     Test_2         24  18.6G   22.3G   20.1G   24.0G
This shows information about the memory usage of the job: the current memory usage (MEM), the maximum memory used so far (MAX), the average (AVG), and the limit (LIM) above which the job will be killed.
The value in the LIM column refers to the total amount of memory used by the job. However, this limit is enforced on a per host basis. For example, let us assume that the limit is 100 GB.
- If your job is running on a single node, it will be killed if the total amount of memory used by all the job processes exceeds 100 GB.
- If your job spans 2 nodes (with the same number of cores on each node), then the job will be killed when the memory used on one node exceeds 50 GB.
There can be more complex situations, however.
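As an illustrative sketch (assuming the memory request is interpreted per core/slot, which depends on the cluster configuration; the numbers are placeholders), a request leading to the 100 GB total / 50 GB per node situation above could look like this:

#BSUB -n 20                    # 20 cores in total
#BSUB -R "span[ptile=10]"      # 10 cores per node -> the job spans 2 nodes
#BSUB -R "rusage[mem=5GB]"     # 5 GB per core -> 50 GB per node, 100 GB in total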
It is sometimes necessary to remove a job from a queue. This can be done at any stage, i.e. while the job is still waiting to run (state PEND), or while it is running (state RUN). Just get the job-id with bjobs, and type
bkill <your-job-id>
- Sometimes you need to send a specific signal to your program, for example to enable a clean shutdown. In that case, you can send the specific signal (for example SIGTERM) with:
bkill -s SIGTERM <your-job-id>
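For this to result in a clean shutdown, the job has to react to the signal. A minimal bash sketch is shown below; the program name and the cleanup action are placeholders:

#!/bin/bash
# (job script body; #BSUB options omitted for brevity)

cleanup () {
    # placeholder cleanup action, e.g. save partial results before exiting
    echo "Caught SIGTERM, shutting down cleanly" >&2
    exit 143    # 128 + 15 (SIGTERM)
}
trap cleanup TERM

./my_program &    # placeholder program, started in the background...
wait $!           # ...so that the shell can handle the signal while waiting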
Some examples of the usage of the native LSF commands to monitor jobs can be found here.
Additional useful commands
classstat:
shows the current status of the cluster(s): the total number of cores, the number of cores used and available, and the number of pending jobs waiting to be run. In addition, it shows the number of jobs migrated to and from other queues.
classstat hpc
 queue             total    used   avail    pend   j-mig   j-abs
--------------------------------------------------------
 hpc                1376     478     898     216       0       0
nodestat:
shows the current status of the nodes in the cluster: the state, the number of cores available (free:total), and the current load. For example, for the hpc queue:
nodestat hpc
Node           State   Procs    Load
n-62-13-9      Idle    8:8      0.30
n-62-21-10     Idle    24:24    0.00
n-62-21-100    Busy    0:24     24.00
n-62-21-101    Busy    0:24     24.00
It can also be used to see the machine "model" (CPU architecture), the memory installed on the machine, and other features, like the supported instruction sets (option -F), which can be used in job scripts to select specific features:
nodestat -F hpc
Node           State   Procs    Load    Model           Memory    Feature
n-62-13-9      Idle    8:8      0.00    XeonX5550       24 GB     ()
n-62-21-10     Idle    24:24    0.00    XeonE5_2650v4   256 GB    (avx avx2)
n-62-21-100    Busy    0:24     24.06   XeonE5_2650v4   256 GB    (avx avx2)
n-62-21-101    Busy    0:24     24.00   XeonE5_2650v4   256 GB    (avx avx2)
...
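For example (assuming the features listed by nodestat -F are exposed to the scheduler as boolean resources, which depends on the local configuration), a job script could select machines by model or by feature:

#BSUB -R "select[model == XeonE5_2650v4]"   # request a specific CPU model
#BSUB -R "select[avx2]"                     # hypothetical: request machines with the avx2 feature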
The command also shows details on the GPUs, if present on the machine (options -g and -G).
For getting help:
nodestat -h
showstart:
It shows the estimated start time of pending jobs. If invoked without arguments, it provides the estimated start time for all of the user's pending jobs.
showstart 758888
JOBID    USER      QUEUE   SUBMIT_TIME    ESTIMATED_SIM_START_T     JOB_NAME
758888   s012345   hpc     Jan 26 10:56   Jan 28 22:00:01 2018      Exp_2
Add a valid job-id to the command to see the estimate for the specific job only. Additional options can be found with
showstart -h
bhist:
displays historical information about jobs. It can be used to check jobs even after they have finished, and gives all the available information about a job.
bhist -a
gives a compact list of all the user’s jobs.
bhist -t -T .-2,
gives info on the jobs that ran in the last 2 days.
bhist -l <jobid>

Job <jobid>, Job Name <my_job>, ...
Thu Jul 30 16:05:18: Submitted from host , to Queue , CWD <$HOME/TEST_LSF/>,
                     Output File (overwrite) , Error File (overwrite) ,
                     8 Task(s), Requested Resources ;
                     RUNLIMIT
                     150.0 min of hpclogin2
Thu Jul 30 16:05:18: Dispatched 8 Task(s) on Host(s) <4*n-62-13-1>...;
...
Thu Jul 30 16:15:58: Done successfully. The CPU time used is 5289.9 seconds;
                     HOST: n-62-13-1; CPU_TIME: 2706 seconds
                     HOST: n-62-13-2; CPU_TIME: 2540 seconds
Thu Jul 30 16:15:59: Post job process done successfully;

MEMORY USAGE:
MAX MEM: 174 Mbytes;  AVG MEM: 159 Mbytes
...
gives info on the specified job.
bpeek:
displays the standard output and standard error output produced by unfinished jobs, up to the time that it is invoked.
bpeek <jobid>
<< output from stdout >>
.......
<< output from stderr >>
.....
bacct:
displays a summary of accounting statistics for all finished jobs (with a DONE or EXIT status) submitted by the user. It is most useful for getting report information on a specific job:
bacct <jobid>

Accounting information about jobs that are:
  - submitted by all users.
  - accounted on all projects.
  - completed normally or exited
  - executed on all hosts.
  - submitted to all queues.
  - accounted on all service classes.
------------------------------------------------------------------------------
SUMMARY:      ( time unit: second )
 Total number of done jobs:        0
 Total number of exited jobs:      1
 Total CPU time consumed:    68852.8
 Average CPU time consumed:  68852.8
 Maximum CPU time of a job:  68852.8
 Minimum CPU time of a job:  68852.8
 Total wait time in queues:      1.0
 Average wait time in queue:     1.0
 Maximum wait time in queue:     1.0
 Minimum wait time in queue:     1.0
 Average turnaround time:      66182 (seconds/job)
 Maximum turnaround time:      66182
 Minimum turnaround time:      66182
 Average hog factor of a job:   1.04 ( cpu time / turnaround time )
 Maximum hog factor of a job:   1.04
 Minimum hog factor of a job:   1.04
 Average expansion factor of a job:  1.00 ( turnaround time / run time )
 Maximum expansion factor of a job:  1.00
 Minimum expansion factor of a job:  1.00
 Total Run time consumed:      66181
 Average Run time consumed:    66181
 Maximum Run time of a job:    66181
 Minimum Run time of a job:    66181