Remove hung jobs from LSF

About this task

A dispatched job becomes hung if its execution host (or the first execution host for parallel jobs) enters the unreach or unavail state. For a job with a run limit, LSF also considers the job hung if the run limit expires and mbatchd signals sbatchd to kill the job, but sbatchd cannot kill it.

While a job is hung, any resources it holds on other hosts are unavailable to other pending jobs, which results in poor utilization of cluster resources. You can remove hung jobs manually with bkill -r, but this requires LSF administrators to actively monitor for jobs in the UNKNOWN state. Instead of removing jobs manually or waiting for the hosts to come back, LSF can automatically terminate a hung job once a timeout is reached. After removing the job, LSF moves it to the EXIT state to free up resources for other workload, and logs a message in the mbatchd log file.

Jobs with a runlimit specified may hang for the following reasons:

Host status is unreach: sbatchd on the execution host (or first execution host for parallel jobs) is down.

Jobs running on an execution host when sbatchd goes down go into the UNKNOWN state. These UNKNOWN jobs continue to occupy shared resources, making the shared resources unavailable for other jobs.
Host status is unavail: both sbatchd and LIM on the execution host (or first execution host for parallel jobs) are down. Jobs running on an execution host when sbatchd and LIM go down go into the UNKNOWN state.
Reasons specific to the operating system on the execution host.

Jobs that cannot be killed due to an issue with the operating system remain in the RUN state even after the run limit has expired.

To enable hung job management, set the REMOVE_HUNG_JOBS_FOR parameter in lsb.params. When REMOVE_HUNG_JOBS_FOR is set, LSF automatically removes hung jobs and frees host resources for other workload. An optional timeout can also be specified for hung job removal. Hung jobs are removed under the following conditions:

HOST_UNAVAIL: Hung jobs are automatically removed if the first execution host is unavailable and a timeout is reached as specified by wait_time in the parameter configuration. The default value of wait_time is 10 minutes.

Hung jobs of any status are candidates for removal by LSF when the timeout is reached.
runlimit: Hung jobs are removed after the job's run limit is reached. You can use the wait_time option to specify a timeout for removal after the run limit is reached. The default value of wait_time is 10 minutes. For example, if REMOVE_HUNG_JOBS_FOR is defined with runlimit, wait_time=5 and JOB_TERMINATE_INTERVAL is not set, mbatchd removes the job 5 minutes after the job's run limit is reached.

Hung jobs in RUN status are considered for removal once the run limit plus wait_time has elapsed.

For backwards compatibility with earlier versions of LSF, REMOVE_HUNG_JOBS_FOR = runlimit is handled as before: the grace period is 10 minutes + MAX(6 seconds, JOB_TERMINATE_INTERVAL), where JOB_TERMINATE_INTERVAL is specified in lsb.params. The grace period begins only once the job's run limit has been reached.
ALL: Specifies hung job removal for all conditions (both runlimit and host_unavail). The hung job is removed when the first condition is satisfied. For example, if a job has a run limit but becomes hung because a host is unavailable before the run limit is reached, the job (whether running, suspended, and so on) is removed 10 minutes after the host becomes unavailable. mbatchd places the job in EXIT status.
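
The timing rules above can be illustrated with a short Python sketch. It is not part of LSF. The function names are hypothetical, all times are plain seconds, and a JOB_TERMINATE_INTERVAL of 0 is assumed to stand for "not set". It models only the cases the text states: the legacy grace period for a plain runlimit setting, the wait_time delay for each condition, and the "first condition satisfied" rule for ALL.

```python
from typing import Optional

DEFAULT_WAIT_MINUTES = 10  # default wait_time stated in the text


def legacy_runlimit_grace_seconds(job_terminate_interval: int = 0) -> int:
    """Grace period for a plain `REMOVE_HUNG_JOBS_FOR = runlimit` setting.

    Implements 10 minutes + MAX(6 seconds, JOB_TERMINATE_INTERVAL).
    Passing 0 stands for "JOB_TERMINATE_INTERVAL not set" (an assumption
    of this sketch, not documented LSF behavior).
    """
    return DEFAULT_WAIT_MINUTES * 60 + max(6, job_terminate_interval)


def runlimit_removal_time(start: float, run_limit: float, grace: float) -> float:
    """Earliest removal time under the runlimit condition.

    The grace period starts only once the run limit has been reached.
    """
    return start + run_limit + grace


def host_unavail_removal_time(unavailable_at: float,
                              wait_minutes: int = DEFAULT_WAIT_MINUTES) -> float:
    """Earliest removal time under the host_unavail condition."""
    return unavailable_at + wait_minutes * 60


def first_removal_time(*candidates: Optional[float]) -> Optional[float]:
    """With ALL, the job is removed when the first condition is satisfied."""
    times = [t for t in candidates if t is not None]
    return min(times) if times else None


# Scenario from the ALL example: the job has a 1-hour run limit, but its
# host becomes unavailable 10 minutes after the job starts.
by_runlimit = runlimit_removal_time(0, 3600, DEFAULT_WAIT_MINUTES * 60)
by_host = host_unavail_removal_time(600)
print(first_removal_time(by_runlimit, by_host))  # the host_unavail time is earlier
```

In this scenario the host_unavail condition is met first, so the job is removed 10 minutes after the host becomes unavailable, well before the run limit would have expired.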

The output for hung job removal can be shown with the bhist command. For example:

Job <5293>, User <user1>, Project <default>, Job Group </default/user1>,
                          Command <sleep 1000>
Tue May 21 00:59:43 2013: Submitted from host <hostA>, to Queue <normal>, CWD
                          <$HOME>, Specified Hosts <abc210>;
Tue May 21 00:59:44 2013: Dispatched to <abc210>, Effective RES_REQ <select
                          [type == any] order[r15s:pg] >;
Tue May 21 00:59:44 2013: Starting (Pid 27216);
Tue May 21 00:59:49 2013: Running with execution home </home/user1>, Execution
                          CWD </home/user1>, Execution Pid <27216>;
Tue May 21 01:05:59 2013: Unknown; unable to reach the execution host;
Tue May 21 01:10:59 2013: Exited; job has been forced to exit with exit code 2.
                          The CPU time used is unknown;
Tue May 21 01:10:59 2013: Completed <exit>; TERM_REMOVE_HUNG_JOB: job removed from the
                          LSF system
 
Summary of time in seconds spent in various states by Tue May 21 13:23:06 2013
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  44147    0        375      0        0        81       44603

In this output, exit code 1 indicates a job removed by the runlimit condition, and exit code 2 indicates a job removed by the host_unavail condition.
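
For sites that post-process bhist output, the exit code mapping can be applied with a small script. The following Python sketch is illustrative only. The function name is hypothetical, and it assumes the wording shown in the sample output above ("forced to exit with exit code N" and the TERM_REMOVE_HUNG_JOB reason), which may differ between LSF versions.

```python
import re
from typing import Optional

# Exit code to removal condition, as described in the text.
REMOVAL_CONDITIONS = {1: "runlimit", 2: "host_unavail"}

_EXIT_CODE_RE = re.compile(r"forced to exit with exit code (\d+)")


def hung_job_removal_condition(bhist_text: str) -> Optional[str]:
    """Return the condition that removed a hung job, or None.

    Assumes the message wording from the sample bhist output; this is
    not a documented, stable format.
    """
    # bhist wraps long records, so collapse line breaks and indentation
    # before matching. A token split in the middle of a word would still
    # be missed by this simple approach.
    flat = re.sub(r"\s+", " ", bhist_text)
    if "TERM_REMOVE_HUNG_JOB" not in flat:
        return None
    match = _EXIT_CODE_RE.search(flat)
    if match is None:
        return None
    return REMOVAL_CONDITIONS.get(int(match.group(1)))


sample = """\
Tue May 21 01:05:59 2013: Unknown; unable to reach the execution host;
Tue May 21 01:10:59 2013: Exited; job has been forced to exit with exit code 2.
                          The CPU time used is unknown;
Tue May 21 01:10:59 2013: Completed <exit>; TERM_REMOVE_HUNG_JOB: job removed from the
                          LSF system
"""

print(hung_job_removal_condition(sample))  # host_unavail
```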

When defining REMOVE_HUNG_JOBS_FOR, note the following:

Restarting mbatchd or running badmin reconfig resets the timeout value for jobs with a HOST_UNAVAIL condition.
Rerunnable jobs are not removed from LSF since they can be dispatched to other hosts.
A hung job that is removed is counted in the exit rate calculation when the exit rate type is JOBEXIT.
If a HOST_UNAVAIL condition is satisfied, mbatchd removes entire running chunk jobs and waiting chunk jobs. If a runlimit condition is satisfied, only RUNNING or UNKNOWN members of chunk jobs are removed.
In MultiCluster mode, an unavailable host condition (HOST_UNAVAIL) works for local hosts and jobs. The forwarded job is handled by the execution cluster depending on how REMOVE_HUNG_JOBS_FOR is configured in the execution cluster.
When the LSF Advanced Edition LSF/XL feature is defined, if the remote host is unavailable, mbatchd removes the job based on the timeout value specified in the execution cluster.
If both HOST_UNAVAIL and runlimit are defined (or ALL), the job is removed for whichever condition is satisfied first.
