Terminate Orphan Jobs

When one job depends on the result of another job and the dependency condition is never satisfied, the dependent job never runs and remains in the system as an orphan job. LSF can automatically terminate jobs that are orphaned when a job they depend on fails. Orphan job termination is a fully supported solution for LSF 9.1.2. Download the solution named lsf-9.1.2-build229753 from the IBM Support Portal (http://www.ibm.com/eserver/support/fixes/).

About orphan job termination

Complex workflows often rely on job dependencies for proper job sequencing and for handling job failures. A given job, called the parent job, can have child jobs that depend on its state before they can start. If one or more conditions are not satisfied, a child job remains pending. However, if the parent job is in a state such that the event on which the child depends will never occur, the child becomes an orphan job. For example, if a child job has a DONE dependency on the parent job but the parent ends abnormally, the child will never run as a result of the parent’s completion and becomes an orphan job.

In some cases there may be a large number of jobs submitted but many will never run because they require dependency conditions that were never satisfied. Similarly, you may submit job A to do some pre-calculation, and job B may consist of hundreds of analysis jobs that depend on job A generating inputs. If job A fails, hundreds of jobs wait for a condition that will never be true. As such, they become orphan jobs and remain pending in the LSF system.
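
The fan-out pattern described above can be sketched in shell. This is only an illustration and not part of LSF: the function name emit_fanout_jobs, the child job names, and the commands pre_calc and analyze are hypothetical placeholders. The function prints the bsub commands instead of running them, so the workflow can be reviewed before it is submitted.

```shell
#!/bin/sh
# Sketch: print bsub commands for one parent job and N dependent child jobs.
# emit_fanout_jobs, pre_calc, and analyze are hypothetical names used for
# illustration only; substitute your own commands.
emit_fanout_jobs() {
    parent=$1      # name of the parent job, for example JobA
    count=$2       # number of child jobs to generate

    # Parent job: produces the inputs that every child needs.
    printf 'bsub -J "%s" pre_calc\n' "$parent"

    # Child jobs: each can start only after the parent reaches DONE.
    # If the parent fails instead, every one of them becomes an orphan.
    i=1
    while [ "$i" -le "$count" ]; do
        printf 'bsub -w "done(%s)" -J "%s_child_%d" analyze %d\n' \
            "$parent" "$parent" "$i" "$i"
        i=$((i + 1))
    done
}

# Review the generated commands first, then pipe them to sh to submit:
#   emit_fanout_jobs JobA 300
#   emit_fanout_jobs JobA 300 | sh
```

If the parent in such a workflow exits abnormally, all of the generated child jobs stay pending until they are removed manually or by automatic orphan termination.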

Keeping orphan jobs in the system can degrade performance. Pending orphan jobs consume system resources unnecessarily and add load to the LSF daemons, which can reduce their ability to do useful work. You could use external scripts to monitor and terminate orphan jobs, but that would add more work to mbatchd.

Enable orphan job termination in either of two ways:
  • An LSF administrator enables the feature at the cluster level by defining a cluster-wide termination grace period with the parameter ORPHAN_JOB_TERM_GRACE_PERIOD in lsb.params. The cluster-wide termination grace period applies to all dependent jobs in the cluster.

  • Users enforce immediate automatic orphan termination on a per-job basis by adding the -ti suboption to a dependency specified with bsub -w. This works even if the feature is disabled at the cluster level. Dependent jobs submitted with this suboption that later become orphans are terminated immediately, without the grace period, even if a grace period is defined.

Define a cluster-wide termination grace period

To avoid prematurely killing dependent jobs that users may still want to keep, LSF terminates a dependent job only after at least a configurable grace period has elapsed. The orphan termination grace period is the minimum amount of time, starting from the point when a child job’s dependency becomes invalid, that the child job must wait before it is eligible for automatic orphan termination.

mbatchd periodically scans the job list and identifies jobs whose dependencies can never be met. The number of job dependencies to evaluate per session is controlled by the cluster-wide parameter EVALUATE_JOB_DEPENDENCY. If an orphan is detected and it meets the grace period criteria, mbatchd kills the orphan as part of dependency evaluation processing.

Due to various implementation and run-time factors (such as how busy mbatchd is serving other requests), the actual elapsed time before dependent jobs are automatically killed can be longer than the specified grace period. However, LSF ensures that dependent jobs are terminated only after at least the grace period has elapsed.

For multiple dependent jobs in a dependency tree, the grace period is not repeated at each dependency level. This avoids taking an extremely long time to terminate all dependent jobs in a large dependency tree. When a job is killed, its entire sub-tree of orphaned dependents can be killed after the grace period has expired.

The elapsed time for ORPHAN_JOB_TERM_GRACE_PERIOD is carried over after a restart, so the grace period does not start over when LSF restarts.
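
The timing rules in this section can be modeled with a few lines of shell arithmetic. This is a simplified sketch of the rules as described here, not how mbatchd is implemented; the function names and arguments are hypothetical. Because eligibility is computed from a recorded timestamp, the model also reflects the statement that elapsed time is carried over a restart.

```shell
#!/bin/sh
# Sketch of the grace-period rule. All names are hypothetical.

# Succeed when a child job is eligible for automatic orphan termination:
# at least `grace` seconds have passed since its dependency became invalid.
orphan_eligible() {
    invalid_since=$1   # epoch seconds when the dependency became invalid
    now=$2             # current epoch seconds
    grace=$3           # ORPHAN_JOB_TERM_GRACE_PERIOD, in seconds
    [ $((now - invalid_since)) -ge "$grace" ]
}

# Print the earliest time at which a descendant in the orphaned sub-tree
# may be terminated. The depth argument is accepted only to make the point
# that the grace period is applied once, not once per dependency level.
subtree_eligible_at() {
    root_invalid_since=$1
    grace=$2
    depth=$3           # intentionally unused
    echo $((root_invalid_since + grace))
}

# Example: with a 90-second grace period, a child whose dependency became
# invalid at time 1000 is first eligible at time 1090. The actual
# termination can happen later, depending on how busy mbatchd is.
```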

For example, to use a cluster-wide termination grace period:

  1. Set ORPHAN_JOB_TERM_GRACE_PERIOD=90 in lsb.params.

  2. Run badmin reconfig to have the changes take effect.

  3. Submit a parent job. For example:

    bsub -J "JobA" sleep 100

  4. Submit child jobs. For example:

    bsub -w "done(JobA)" sleep 100

  5. (Optional) Use commands such as bjobs -l, bhist -l or bparams -l to query orphan termination settings. For example:

    bparams -l
    Grace period for the automatic termination of orphan jobs:
    ORPHAN_JOB_TERM_GRACE_PERIOD = 90 (seconds)

  6. The parent job is killed. Some orphan jobs must wait for the grace period to expire before they can be terminated by LSF.

  7. Use commands such as bjobs -l, bhist -l or bacct -l to query orphaned jobs terminated by LSF. For example:

    bacct -l <dependent job ID/name>
    Job <job ID>, User <user1>, Project <default>, Status <EXIT>, Queue <normal>,
    Command <sleep 100>
    Thu Jan 23 14:26:27: Submitted from host <hostA>, CWD <$HOME/lsfcluster/conf>;
    Thu Jan 23 14:26:56: Completed <exit>; TERM_ORPHAN_SYSTEM: orphaned job
                                           terminated automatically by LSF.
     
    Accounting information about this job:
         CPU_T     WAIT     TURNAROUND   STATUS     HOG_FACTOR    MEM    SWAP
          0.00       29             29     exit         0.0000     0M      0M

    Note: If you run bhist with versions of LSF earlier than 9.1.3, you see Signal <KILL> requested by user or administrator <system>. This message is equivalent to Signal <KILL> requested by LSF in LSF 9.1.3 and means that the orphan job was terminated automatically by LSF.
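
The checks in steps 5 and 7 can be scripted. The following sketch defines two helpers that read command output on standard input. The function names are hypothetical, and the patterns are based only on the sample output in this example, so they may need adjusting for other LSF versions.

```shell
#!/bin/sh
# Sketch: helpers that read LSF command output on standard input.
# Function names are hypothetical; patterns follow the sample output above.

# Print the configured grace period in seconds, or nothing if not found.
# Usage: bparams -l | orphan_grace_period
orphan_grace_period() {
    sed -n 's/^[[:space:]]*ORPHAN_JOB_TERM_GRACE_PERIOD[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Succeed if the job report contains the TERM_ORPHAN_SYSTEM reason.
# Usage: bacct -l <job ID> | was_orphan_terminated
was_orphan_terminated() {
    grep -q 'TERM_ORPHAN_SYSTEM'
}
```

For example, `bacct -l 135 | was_orphan_terminated && echo "terminated as orphan"` prints a message only when the report for the given job ID carries the orphan termination reason.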

Enforce automatic orphan termination on a per-job basis

The -ti suboption of -w for bsub (that is, bsub -w 'dependency_expression' [-ti]) allows users to indicate that a job is eligible for automatic and immediate termination by the system as soon as the job is found to be an orphan, without waiting for the grace period to expire. The behavior is enforced even if automatic orphan termination is not enabled at the cluster level. This is useful if a user does not want to wait for the grace period set by the administrator, or if automatic orphan termination is not enabled in the cluster.

Note that for bmod, -ti is a command option, not a sub-option, and you do not need to re-specify the original dependency expression from the -w option submitted with bsub.

The -ti suboption is also useful in experimental scenarios where a job spawns additional jobs to self-propagate through a problem, similar to solving a maze. When a junction is reached, new jobs are spawned to search each possible direction, and this repeats at each junction. From one initial job you can get a complex tree structure until one job reaches a solution. At that point none of the other jobs are needed. If you kill the other running jobs, all their dependent jobs are orphaned and should be terminated.

With the -ti option, LSF terminates a job only once mbatchd has detected it, evaluated its dependency, and determined it to be an orphan. This means you may not see the job terminate immediately.
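
For a dependent job that was already submitted without -ti, the bmod form noted earlier in this section can be wrapped in a small helper. This is a sketch under assumptions: the helper name and the DRY_RUN variable are hypothetical, the job ID is assumed to be passed as the bmod operand, and only plain numeric job IDs are accepted for simplicity.

```shell
#!/bin/sh
# Sketch: mark an already-submitted dependent job for immediate orphan
# termination. enable_immediate_orphan_term and DRY_RUN are hypothetical;
# set DRY_RUN=1 to print the command instead of running it.
enable_immediate_orphan_term() {
    job_id=$1
    # Accept only a plain numeric job ID (a simplification for this sketch).
    case $job_id in
        ''|*[!0-9]*)
            echo "usage: enable_immediate_orphan_term <job ID>" >&2
            return 2 ;;
    esac
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "bmod -ti $job_id"
    else
        # For bmod, -ti is a command option; the -w dependency expression
        # from the original bsub submission is not repeated.
        bmod -ti "$job_id"
    fi
}
```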

For example, to enforce automatic orphan job termination on a per-job basis:

  1. Submit a parent job. For example:

    bsub -J "JobA" sleep 100

  2. Submit child jobs with the -ti option to ignore the grace period. For example:

    bsub -w "done(JobA)" -J "JobB" -ti sleep 100

  3. (Optional) Use commands such as bjobs -l or bhist -l to query orphan termination settings. For example:

    bhist -l <dependent job ID/name>
    Job <135>, Job Name <JobB>, User <user1>, Project <default>, Command <sleep 100>
    Thu Jan 23 13:25:35: Submitted from host <hostA>, to Queue <normal>, CWD
                         <$HOME/lsfcluster/conf>, Dependency Condition <done(JobA)>
                         - immediate orphan termination for job <Y>;

  4. The parent job is killed. LSF immediately and automatically kills the orphan jobs submitted with the -ti sub-option.

  5. Use commands such as bjobs -l or bhist -l to query orphaned jobs terminated by LSF. For example:

    bjobs -l <dependent job ID/name>
    Job <135>, Job Name <JobB>, User <user1>, Project <default>, Status <EXIT>,
    Queue <normal>, Command <sleep 100>
    Thu Jan 23 13:25:42: Submitted from host <hostA>, CWD <$HOME/lsfcluster/conf/
                         sbatch/lsfcluster/configdir>, Dependency Condition 
                         <done(JobA)> - immediate orphan termination for job <Y>;
    Thu Jan 23 13:25:49: Exited
    Thu Jan 23 13:25:49: Completed <exit>; TERM_ORPHAN_SYSTEM:
                         orphaned job terminated automatically by LSF.
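
To confirm from a script that a job was submitted with -ti, the submission record shown in steps 3 and 5 can be searched for the "immediate orphan termination for job <Y>" text. The helper below is a sketch with a hypothetical name. Because bjobs -l and bhist -l wrap long records across lines, it removes all whitespace before matching.

```shell
#!/bin/sh
# Sketch: succeed if bjobs -l or bhist -l output on standard input shows
# that the job was submitted with immediate orphan termination.
# has_immediate_orphan_term is a hypothetical helper name.
has_immediate_orphan_term() {
    # Records are wrapped at arbitrary points, so strip spaces, tabs, and
    # newlines first and match against the whitespace-free phrase.
    tr -d ' \t\n' | grep -q 'immediateorphanterminationforjob<Y>'
}

# Usage:
#   bhist -l <job ID> | has_immediate_orphan_term && echo "-ti is set"
```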

How LSF uses automatic orphan job termination

  • LSF takes a best-effort approach to discovering orphaned jobs in a cluster, meaning that there is no guarantee that all jobs whose dependencies can never be satisfied are identified and reported as orphans.

  • Orphan jobs terminated automatically by LSF are logged in lsb.events and lsb.acct. For example, you may see the following in lsb.events:

    "JOB_SIGNAL" "9.12" 1390855455 9431 -1 1 "KILL" 0 "system" "" -1 "" -1

  • Similar to -w, the -ti sub-option is not valid for a forwarded remote job.

  • For automatic orphan termination, if the dependency was specified with a job name and there are multiple jobs with the same name, evaluating the status of a child job depends on the JOB_DEP_LAST_SUB parameter:

    • If set to 1, a child job's dependency is evaluated based on the most recently submitted parent job with that name. So killing an older parent with that job name does not affect the child and does not cause it to become an orphan.

    • If not set, a child job's dependency is evaluated based on all previous parent jobs with that name. So killing any previous parent with that job name impacts the child job and causes it to become an orphan.

  • When manually requeuing a running, user-suspended, or system-suspended parent job, the automatic orphan termination mechanism will not prematurely terminate temporary orphans.

    When manually requeuing an exited or done parent job, the job’s dependents may have become orphans and terminated automatically. You must requeue the parent job and any terminated orphan jobs to restart the job flow.

    If automatic requeue is configured for a parent job which has dependents, when the parent job finishes, the automatic orphan termination feature will not prematurely terminate its temporary orphan jobs while the parent job is requeued.

  • When using bjdepinfo, note that it does not consider the running state of the dependent job. Its output is based on the current dependency evaluation. You can get a reason such as "is invalid", "never satisfied", or "not satisfied" even for a running or finished job.

  • If a parent job is checkpointed, its dependents may become orphans. As a result, if automatic orphan termination is enabled, these orphans can be terminated by LSF before the user restarts the parent job.

  • Orphan jobs terminated by the system automatically are logged with the exit code TERM_ORPHAN_SYSTEM and cleaned from mbatchd memory after the time interval specified by CLEAN_PERIOD.
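
The effect of JOB_DEP_LAST_SUB described in the list above can be modeled as a small decision function. This sketch covers only a done(job_name) dependency and only the rule as stated here; it is not LSF code. The function name, the use of 0 to stand for "not set", and the simplified state names are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: decide whether a child with a done(name) dependency is an orphan
# when several parent jobs share that name. All names are hypothetical.
#
# Arguments: the JOB_DEP_LAST_SUB value (1, or 0 to stand for "not set"),
# followed by the states of the same-named parent jobs in submission
# order, oldest first. EXIT stands for a parent that was killed or failed.
child_is_orphan() {
    last_sub=$1
    shift
    if [ "$last_sub" = 1 ]; then
        # Only the most recently submitted parent is evaluated.
        latest=""
        for state in "$@"; do latest=$state; done
        [ "$latest" = EXIT ]
    else
        # Every parent with that name is evaluated, so one killed or
        # failed parent is enough to orphan the child.
        for state in "$@"; do
            [ "$state" = EXIT ] && return 0
        done
        return 1
    fi
}

# Example: an older parent was killed, a newer one is still running.
#   child_is_orphan 1 EXIT RUN   # fails: the child is not an orphan
#   child_is_orphan 0 EXIT RUN   # succeeds: the child is an orphan
```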