REMOVE_HUNG_JOBS_FOR

Syntax

REMOVE_HUNG_JOBS_FOR = runlimit[,wait_time=min] | host_unavail[,wait_time=min] | runlimit[,wait_time=min]:host_unavail[,wait_time=min] | all[,wait_time=min]
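
To illustrate how the alternatives in the syntax line combine, the following Python sketch splits a parameter value into its conditions and wait times. It is not the parser that LSF uses. The function name parse_remove_hung_jobs_for and the constant DEFAULT_WAIT_MIN are hypothetical, and the sketch assumes the 10-minute wait_time default given in the description.

```python
DEFAULT_WAIT_MIN = 10  # assumed default wait_time, in minutes


def parse_remove_hung_jobs_for(value):
    """Return {condition: wait_time_minutes} for a REMOVE_HUNG_JOBS_FOR value.

    Hypothetical helper for illustration only; not part of LSF.
    """
    result = {}
    # A colon separates the runlimit and host_unavail parts.
    for part in value.split(":"):
        fields = [f.strip() for f in part.split(",")]
        condition = fields[0]
        wait = DEFAULT_WAIT_MIN
        for field in fields[1:]:
            key, _, val = field.partition("=")
            if key.strip() != "wait_time":
                raise ValueError(f"unknown option: {field!r}")
            wait = int(val)
        if condition == "all":
            # 'all' selects both conditions with the same wait_time.
            names = ["runlimit", "host_unavail"]
        elif condition in ("runlimit", "host_unavail"):
            names = [condition]
        else:
            raise ValueError(f"unknown condition: {condition!r}")
        for name in names:
            result[name] = wait
    return result
```

For example, parse_remove_hung_jobs_for("runlimit,wait_time=5:host_unavail") yields a 5-minute wait for runlimit and the default 10-minute wait for host_unavail.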

Description

Hung jobs are removed under the following conditions:
  • host_unavail: Hung jobs are automatically removed if the first execution host is unavailable and a timeout is reached as specified by wait_time in the parameter configuration. The default value of wait_time is 10 minutes.

    Hung jobs of any status (RUN, SSUSP, etc.) are candidates for removal by LSF when the timeout is reached.

  • runlimit: The hung job is removed after the job’s run limit is reached. You can use the wait_time option to specify a timeout for removal after the run limit is reached. The default value of wait_time is 10 minutes. For example, if REMOVE_HUNG_JOBS_FOR is defined with runlimit,wait_time=5 and JOB_TERMINATE_INTERVAL is not set, mbatchd removes the job 5 minutes after the job’s run limit is reached.

    Hung jobs in RUN status are considered for removal once the run limit plus wait_time has elapsed.

    For backwards compatibility with earlier versions of LSF, REMOVE_HUNG_JOBS_FOR = runlimit is handled as before: the grace period is 10 minutes + MAX(6 seconds, JOB_TERMINATE_INTERVAL), where JOB_TERMINATE_INTERVAL is specified in lsb.params. The grace period begins only once the job’s run limit has been reached.

  • all: Specifies hung job removal for all conditions (both runlimit and host_unavail). The hung job is removed when the first condition is satisfied. For example, if a job has a run limit but becomes hung because a host is unavailable before the run limit is reached, the job (running, suspended, etc.) is removed 10 minutes after the host becomes unavailable. mbatchd places the job in EXIT status.
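
The timing rules in the list above can be worked through with a short Python sketch. It is an illustration under stated assumptions, not LSF's implementation: the function names are hypothetical, all times are plain seconds on an arbitrary clock, and JOB_TERMINATE_INTERVAL is assumed to be expressed in seconds.

```python
def legacy_runlimit_grace_seconds(job_terminate_interval):
    """Backward-compatible grace period for a bare 'runlimit' setting.

    10 minutes + MAX(6 seconds, JOB_TERMINATE_INTERVAL), counted from the
    moment the run limit is reached. Assumes the interval is in seconds.
    """
    return 10 * 60 + max(6, job_terminate_interval)


def first_removal_time(runlimit_reached_at=None, host_unavail_at=None,
                       runlimit_wait_min=10, host_wait_min=10):
    """Earliest removal time when both conditions are enabled ('all').

    Each argument ending in _at is the time (in seconds) at which that
    condition started, or None if it never occurred. The job is removed
    when the first condition's timeout expires.
    """
    deadlines = []
    if runlimit_reached_at is not None:
        deadlines.append(runlimit_reached_at + runlimit_wait_min * 60)
    if host_unavail_at is not None:
        deadlines.append(host_unavail_at + host_wait_min * 60)
    return min(deadlines) if deadlines else None
```

With a JOB_TERMINATE_INTERVAL of 10 seconds, the legacy grace period works out to 610 seconds. If a host becomes unavailable at 1200 seconds and the run limit would be reached at 3600 seconds, the host_unavail condition expires first, at 1800 seconds.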

For a host_unavail condition, the wait_time countdown starts from the moment mbatchd detects that the host is unavailable. Running badmin mbdrestart or badmin reconfig while the timeout is in progress restarts the countdown from 0.
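
The effect of a restart on the host_unavail countdown can be sketched as follows. This is a simplified model, not LSF code: the function name is hypothetical, times are plain seconds, and it assumes that mbatchd detects the unavailable host again immediately after each restart or reconfiguration.

```python
def host_unavail_removal_time(detected_at, restart_times, wait_min=10):
    """Time at which a hung job would be removed for host_unavail.

    detected_at   -- when mbatchd first detects the unavailable host
    restart_times -- times of badmin mbdrestart / badmin reconfig runs
    wait_min      -- wait_time in minutes (10 is the documented default)
    """
    wait = wait_min * 60
    start = detected_at
    for t in sorted(restart_times):
        # Only a restart during an active countdown resets it to 0.
        if start <= t < start + wait:
            start = t
    return start + wait
```

With detection at 0 seconds and a 10-minute wait_time, a restart at 300 seconds moves the removal time from 600 to 900 seconds, whereas a restart at 700 seconds comes after the timeout has already expired and has no effect in this model.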

For a runlimit condition, wait_time is the time that the job in the UNKNOWN state takes to reach the runlimit.

Default

Not defined.