By default, LSF handles job exceptions for jobs that exit after they have started running. You can also configure LSF to handle jobs that exit during initialization because of an execution environment problem, or because of a user action or LSF policy.
LSF detects that the jobs are exiting before they actually start running, and takes appropriate action when the job exit rate exceeds the threshold for specific hosts (EXIT_RATE in lsb.hosts) or for all hosts (GLOBAL_EXIT_RATE in lsb.params).
Exit rate type ... | Includes ... |
---|---|
JOBEXIT | Local exited jobs; remote job initialization failures; parallel job initialization failures on hosts other than the first execution host; jobs exited by user action (e.g., bkill, bstop, etc.) or LSF policy (e.g., load threshold exceeded, job control action, advance reservation expired, etc.) |
JOBEXIT_NONLSF (this is the default when EXIT_RATE_TYPE is not set) | Local exited jobs; remote job initialization failures; parallel job initialization failures on hosts other than the first execution host |
JOBINIT | Local job initialization failures; parallel job initialization failures on the first execution host |
HPCINIT | Job initialization failures for HPC jobs |
By default, jobs that are exited for non-host related reasons (user actions and LSF policies) are not counted in the exit rate calculation. Only jobs that are exited for what LSF considers host-related problems are used to calculate a host exit rate. The following cases are not included in the exit rate calculation:
bkill, bkill -r
brequeue
RERUNNABLE jobs killed when a host is unavailable
Resource usage limit exceeded (for example, PROCESSLIMIT or CPULIMIT)
Queue-level job control action TERMINATE and TERMINATE_WHEN
Checkpointing a job with the kill option (bchkpnt -k)
Rerunnable job migration
Job killed when an advance reservation has expired
Remote lease job start fails
Any jobs with an exit code found in SUCCESS_EXIT_VALUES, where a particular exit value is deemed as successful.
To explicitly exclude jobs exited because of user actions or LSF-related policies from the job exit rate calculation, set EXIT_RATE_TYPE = JOBEXIT_NONLSF in lsb.params. JOBEXIT_NONLSF tells LSF to include all job exits except those that are related to user action or LSF policy. This is the default value for EXIT_RATE_TYPE.
To include all job exit cases in the exit rate count, you must set EXIT_RATE_TYPE = JOBEXIT in lsb.params. JOBEXIT considers all job exits.
Jobs killed by a signal external to LSF are still counted towards the exit rate.
Jobs killed because of a job control SUSPEND action or RESUME action are still counted towards the exit rate. This is because LSF cannot distinguish between jobs killed by a SUSPEND action and jobs killed by external signals.
If both JOBEXIT and JOBEXIT_NONLSF are defined, JOBEXIT_NONLSF is used.
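The relationship between the EXIT_RATE_TYPE values and the kinds of job exit they count can be sketched in a few lines of Python. This is an illustration only, not LSF code: the category labels, the `INCLUDES` mapping, and both function names are hypothetical, and the sketch assumes that listing several types counts the union of their categories.

```python
# Hypothetical labels for the categories listed in the exit rate type table.
LOCAL_EXIT = "local exited jobs"
REMOTE_INIT = "remote job initialization failures"
PARALLEL_INIT_OTHER = "parallel job initialization failures on other hosts"
PARALLEL_INIT_FIRST = "parallel job initialization failures on first host"
LOCAL_INIT = "local job initialization failures"
HPC_INIT = "job initialization failures for HPC jobs"
USER_OR_POLICY = "jobs exited by user action or LSF policy"

# What each exit rate type includes, following the table.
INCLUDES = {
    "JOBEXIT": {LOCAL_EXIT, REMOTE_INIT, PARALLEL_INIT_OTHER, USER_OR_POLICY},
    "JOBEXIT_NONLSF": {LOCAL_EXIT, REMOTE_INIT, PARALLEL_INIT_OTHER},
    "JOBINIT": {LOCAL_INIT, PARALLEL_INIT_FIRST},
    "HPCINIT": {HPC_INIT},
}
DEFAULT_TYPE = "JOBEXIT_NONLSF"


def effective_types(setting=""):
    """Return the exit rate types in effect for an EXIT_RATE_TYPE value."""
    types = set(setting.split()) or {DEFAULT_TYPE}
    unknown = types - set(INCLUDES)
    if unknown:
        raise ValueError("unknown exit rate type(s): %s" % sorted(unknown))
    # If both JOBEXIT and JOBEXIT_NONLSF are defined, JOBEXIT_NONLSF is used.
    if {"JOBEXIT", "JOBEXIT_NONLSF"} <= types:
        types.discard("JOBEXIT")
    return types


def counted_categories(setting=""):
    """Union of the categories counted by the types in effect (assumption)."""
    counted = set()
    for exit_type in effective_types(setting):
        counted |= INCLUDES[exit_type]
    return counted


if __name__ == "__main__":
    for value in ("", "JOBEXIT", "JOBEXIT JOBEXIT_NONLSF", "JOBEXIT_NONLSF JOBINIT"):
        print(repr(value), "->", sorted(counted_categories(value)))
```

With an empty setting the sketch falls back to JOBEXIT_NONLSF, so exits caused by user action or LSF policy are left out unless JOBEXIT is set on its own.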
Host-related failures; for example, incorrect user account, user permissions, incorrect directories for checkpointable jobs, host name resolution failed, or other execution environment problems
Job-related failures; for example, pre-execution or setup problem, job file not created, etc.
By default, or when EXIT_RATE_TYPE=JOBEXIT_NONLSF, job initialization failures on the first execution host are not counted in the job exit rate calculation. Job initialization failures on hosts other than the first execution host are counted in the exit rate calculation.
When EXIT_RATE_TYPE=JOBINIT, job initialization failures that happen on the first execution host are counted in the job exit rate calculation. Job initialization failures on hosts other than the first execution host are not counted in the exit rate calculation.
For parallel job exit exceptions to be counted for all hosts, specify EXIT_RATE_TYPE=HPCINIT or EXIT_RATE_TYPE=JOBEXIT_NONLSF JOBINIT.
By default, or when EXIT_RATE_TYPE=JOBEXIT_NONLSF, job initialization failures are counted as exited jobs on the remote execution host and are included in the exit rate calculation for that host. To include only local job initialization failures on the execution cluster in the exit rate calculation, set EXIT_RATE_TYPE to include only JOBINIT or HPCINIT.
On large, multiprocessor hosts, use ENABLE_EXIT_RATE_PER_SLOT=Y in lsb.params to scale the job exit rate so that the host is only closed when the job exit rate is high enough in proportion to the number of processors on the host. This avoids having a relatively low exit rate close a host inappropriately.
Use a float value for GLOBAL_EXIT_RATE in lsb.params to tune the exit rate on multislot hosts. The actual calculated exit rate value is never less than 1.
On a single-processor host, a job exit rate of 5 is much more severe than on a 20-processor host. If a stream of jobs to a single-processor host is consistently failing, it is reasonable to close the host or take some other action after five failures.
On the other hand, for the same stream of jobs on a 20-processor host, it is possible that 19 of the processors are busy doing other work that is running fine. To close this host after only 5 failures would be wrong because effectively less than 5% of the jobs on that host are actually failing.
Using a float value for GLOBAL_EXIT_RATE allows the exit rate to be less than the number of slots on the host. For example, on a host with four slots, GLOBAL_EXIT_RATE=0.25 gives an exit rate of 1. The same value on an eight-slot host gives an exit rate of 2, and so on. On a single-slot host, the value is never less than 1.
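The per-slot arithmetic in these examples can be reproduced with a short Python sketch. It assumes the scaled threshold is the configured rate multiplied by the number of slots, with a floor of 1. That formula is inferred from the examples in this section, not taken from LSF itself. The function name `scaled_exit_rate` is hypothetical, and because the text does not say how non-integer products are rounded, the sketch leaves them unrounded.

```python
def scaled_exit_rate(configured_rate, slots):
    """Per-slot exit rate threshold for a host (assumed formula).

    configured_rate: the GLOBAL_EXIT_RATE value, which may be a float.
    slots: number of slots on the host.
    """
    if configured_rate <= 0:
        raise ValueError("configured_rate must be positive")
    if slots < 1:
        raise ValueError("slots must be at least 1")
    # Scale by slot count; the calculated value is never less than 1.
    return max(1.0, configured_rate * slots)


if __name__ == "__main__":
    for slots in (1, 4, 8):
        print(slots, "slot(s):", scaled_exit_rate(0.25, slots))
    # 1 slot(s): 1.0
    # 4 slot(s): 1.0
    # 8 slot(s): 2.0
```

Under this assumption, a rate of 0.25 closes a one-slot or four-slot host after one failure and an eight-slot host after two, which matches the values given in the text.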