Configure automatic job requeue

Procedure

To configure automatic job requeue, set REQUEUE_EXIT_VALUES in the queue definition (lsb.queues) or in an application profile (lsb.applications), and list the exit codes that should cause a job to be requeued.

Application-level exit values override queue-level values. Job-level exit values (bsub -Q) override application-level and queue-level values.

Begin Queue 
... 
REQUEUE_EXIT_VALUES = 99 100 
... 
End Queue

With this configuration, LSF requeues any job in the queue that exits with exit code 99 or 100.
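The same parameter can be set in an application profile. The fragment below is a minimal sketch of an lsb.applications entry; the profile name requeue_demo and the exit codes 30 and 31 are hypothetical values chosen for illustration only.

```text
# lsb.applications
# Hypothetical profile: the name and exit codes are illustrative assumptions.
Begin Application
NAME                = requeue_demo
REQUEUE_EXIT_VALUES = 30 31
End Application
```

Under the precedence rule stated earlier, a job that uses this profile in a queue that also defines REQUEUE_EXIT_VALUES would be governed by the application-level values rather than the queue-level values.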

Control how many times a job can be requeued

About this task

By default, if a job fails and its exit value is listed in REQUEUE_EXIT_VALUES, LSF requeues the job automatically. No limit applies by default, so a job that fails repeatedly is requeued indefinitely.

Procedure

To limit the number of times a failed job is requeued, set MAX_JOB_REQUEUE cluster wide (lsb.params), in the queue definition (lsb.queues), or in an application profile (lsb.applications).

Specify an integer greater than zero.

A MAX_JOB_REQUEUE value in lsb.applications overrides the value in lsb.queues, and the value in lsb.queues overrides the cluster-wide value in lsb.params.
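To illustrate the three levels, the following sketch sets MAX_JOB_REQUEUE in each file. The queue name requeue_q, the profile name requeue_demo, and the limits 10, 5, and 2 are hypothetical values, not recommendations.

```text
# lsb.params: cluster-wide default (illustrative value)
Begin Parameters
MAX_JOB_REQUEUE = 10
End Parameters

# lsb.queues: overrides lsb.params for jobs in this queue
# (hypothetical queue name)
Begin Queue
QUEUE_NAME          = requeue_q
REQUEUE_EXIT_VALUES = 99 100
MAX_JOB_REQUEUE     = 5
End Queue

# lsb.applications: overrides both lsb.queues and lsb.params
# for jobs that use this profile (hypothetical profile name)
Begin Application
NAME            = requeue_demo
MAX_JOB_REQUEUE = 2
End Application
```

Assuming these settings, a job in requeue_q that uses the requeue_demo profile would have a requeue limit of 2, a job in requeue_q without the profile would have a limit of 5, and jobs in queues with no setting of their own would fall back to the cluster-wide limit of 10.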

Results

When MAX_JOB_REQUEUE is set and a job fails with an exit value that is listed in REQUEUE_EXIT_VALUES, LSF increases the job's requeue count by 1 and requeues the job. When the requeue limit is reached, the job is suspended with PSUSP status. If a job fails with an exit value that is not listed in REQUEUE_EXIT_VALUES, the job is not requeued.

View the requeue retry limit

Procedure

  1. Run bjobs -l to display the job exit code and reason if the job requeue limit is exceeded.
  2. Run bhist -l to display the exit code and reason for finished jobs if the job requeue limit is exceeded.

Results

The job requeue limit is recovered when LSF is restarted and reconfigured. LSF replays the job requeue limit from the JOB_STATUS event and its pending reason in lsb.events.