How LSF submits and controls chunk jobs

When a job is submitted to a queue or application profile that is configured with the CHUNK_JOB_SIZE parameter, LSF attempts to place the job in an existing chunk. A job is added to an existing chunk if it has the same characteristics as the first job in the chunk:
  • Submitting user

  • Resource requirements

  • Host requirements

  • Queue or application profile

  • Job priority

If a suitable host is found to run the job, but there is no chunk available with the same characteristics, LSF creates a new chunk.

Resources reserved for any member of the chunk are reserved at the time the chunk is dispatched and held until the whole chunk finishes running. Other jobs requiring the same resources are not dispatched until the chunk job is done.
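As a sketch of how chunking might be enabled (the queue name, chunk size, and job scripts below are hypothetical examples, not taken from this document), CHUNK_JOB_SIZE is set in the queue definition and jobs are then submitted as usual:

```shell
# Hypothetical lsb.queues fragment enabling chunks of up to 4 jobs:
#
#   Begin Queue
#   QUEUE_NAME     = short
#   CHUNK_JOB_SIZE = 4
#   End Queue
#
# After reconfiguring (badmin reconfig), jobs submitted by the same
# user with matching requirements may be placed in the same chunk:
bsub -q short ./task1.sh
bsub -q short ./task2.sh   # same user, queue, and requirements: may join task1's chunk
```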

WAIT status

When sbatchd receives a chunk job, it does not start all member jobs at once. A chunk job occupies a single job slot. Even if other slots are available, the chunk job members must run one at a time in the job slot they occupy. The remaining jobs in the chunk that are waiting to run are displayed as WAIT by bjobs. Any jobs in WAIT status are included in the count of pending jobs by bqueues and busers. In bhosts output, the entire chunk job occupies a single job slot, which is counted once in the NJOBS column.

The bhist -l command shows jobs in WAIT status as Waiting ...

The bjobs -l command does not display a WAIT reason in the list of pending jobs.
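To make the member states above concrete (the job ID here is hypothetical), the chunk can be inspected with the commands just described:

```shell
bjobs          # the running member shows RUN; waiting members show WAIT
bhist -l 102   # a WAIT member is reported as "Waiting ..."
bhosts         # the whole chunk is counted as one slot in NJOBS
bqueues        # WAIT members are counted among the pending jobs
```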

Control chunk jobs

Job controls affect the state of the members of a chunk job. You can perform the following actions on jobs in a chunk job:

| Action (Command) | Job State | Effect on Job (State) |
| --- | --- | --- |
| Suspend (bstop) | PEND | Removed from chunk (PSUSP) |
| | RUN | All jobs in the chunk are suspended (NRUN -1, NSUSP +1) |
| | USUSP | No change |
| | WAIT | Removed from chunk (PSUSP) |
| Kill (bkill) | PEND | Removed from chunk (NJOBS -1, PEND -1) |
| | RUN | Job finishes, next job in the chunk starts if one exists (NJOBS -1, PEND -1) |
| | USUSP | Job finishes, next job in the chunk starts if one exists (NJOBS -1, PEND -1, SUSP -1, RUN +1) |
| | WAIT | Job finishes (NJOBS -1, PEND -1) |
| Resume (bresume) | USUSP | Entire chunk is resumed (RUN +1, USUSP -1) |
| Migrate (bmig) | WAIT | Removed from chunk |
| Switch queue (bswitch) | RUN | Job is removed from the chunk and switched; all other WAIT jobs are requeued to PEND |
| | WAIT | Only the WAIT job is removed from the chunk and switched, and requeued to PEND |
| Checkpoint (bchkpnt) | RUN | Job is checkpointed normally |
| Modify (bmod) | PEND | Removed from the chunk to be scheduled later |

Migrating jobs with bmig changes the dispatch sequence of the chunk job members. They are not redispatched in the order they were originally submitted.
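The effects listed in the table can be exercised with the usual job control commands (the job IDs and queue name below are hypothetical):

```shell
bstop 102           # WAIT member: removed from the chunk, placed in PSUSP
bkill 101           # RUN member: killed; the next WAIT member in the chunk starts
bresume 103         # USUSP member: the entire chunk resumes
bswitch night 104   # WAIT member: removed from the chunk, requeued as PEND in queue "night"
```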

Rerunnable chunk jobs

If the execution host becomes unavailable, rerunnable chunk job members are removed from the chunk and dispatched to a different execution host.

Checkpoint chunk jobs

Only running chunk jobs can be checkpointed. If bchkpnt -k is used, the job is also killed after the checkpoint file has been created. If a chunk job in WAIT state is checkpointed, mbatchd rejects the checkpoint request.
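For example (job IDs hypothetical):

```shell
bchkpnt 101      # allowed only while job 101 is the running member of the chunk
bchkpnt -k 101   # checkpoint, then kill the job once the checkpoint file is created
bchkpnt 102      # rejected by mbatchd if job 102 is in WAIT state
```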

Fairshare policies and chunk jobs

Fairshare queues can use job chunking. Jobs are accumulated in the chunk job so that priority is assigned to jobs correctly according to the fairshare policy that applies to each user. Jobs belonging to other users are dispatched in other chunks.

TERMINATE_WHEN job control action

If the TERMINATE_WHEN job control action is applied to a chunk job, sbatchd kills the chunk job element that is running and puts the rest of the waiting elements into pending state to be rescheduled later.