-k

Makes a job checkpointable and specifies the checkpoint directory.

Categories

properties

Synopsis

bsub -k "checkpoint_dir [init=initial_checkpoint_period] [checkpoint_period] [method=method_name]"

Description

Specify a relative or absolute path name. The quotes (") are required if you specify a checkpoint period, initial checkpoint period, or custom checkpoint and restart method name.

The job ID and job file name are concatenated to the checkpoint dir when creating a checkpoint file.

Note: The file path of the checkpoint directory can contain up to 4000 characters for UNIX and Linux, or up to 255 characters for Windows, including the directory and file name.

When a job is checkpointed, the checkpoint information is stored in checkpoint_dir/job_ID/file_name. Multiple jobs can checkpoint into the same directory. The system can create multiple files.

The checkpoint directory is used for restarting the job (see brestart(1)). The checkpoint directory can be any valid path.

Optionally, specifies a checkpoint period in minutes. Specify a positive integer. The running job is checkpointed automatically every checkpoint period. The checkpoint period can be changed using bchkpnt. Because checkpointing is a heavyweight operation, you should choose a checkpoint period greater than half an hour.

Optionally, specifies an initial checkpoint period in minutes. Specify a positive integer. The first checkpoint does not happen until the initial period has elapsed. After the first checkpoint, the job checkpoint frequency is controlled by the normal job checkpoint interval.

Optionally, specifies a custom checkpoint and restart method to use with the job. Use method=default to indicate to use the default LSF checkpoint and restart programs for the job, echkpnt.default and erestart.default.

The echkpnt.method_name and erestart.method_name programs must be in LSF_SERVERDIR or in the directory specified by LSB_ECHKPNT_METHOD_DIR (environment variable or set in lsf.conf).

If a custom checkpoint and restart method is already specified with LSB_ECHKPNT_METHOD (environment variable or in lsf.conf), the method you specify with bsub -k overrides this.

Process checkpointing is not available on all host types, and may require linking programs with a special libraries (see libckpt.a(3)). LSF invokes echkpnt (see echkpnt(8)) found in LSF_SERVERDIR to checkpoint the job. You can override the default echkpnt for the job by defining as environment variables or in lsf.conf LSB_ECHKPNT_METHOD and LSB_ECHKPNT_METHOD_DIR to point to your own echkpnt. This allows you to use other checkpointing facilities, including application-level checkpointing.

The checkpoint method directory should be accessible by all users who need to run the custom echkpnt and erestart programs.

Only running members of a chunk job can be checkpointed.