Configuration to modify job checkpoint and restart

Configuration parameters let you modify job checkpoint and restart behavior by:
  • Specifying mandatory application-level checkpoint and restart executables that apply to all checkpointable batch jobs in the cluster

  • Specifying the directory that contains customized application-level checkpoint and restart executables

  • Saving standard output and standard error to files in the checkpoint directory

  • Automatically checkpointing jobs before suspending or terminating them

  • For Cray systems only, copying all open job files to the checkpoint directory

Configuration to specify mandatory application-level executables

You can specify mandatory checkpoint and restart executables by defining the parameter LSB_ECHKPNT_METHOD in lsf.conf or as an environment variable.

Configuration file: lsf.conf

Parameter and syntax: LSB_ECHKPNT_METHOD="echkpnt_application"

Behavior:

  • The specified echkpnt runs for all batch jobs submitted to the cluster. At restart, the corresponding erestart runs.

  • For example, if LSB_ECHKPNT_METHOD=fluent, at checkpoint, LSF runs echkpnt.fluent and at restart, LSF runs erestart.fluent.

  • If an LSF user specifies a different echkpnt_application at the job level using bsub -k or bmod -k, the job-level value overrides the value in lsf.conf.
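
The override rule above can be sketched in Python as follows. This is only an illustrative model, not LSF code: the function name resolve_executables and its arguments are hypothetical, and falling back to the "default" method when no method is configured is an assumption of the sketch.

```python
from typing import Optional, Tuple


def resolve_executables(cluster_method: Optional[str],
                        job_method: Optional[str] = None) -> Tuple[str, str]:
    """Return the (echkpnt, erestart) executable names for a job.

    cluster_method models LSB_ECHKPNT_METHOD from lsf.conf.
    job_method models a method given at the job level (bsub -k or bmod -k).
    Falling back to "default" when neither is set is an assumption.
    """
    # The job-level value overrides the cluster-wide value.
    method = job_method or cluster_method or "default"
    return f"echkpnt.{method}", f"erestart.{method}"


print(resolve_executables("fluent"))
print(resolve_executables("fluent", job_method="myapp"))
```

With LSB_ECHKPNT_METHOD=fluent, the first call yields the echkpnt.fluent and erestart.fluent pair from the example above. The second call models a job that names a hypothetical method, myapp, at the job level.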

Configuration to specify the directory for application-level executables

By default, LSF looks for application-level checkpoint and restart executables in LSF_SERVERDIR. You can modify this behavior by specifying a different directory as an environment variable or in lsf.conf.

Configuration file: lsf.conf

Parameter and syntax: LSB_ECHKPNT_METHOD_DIR=path

Behavior:

  • Specifies the absolute path to the directory that contains the echkpnt.application and erestart.application executables

  • User accounts that run these executables must have the correct permissions for the LSB_ECHKPNT_METHOD_DIR directory.
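
As a rough pre-flight check, the directory lookup and permission requirement described above could be modeled as in the following Python sketch. The conf dictionary is a hypothetical stand-in for values read from lsf.conf or the environment, and the function names are illustrative. The check only tests whether the current user can execute the files. It does not reproduce how LSF itself locates or validates them.

```python
import os
from typing import Dict, List


def method_dir(conf: Dict[str, str]) -> str:
    """Return LSB_ECHKPNT_METHOD_DIR if it is set, otherwise LSF_SERVERDIR."""
    return conf.get("LSB_ECHKPNT_METHOD_DIR") or conf["LSF_SERVERDIR"]


def unusable_executables(conf: Dict[str, str], application: str) -> List[str]:
    """Return expected executable paths that are missing or not executable
    by the current user."""
    directory = method_dir(conf)
    expected = [os.path.join(directory, f"{prefix}.{application}")
                for prefix in ("echkpnt", "erestart")]
    return [path for path in expected
            if not (os.path.isfile(path) and os.access(path, os.X_OK))]
```

An empty list from unusable_executables(conf, "fluent") would indicate that both echkpnt.fluent and erestart.fluent exist in the resolved directory and that the account running the check can execute them.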

Configuration to save standard output and standard error

By default, LSF redirects the standard output and standard error from checkpoint and restart executables to /dev/null and discards the data. You can modify this behavior by defining the parameter LSB_ECHKPNT_KEEP_OUTPUT as an environment variable or in lsf.conf.

Configuration file: lsf.conf

Parameter and syntax: LSB_ECHKPNT_KEEP_OUTPUT=Y | y

Behavior:

  • The stdout and stderr for echkpnt.application or echkpnt.default are redirected to the following files in checkpoint_dir/job_ID/:
    • echkpnt.out

    • echkpnt.err

  • The stdout and stderr for erestart.application or erestart.default are redirected to the following files in checkpoint_dir/job_ID/:
    • erestart.out

    • erestart.err
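
To locate the saved output for a given job, the paths listed above can be assembled as in this minimal Python sketch. The function name output_files and the sample checkpoint directory and job ID are hypothetical values chosen for illustration.

```python
import os
from typing import Dict


def output_files(checkpoint_dir: str, job_id: int) -> Dict[str, str]:
    """Map each output file name to its expected path under
    checkpoint_dir/job_ID/."""
    job_dir = os.path.join(checkpoint_dir, str(job_id))
    names = ("echkpnt.out", "echkpnt.err", "erestart.out", "erestart.err")
    return {name: os.path.join(job_dir, name) for name in names}


# Example values only: /scratch/chkpnt and 1234 are placeholders.
for name, path in output_files("/scratch/chkpnt", 1234).items():
    print(f"{name}: {path}")
```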

Configuration to checkpoint jobs before suspending or terminating them

LSF administrators can configure LSF at the queue level to checkpoint jobs before suspending or terminating them.

Configuration file: lsb.queues

Parameter and syntax: JOB_CONTROLS=SUSPEND[CHKPNT] TERMINATE[CHKPNT]

Behavior:

  • LSF checkpoints jobs before suspending or terminating them

  • When suspending a job, LSF checkpoints the job and then stops it by sending the SIGSTOP signal

  • When terminating a job, LSF checkpoints the job and then kills it
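
The ordering described above can be summarized in a small Python model. It only illustrates the sequence of steps for a queue configured to checkpoint first. The function name and step labels are hypothetical, and no particular signal is assumed for the kill step because the text above does not name one.

```python
from typing import List


def checkpointed_control_steps(action: str) -> List[str]:
    """Return the ordered steps for a SUSPEND or TERMINATE action on a queue
    that checkpoints jobs before suspending or terminating them."""
    final_step = {"SUSPEND": "send SIGSTOP", "TERMINATE": "kill job"}
    if action not in final_step:
        raise ValueError(f"unsupported action: {action}")
    # The checkpoint always happens before the job is stopped or killed.
    return ["checkpoint job", final_step[action]]


print(checkpointed_control_steps("SUSPEND"))
print(checkpointed_control_steps("TERMINATE"))
```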

Configuration to copy open job files to the checkpoint directory

For hosts that use the Cray operating system, LSF administrators can configure LSF at the host level to copy all open job files to the checkpoint directory every time the job is checkpointed.

Configuration file: lsb.hosts

Parameter and syntax:

HOST_NAME     CHKPNT
host_name     C

Behavior:
  • LSF copies all open job files to the checkpoint directory when a job is checkpointed