Configuration file | Parameter and syntax | Behavior |
---|---|---|
lsb.queues | CHKPNT=chkpnt_dir [chkpnt_period] | |
lsb.applications | | |
Kernel-level checkpoint and restart is enabled by default. LSF users make a job checkpointable in one of two ways: by submitting the job with bsub -k and specifying a checkpoint directory, or by submitting the job to a queue whose CHKPNT parameter defines a checkpoint directory.
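As a minimal sketch of the first submission route, the following Python helper assembles a bsub -k command line. The helper name build_bsub_argv, the directory /scratch/chkpnt, and the job command are hypothetical. The optional period mirrors the chkpnt_dir [chkpnt_period] form shown for CHKPNT, and the sketch assumes it is passed to bsub -k in the same single argument as the directory.

```python
import shlex


def build_bsub_argv(chkpnt_dir, job_command, chkpnt_period=None):
    """Return an argv list for a checkpointable job submission (illustrative only)."""
    if not chkpnt_dir:
        raise ValueError("a checkpoint directory is required")
    # Assumption: the directory and the optional period travel as ONE argument to -k.
    if chkpnt_period is None:
        k_value = chkpnt_dir
    else:
        k_value = f"{chkpnt_dir} {int(chkpnt_period)}"
    return ["bsub", "-k", k_value, *job_command]


# Hypothetical directory and job command, shown only to illustrate the shape.
argv = build_bsub_argv("/scratch/chkpnt", ["my_job", "--input", "data.in"], chkpnt_period=30)
print(shlex.join(argv))  # shell-quoted form, suitable for logging
```

Passing the list to a process launcher instead of a shell string avoids quoting problems when the checkpoint directory contains spaces.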
Have access to the command line used to submit or modify the job
Exit with a return value without running an application; the erestart interface runs the application to restart the job
Executable file | UNIX naming convention | Windows naming convention |
---|---|---|
echkpnt | LSF_SERVERDIR/echkpnt.application | LSF_SERVERDIR\echkpnt.application.exe or LSF_SERVERDIR\echkpnt.application.bat |
erestart | LSF_SERVERDIR/erestart.application | LSF_SERVERDIR\erestart.application.exe or LSF_SERVERDIR\erestart.application.bat |
The names echkpnt.default and erestart.default are reserved. Do not use these names for application-level checkpoint and restart executables.
Valid file names contain only alphanumeric characters, underscores (_), and hyphens (-).
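The naming rules above can be checked before an executable is installed. The following Python sketch is an administrator-side helper, not part of LSF. The function names are hypothetical, and it assumes the character rule applies to the application portion of the name (the part after echkpnt. or erestart.).

```python
import re

# "default" is reserved: echkpnt.default and erestart.default must not be used.
RESERVED_METHODS = {"default"}

# Alphanumeric characters, underscores, and hyphens only.
_VALID_NAME = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_method_name(application):
    """Return True if the name is allowed for an application-level method."""
    return bool(_VALID_NAME.fullmatch(application)) and application not in RESERVED_METHODS


def executable_names(application):
    """Return the UNIX-style (echkpnt, erestart) file names for a method name."""
    if not is_valid_method_name(application):
        raise ValueError(f"invalid or reserved method name: {application!r}")
    return (f"echkpnt.{application}", f"erestart.{application}")


print(executable_names("fluent"))
print(is_valid_method_name("default"), is_valid_method_name("my app"))
```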
For application-level checkpoint and restart, once LSF_SERVERDIR contains one or more checkpoint and restart executables, users can specify the external checkpoint executable to associate with each checkpointable job they submit. At restart, LSF invokes the corresponding external restart executable.
The executables must be written in C or Fortran.
The directory/name combinations must be unique within the cluster. For example, you can write two different checkpoint executables with the name echkpnt.fluent and save them as LSF_SERVERDIR/echkpnt.fluent and my_execs/echkpnt.fluent. To run checkpoint and restart executables from a directory other than LSF_SERVERDIR, you must configure the parameter LSB_ECHKPNT_METHOD_DIR in lsf.conf.
An echkpnt.application must return a value of 0 when checkpointing succeeds and a non-zero value when checkpointing fails.
A non-zero value indicates that erestart.application failed to write to the .restart_cmd file.
A return value of 0 indicates that erestart.application successfully wrote to the .restart_cmd file, or that the executable intentionally did not write to the file.
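The return-value convention can be exercised with a small test harness before an executable is deployed. The Python sketch below is a hypothetical harness, not how LSF itself invokes the executables. It runs a command and treats an exit status of 0 as success and any other status as failure. The two stand-in commands simply exit with a fixed status and take the place of a real echkpnt.application or erestart.application.

```python
import subprocess
import sys


def returned_success(argv):
    """Run a command and apply the 0 / non-zero return-value convention."""
    completed = subprocess.run(argv, check=False)
    return completed.returncode == 0


# Stand-ins for real checkpoint or restart executables (hypothetical).
succeeds = [sys.executable, "-c", "raise SystemExit(0)"]
fails = [sys.executable, "-c", "raise SystemExit(3)"]

print(returned_success(succeeds))
print(returned_success(fails))
```

Note that for erestart.application a return value of 0 alone does not show that the .restart_cmd file was written, because the executable may intentionally skip writing it.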
echkpnt [-c] [-f] [-k | -s] [-d checkpoint_dir] [-x] process_group_ID
The -k and -s options are mutually exclusive.
erestart [-c] [-f] checkpoint_dir
Option or variable | Description | Operating systems |
---|---|---|
-c | Copies all files in use by the checkpointed process to the checkpoint directory. | Some |
-f | Forces a job to be checkpointed even under non-checkpointable conditions, which are specific to the checkpoint implementation used. This option could create checkpoint files that do not provide for successful restart. | Some |
-k | Kills a job after successful checkpointing. If checkpoint fails, the job continues to run. | All operating systems that LSF supports |
-s | Stops a job after successful checkpointing. If checkpoint fails, the job continues to run. | Some |
-d checkpoint_dir | Specifies the checkpoint directory as a relative or absolute path. | All operating systems that LSF supports |
-x | Identifies the cpr (checkpoint and restart) process as type HID. This identifies the set of processes to checkpoint as a process hierarchy (tree) rooted at the current PID. | Some |
process_group_ID | ID of the process or process group to checkpoint. | All operating systems that LSF supports |
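
The echkpnt syntax and options described above, including the rule that -k and -s are mutually exclusive, can be mirrored in a small argument parser. The following Python sketch uses argparse purely to illustrate the option structure. The function name and the destination attribute names (copy_files, force, kill, stop, hid) are hypothetical, and the sketch says nothing about how the real echkpnt parses its arguments.

```python
import argparse


def build_echkpnt_parser():
    """Build a parser that mirrors the documented echkpnt option set."""
    parser = argparse.ArgumentParser(prog="echkpnt", add_help=False)
    parser.add_argument("-c", action="store_true", dest="copy_files")
    parser.add_argument("-f", action="store_true", dest="force")
    # -k and -s are mutually exclusive: kill or stop after checkpointing, not both.
    after = parser.add_mutually_exclusive_group()
    after.add_argument("-k", action="store_true", dest="kill")
    after.add_argument("-s", action="store_true", dest="stop")
    parser.add_argument("-d", dest="checkpoint_dir", metavar="checkpoint_dir")
    parser.add_argument("-x", action="store_true", dest="hid")
    parser.add_argument("process_group_id", type=int, metavar="process_group_ID")
    return parser


# Hypothetical directory and process group ID.
args = build_echkpnt_parser().parse_args(["-k", "-d", "/scratch/chkpnt", "12345"])
print(args.kill, args.stop, args.checkpoint_dir, args.process_group_id)
```

Supplying both -k and -s makes argparse report an error and exit, which matches the documented restriction.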