Configuration to modify pre- and post-execution processing

Configuration parameters modify various aspects of pre- and post-execution processing behavior by:

Preventing a new job from starting until post-execution processing has finished
Controlling the length of time post-execution processing can run
Specifying a user account under which the pre- and post-execution commands run
Controlling how many times pre-execution retries
Determining whether email with details of the post-execution output is sent to the user who submitted the job. See LSB_POSTEXEC_SEND_MAIL in the IBM Platform LSF Configuration Reference for more detail.

Some configuration parameters only apply to job-based pre- and post-execution processing and some apply to both job- and host-based pre- and post-execution processing:

Job- and host-based:

JOB_INCLUDE_POSTPROC in lsb.applications and lsb.params
MAX_PREEXEC_RETRY in lsb.applications and lsb.params
LOCAL_MAX_PREEXEC_RETRY in lsb.applications and lsb.params
LOCAL_MAX_PREEXEC_RETRY_ACTION in lsb.applications, lsb.queues, and lsb.params
REMOTE_MAX_PREEXEC_RETRY in lsb.applications and lsb.params
LSB_DISABLE_RERUN_POST_EXEC in lsf.conf
JOB_PREPROC_TIMEOUT in lsb.applications and lsb.params
JOB_POSTPROC_TIMEOUT in lsb.applications and lsb.params
LSB_PRE_POST_EXEC_USER in lsf.sudoers
LSB_POSTEXEC_SEND_MAIL in lsf.conf

Job-based only:

PREEXEC_EXCLUDE_HOST_EXIT_VALUES in lsb.params

See the IBM Platform LSF Configuration Reference for details on each parameter.

JOB_PREPROC_TIMEOUT is designed to protect the system from hanging during pre-execution processing. When LSF detects that pre-execution processing has run longer than the JOB_PREPROC_TIMEOUT value (the default is infinite), LSF terminates the pre-execution processing. The LSF administrator should therefore set JOB_PREPROC_TIMEOUT to a value longer than any pre-execution processing requires. Likewise, JOB_POSTPROC_TIMEOUT should be set to a value that gives host-based post-execution processing enough time to run.

Configuration to modify when new jobs can start

When a job finishes, sbatchd reports a job finish status of DONE or EXIT to mbatchd. This causes LSF to release resources associated with the job, allowing new jobs to start on the execution host before post-execution processing from a previous job has finished.

In some cases, you might want to prevent the overlap of a new job with post-execution processing. Preventing a new job from starting prior to completion of post-execution processing can be configured at the application level or at the job level.

At the job level, the bsub -w option allows you to specify job dependencies; the keywords post_done and post_err cause LSF to wait for completion of post-execution processing before starting another job.
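
As an illustration of the job-level approach, the following Python sketch assembles a bsub command whose dependency is satisfied once post-execution processing of an earlier job has ended, whether it succeeded (post_done) or failed (post_err). The helper names, the job ID 1234, and the script name are hypothetical, and the code only builds the argument list without contacting LSF.

```python
def post_exec_dependency(job_id):
    """Dependency expression satisfied once post-execution of job_id has ended."""
    # post_done: post-execution finished without errors; post_err: with errors.
    return f"post_done({job_id}) || post_err({job_id})"


def build_bsub_argv(job_id, command):
    """Argument list for a job that must wait on job_id's post-execution."""
    return ["bsub", "-w", post_exec_dependency(job_id), *command]


# Hypothetical job ID and script, for illustration only.
print(build_bsub_argv(1234, ["./next_step.sh"]))
```

When typing the equivalent command in a shell, quote the dependency expression so that the shell does not interpret the parentheses and the || operator.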

At the application level:

File	Parameter and syntax	Description
lsb.applications, lsb.params	JOB_INCLUDE_POSTPROC=Y	Enables completion of post-execution processing before LSF reports a job finish status of DONE or EXIT. Prevents a new job from starting on a host until post-execution processing is finished on that host.

When JOB_INCLUDE_POSTPROC=Y is set:

sbatchd sends both job finish status (DONE or EXIT) and post-execution processing status (POST_DONE or POST_ERR) to mbatchd at the same time
The job remains in the RUN state and holds its job slot until post-execution processing has finished
Job requeue happens (if required) after completion of post-execution processing, not when the job itself finishes
For job history and job accounting, the job CPU and run times include the post-execution processing CPU and run times
The job control commands bstop, bkill, and bresume have no effect during post-execution processing
If a host becomes unavailable during post-execution processing for a rerunnable job, mbatchd sees the job as still in the RUN state and reruns the job
LSF does not preempt jobs during post-execution processing
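
The difference this setting makes while post-execution processing is still running can be summarized in a few lines of Python. This is a simplified model of the behavior described in this section, not LSF code: the function name and its arguments are hypothetical, and it ignores requeue, rerun, and host failure.

```python
def visible_job_state(job_succeeded, post_exec_running, include_postproc):
    """Return (status, holds_slot) for a job whose own command has ended."""
    if post_exec_running and include_postproc:
        # JOB_INCLUDE_POSTPROC=Y: the job stays in RUN and keeps its job slot.
        return ("RUN", True)
    # Otherwise the finish status is reported as soon as the job itself ends,
    # and its resources are released even if post-execution is still running.
    return ("DONE" if job_succeeded else "EXIT", False)


print(visible_job_state(True, True, include_postproc=False))  # ('DONE', False)
print(visible_job_state(True, True, include_postproc=True))   # ('RUN', True)
```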

Configuration to modify the post-execution processing time

The length of time post-execution processing can run is configured at the application level (lsb.applications) or cluster-wide (lsb.params).

File	Parameter and syntax	Description
lsb.applications, lsb.params	JOB_POSTPROC_TIMEOUT=`minutes`	Specifies the length of time, in minutes, that post-execution processing can run. The specified value must be greater than zero. If post-execution processing takes longer than the specified value, sbatchd reports post-execution failure (a status of POST_ERR). On UNIX and Linux, it kills the entire process group of the job's post-execution processes. On Windows, only the parent process of the post-execution command is killed when the timeout expires; the child processes of the post-execution command are not killed. If JOB_INCLUDE_POSTPROC=Y and sbatchd kills the post-execution process group, post-execution processing CPU time is set to zero, and the job's CPU time does not include post-execution CPU time.
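
To make the UNIX and Linux behavior concrete, here is a minimal Python sketch of running a command under a time limit and killing its whole process group when the limit expires. It is not how sbatchd is implemented: the function name is hypothetical, the limit is given in seconds rather than minutes so the demonstration finishes quickly, and treating a non-zero exit status as POST_ERR is an assumption of this sketch. It runs only on POSIX systems.

```python
import os
import signal
import subprocess
import sys


def run_with_group_timeout(argv, timeout_seconds):
    """Run argv; on timeout, kill its whole process group and report POST_ERR."""
    # start_new_session=True (POSIX only) puts the child in a new process group
    # whose ID equals the child's PID, so killpg also reaches its descendants.
    proc = subprocess.Popen(argv, start_new_session=True)
    try:
        proc.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group exited on its own just after the timeout fired
        proc.wait()  # reap the child so it does not linger as a zombie
        return "POST_ERR"
    # Assumption: a non-zero exit status is also treated as a failure.
    return "POST_DONE" if proc.returncode == 0 else "POST_ERR"


if __name__ == "__main__":
    # A command that sleeps far longer than the half-second limit.
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_group_timeout(sleeper, 0.5))
```

Killing the group rather than the single child is what distinguishes the UNIX and Linux behavior from the Windows behavior described above, where child processes of the command survive the timeout.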

Configuration to modify the pre- and post-execution processing user account

Specifying a user account under which the pre- and post-execution commands run is configured at the system level. By default, both the pre- and post-execution commands run under the account of the user who submits the job.

File	Parameter and syntax	Description
lsf.sudoers	LSB_PRE_POST_EXEC_USER=`user_name`	Specifies the user account under which pre- and post-execution commands run (UNIX only). This parameter applies only to pre- and post-execution commands configured at the queue level; pre-execution commands defined at the application or job level run under the account of the user who submits the job. If the pre-execution or post-execution commands perform privileged operations that require root permissions on UNIX hosts, specify a value of root. You must edit the lsf.sudoers file on all UNIX hosts within the cluster and specify the same user account.

Configuration to control how many times pre-execution retries

If job pre-execution fails, LSF retries the job automatically. By default, the job remains in the queue and pre-execution is retried 5 times, to minimize any impact on performance and throughput.

Limiting the number of times LSF retries job pre-execution is configured cluster-wide (lsb.params), at the queue level (lsb.queues), and at the application level (lsb.applications). Pre-execution retry in lsb.applications overrides lsb.queues, and lsb.queues overrides lsb.params configuration.

Configuration file	Parameter and syntax	Behavior
lsb.params	LOCAL_MAX_PREEXEC_RETRY=`integer`	Controls the maximum number of times to attempt the pre-execution command of a job on the local cluster. Specify an integer greater than 0. By default, the number of retries is unlimited.
lsb.params	MAX_PREEXEC_RETRY=`integer`	Controls the maximum number of times to attempt the pre-execution command of a job on the remote cluster. Specify an integer greater than 0. By default, the number of retries is 5.
lsb.params	REMOTE_MAX_PREEXEC_RETRY=`integer`	Controls the maximum number of times to attempt the pre-execution command of a job on the remote cluster. Equivalent to MAX_PREEXEC_RETRY. Specify an integer greater than 0. By default, the number of retries is 5.
lsb.queues	LOCAL_MAX_PREEXEC_RETRY=`integer`	Controls the maximum number of times to attempt the pre-execution command of a job on the local cluster. Specify an integer greater than 0. By default, the number of retries is unlimited.
lsb.queues	MAX_PREEXEC_RETRY=`integer`	Controls the maximum number of times to attempt the pre-execution command of a job on the remote cluster. Specify an integer greater than 0. By default, the number of retries is 5.
lsb.queues	REMOTE_MAX_PREEXEC_RETRY=`integer`	Controls the maximum number of times to attempt the pre-execution command of a job on the remote cluster. Equivalent to MAX_PREEXEC_RETRY. Specify an integer greater than 0. By default, the number of retries is 5.
lsb.applications	LOCAL_MAX_PREEXEC_RETRY=`integer`	Controls the maximum number of times to attempt the pre-execution command of a job on the local cluster. Specify an integer greater than 0. By default, the number of retries is unlimited.
lsb.applications	MAX_PREEXEC_RETRY=`integer`	Controls the maximum number of times to attempt the pre-execution command of a job on the remote cluster. Specify an integer greater than 0. By default, the number of retries is 5.
lsb.applications	REMOTE_MAX_PREEXEC_RETRY=`integer`	Controls the maximum number of times to attempt the pre-execution command of a job on the remote cluster. Equivalent to MAX_PREEXEC_RETRY. Specify an integer greater than 0. By default, the number of retries is 5.
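
The override order among the three files (lsb.applications over lsb.queues over lsb.params) can be sketched in Python as a most-specific-first lookup. The function name, the dictionaries standing in for parsed configuration files, and the sample values are all hypothetical; this only models the precedence rule stated above.

```python
def effective_setting(name, params, queue, application):
    """Return the value that wins for `name`, or None if it is unset everywhere."""
    # Most specific first: lsb.applications, then lsb.queues, then lsb.params.
    for level in (application, queue, params):
        if name in level:
            return level[name]
    return None


# Illustrative values only: set cluster-wide and per queue, not per application.
params = {"LOCAL_MAX_PREEXEC_RETRY": 10}
queue = {"LOCAL_MAX_PREEXEC_RETRY": 5}
application = {}

print(effective_setting("LOCAL_MAX_PREEXEC_RETRY", params, queue, application))  # 5
```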

When pre-execution retry is configured, if a job's pre-execution fails and exits with a non-zero value, the number of pre-exec retries is set to 1. When the pre-exec retry limit is reached, the job is suspended with PSUSP status.

The number of times that pre-execution is retried includes queue-level, application-level, and job-level pre-execution command specifications. When pre-execution retry is configured, a job is suspended when either of the following sums is greater than the value of the pre-execution retry parameter: its queue-level plus application-level pre-exec retry times, or its queue-level plus job-level pre-exec retry times.
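
The suspension condition can be written out as a small Python check. The function and argument names are hypothetical, and passing None for the limit is this sketch's way of representing an unlimited number of retries.

```python
def should_suspend(queue_retries, app_retries, job_retries, retry_limit):
    """True when either retry sum exceeds the configured retry limit."""
    if retry_limit is None:
        return False  # no limit configured: retries are unlimited
    return (queue_retries + app_retries > retry_limit
            or queue_retries + job_retries > retry_limit)


print(should_suspend(2, 2, 0, retry_limit=3))  # True: 2 + 2 > 3
print(should_suspend(1, 1, 1, retry_limit=3))  # False: both sums are 2
```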

The pre-execution retry limit is recovered when LSF is restarted and reconfigured. LSF replays the pre-execution retry limit in the PRE_EXEC_START or JOB_STATUS events in lsb.events.

Configuration to define default behavior of a job after it reaches the pre-execution retry limit

By default, if LSF retries the pre-execution command of a job on the local cluster and reaches the pre-execution retry threshold (LOCAL_MAX_PREEXEC_RETRY in lsb.params, lsb.queues, or lsb.applications), LSF suspends the job.

This default behavior of a job that has reached the pre-execution retry limit is configured cluster-wide (lsb.params), at the queue level (lsb.queues), and at the application level (lsb.applications). The behavior specified in lsb.applications overrides lsb.queues, and lsb.queues overrides the lsb.params configuration.

Configuration file	Parameter and syntax	Behavior
lsb.params	LOCAL_MAX_PREEXEC_RETRY_ACTION=SUSPEND | EXIT	Specifies the default behavior of a job (on the local cluster) that has reached the maximum pre-execution retry limit. If set to SUSPEND, the job is suspended and its status is set to PSUSP. If set to EXIT, the job status is set to EXIT and the exit code is the same as the last pre-execution fail exit code. By default, the job is suspended.
lsb.queues	LOCAL_MAX_PREEXEC_RETRY_ACTION=SUSPEND | EXIT	Specifies the default behavior of a job (on the local cluster) that has reached the maximum pre-execution retry limit. If set to SUSPEND, the job is suspended and its status is set to PSUSP. If set to EXIT, the job status is set to EXIT and the exit code is the same as the last pre-execution fail exit code. By default, this is not defined.
lsb.applications	LOCAL_MAX_PREEXEC_RETRY_ACTION=SUSPEND | EXIT	Specifies the default behavior of a job (on the local cluster) that has reached the maximum pre-execution retry limit. If set to SUSPEND, the job is suspended and its status is set to PSUSP. If set to EXIT, the job status is set to EXIT and the exit code is the same as the last pre-execution fail exit code. By default, this is not defined.
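
The two possible outcomes can be summarized with a short Python sketch. The function name and return shape are hypothetical; the action argument stands for the value of LOCAL_MAX_PREEXEC_RETRY_ACTION after the override order has been applied, with None meaning it is not defined at any level.

```python
def outcome_at_retry_limit(last_preexec_exit_code, action=None):
    """Return (job_status, exit_code) once the local retry limit is reached."""
    if action is None:
        action = "SUSPEND"  # not defined at any level: the job is suspended
    if action == "SUSPEND":
        return ("PSUSP", None)
    if action == "EXIT":
        # The job's exit code mirrors the last failed pre-execution attempt.
        return ("EXIT", last_preexec_exit_code)
    raise ValueError(f"unsupported action: {action!r}")


print(outcome_at_retry_limit(3))          # ('PSUSP', None)
print(outcome_at_retry_limit(3, "EXIT"))  # ('EXIT', 3)
```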