Configuration parameters modify various aspects of pre- and post-execution processing behavior by:
- Preventing a new job from starting until post-execution processing has finished
- Controlling the length of time post-execution processing can run
- Specifying a user account under which the pre- and post-execution commands run
- Controlling how many times pre-execution retries
- Determining whether email with details of the post-execution output is sent to the user who submitted the job. See LSB_POSTEXEC_SEND_MAIL in the IBM Platform LSF Configuration Reference for more details.
Some configuration parameters apply only to job-based pre- and post-execution processing, and some apply to both job- and host-based pre- and post-execution processing:
Job- and host-based:
- JOB_INCLUDE_POSTPROC in lsb.applications and lsb.params
- MAX_PREEXEC_RETRY in lsb.applications, lsb.queues, and lsb.params
- LOCAL_MAX_PREEXEC_RETRY in lsb.applications, lsb.queues, and lsb.params
- LOCAL_MAX_PREEXEC_RETRY_ACTION in lsb.applications, lsb.queues, and lsb.params
- REMOTE_MAX_PREEXEC_RETRY in lsb.applications, lsb.queues, and lsb.params
- LSB_DISABLE_RERUN_POST_EXEC in lsf.conf
- JOB_PREPROC_TIMEOUT in lsb.applications and lsb.params
- JOB_POSTPROC_TIMEOUT in lsb.applications and lsb.params
- LSB_PRE_POST_EXEC_USER in lsf.sudoers
- LSB_POSTEXEC_SEND_MAIL in lsf.conf
Job-based only:
- PREEXEC_EXCLUDE_HOST_EXIT_VALUES in lsb.params
See the IBM Platform LSF Configuration Reference for details on each parameter.
JOB_PREPROC_TIMEOUT is designed to protect the system from hanging during pre-execution processing. When LSF detects that pre-execution processing has run longer than the JOB_PREPROC_TIMEOUT value (the default value is infinite), LSF terminates it. Therefore, the LSF administrator should ensure that JOB_PREPROC_TIMEOUT is set to a value longer than any pre-execution processing requires. JOB_POSTPROC_TIMEOUT should also be set to a value that gives host-based post-execution processing enough time to run.
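As an illustration of the two timeout parameters, a cluster-wide setting in lsb.params might look like the following fragment. The values are hypothetical and should be sized to your own pre- and post-execution commands. The fragment also assumes that JOB_PREPROC_TIMEOUT is expressed in minutes, like JOB_POSTPROC_TIMEOUT; confirm the unit in the IBM Platform LSF Configuration Reference.

```
# lsb.params (fragment; values are hypothetical examples)
Begin Parameters
# Assumed unit: minutes. Terminate pre-execution processing after 10 minutes.
JOB_PREPROC_TIMEOUT = 10
# Allow post-execution processing to run for up to 15 minutes.
JOB_POSTPROC_TIMEOUT = 15
End Parameters
```

Changes to lsb.params normally take effect after mbatchd is reconfigured, for example with badmin reconfig.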
Configuration to modify when new jobs can start
When a job finishes, sbatchd reports a job finish status of DONE or EXIT to mbatchd. This causes LSF to release resources associated with the job, allowing new jobs to start on the execution host before post-execution processing from a previous job has finished.
In some cases, you might want to prevent the overlap of a new job with post-execution processing. Preventing a new job from starting prior to completion of post-execution processing can be configured at the application level or at the job level.
At the job level, the bsub -w option allows you to specify job dependencies; the keywords post_done and post_err cause LSF to wait for completion of post-execution processing before starting another job.
At the application level:
| File | Parameter and syntax | Description |
| --- | --- | --- |
| lsb.applications, lsb.params | JOB_INCLUDE_POSTPROC=Y | Enables completion of post-execution processing before LSF reports a job finish status of DONE or EXIT. Prevents a new job from starting on a host until post-execution processing is finished on that host. |
When JOB_INCLUDE_POSTPROC=Y is defined:
- sbatchd sends both job finish status (DONE or EXIT) and post-execution processing status (POST_DONE or POST_ERR) to mbatchd at the same time
- The job remains in the RUN state and holds its job slot until post-execution processing has finished
- Job requeue happens (if required) after completion of post-execution processing, not when the job itself finishes
- For job history and job accounting, the job CPU and run times include the post-execution processing CPU and run times
- The job control commands bstop, bkill, and bresume have no effect during post-execution processing
- If a host becomes unavailable during post-execution processing for a rerunnable job, mbatchd sees the job as still in the RUN state and reruns the job
- LSF does not preempt jobs during post-execution processing
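A minimal sketch of enabling this behavior for one application profile in lsb.applications follows. The profile name postproc_app and its description are hypothetical; only the JOB_INCLUDE_POSTPROC line comes from the table above.

```
# lsb.applications (fragment; profile name is a hypothetical example)
Begin Application
NAME = postproc_app
DESCRIPTION = Holds the job slot until post-execution processing finishes
JOB_INCLUDE_POSTPROC = Y
End Application
```

Jobs submitted to this profile (for example with bsub -app postproc_app) would then stay in the RUN state until their post-execution processing completes.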
Configuration to modify the post-execution processing time
Controlling the length of time post-execution processing can run is configured at the application level (lsb.applications) or cluster-wide (lsb.params).
| File | Parameter and syntax | Description |
| --- | --- | --- |
| lsb.applications, lsb.params | JOB_POSTPROC_TIMEOUT=minutes | Specifies the length of time, in minutes, that post-execution processing can run. The specified value must be greater than zero. If post-execution processing takes longer than the specified value, sbatchd reports post-execution failure (a status of POST_ERR). On UNIX and Linux, it kills the entire process group of the job's post-execution processes. On Windows, only the parent process of the post-execution command is killed when the timeout expires; the child processes of the post-execution command are not killed. If JOB_INCLUDE_POSTPROC=Y and sbatchd kills the post-execution process group, post-execution processing CPU time is set to zero, and the job's CPU time does not include post-execution CPU time. |
Configuration to modify the pre- and post-execution processing user account
Specifying a user account under which the pre- and post-execution commands run is configured at the system level. By default, both the pre- and post-execution commands run under the account of the user who submits the job.
| File | Parameter and syntax | Description |
| --- | --- | --- |
| lsf.sudoers | LSB_PRE_POST_EXEC_USER=user_name | Specifies the user account under which pre- and post-execution commands run (UNIX only). This parameter applies only to pre- and post-execution commands configured at the queue level; pre-execution commands defined at the application or job level run under the account of the user who submits the job. If the pre-execution or post-execution commands perform privileged operations that require root permissions on UNIX hosts, specify a value of root. You must edit the lsf.sudoers file on all UNIX hosts within the cluster and specify the same user account. |
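For the privileged case described above, the lsf.sudoers entry could be as small as the following fragment. It assumes the usual NAME=VALUE form of lsf.sudoers with no spaces around the equal sign; the same line would need to appear on every UNIX host in the cluster.

```
# lsf.sudoers (fragment)
# Run queue-level pre- and post-execution commands as root (UNIX only).
LSB_PRE_POST_EXEC_USER=root
```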
Configuration to control how many times pre-execution retries
If job pre-execution fails, LSF retries the job automatically. The job remains in the queue and pre-execution is retried 5 times by default, to minimize any impact on performance and throughput.
Limiting the number of times LSF retries job pre-execution is configured cluster-wide (lsb.params), at the queue level (lsb.queues), and at the application level (lsb.applications). Pre-execution retry in lsb.applications overrides lsb.queues, and lsb.queues overrides lsb.params configuration.
| Configuration file | Parameter and syntax | Behavior |
| --- | --- | --- |
| lsb.params | LOCAL_MAX_PREEXEC_RETRY=integer | Controls the maximum number of times to attempt the pre-execution command of a job on the local cluster. Specify an integer greater than 0. By default, the number of retries is unlimited. |
| lsb.params | MAX_PREEXEC_RETRY=integer | Controls the maximum number of times to attempt the pre-execution command of a job on the remote cluster. Specify an integer greater than 0. By default, the number of retries is 5. |
| lsb.params | REMOTE_MAX_PREEXEC_RETRY=integer | Controls the maximum number of times to attempt the pre-execution command of a job on the remote cluster. Equivalent to MAX_PREEXEC_RETRY. Specify an integer greater than 0. By default, the number of retries is 5. |
| lsb.queues | LOCAL_MAX_PREEXEC_RETRY=integer | Controls the maximum number of times to attempt the pre-execution command of a job on the local cluster. Specify an integer greater than 0. By default, the number of retries is unlimited. |
| lsb.queues | MAX_PREEXEC_RETRY=integer | Controls the maximum number of times to attempt the pre-execution command of a job on the remote cluster. Specify an integer greater than 0. By default, the number of retries is 5. |
| lsb.queues | REMOTE_MAX_PREEXEC_RETRY=integer | Controls the maximum number of times to attempt the pre-execution command of a job on the remote cluster. Equivalent to MAX_PREEXEC_RETRY. Specify an integer greater than 0. By default, the number of retries is 5. |
| lsb.applications | LOCAL_MAX_PREEXEC_RETRY=integer | Controls the maximum number of times to attempt the pre-execution command of a job on the local cluster. Specify an integer greater than 0. By default, the number of retries is unlimited. |
| lsb.applications | MAX_PREEXEC_RETRY=integer | Controls the maximum number of times to attempt the pre-execution command of a job on the remote cluster. Specify an integer greater than 0. By default, the number of retries is 5. |
| lsb.applications | REMOTE_MAX_PREEXEC_RETRY=integer | Controls the maximum number of times to attempt the pre-execution command of a job on the remote cluster. Equivalent to MAX_PREEXEC_RETRY. Specify an integer greater than 0. By default, the number of retries is 5. |
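The override order between the three files can be illustrated with one fragment per file. The queue name, application profile name, and retry values below are hypothetical. Following the precedence described above, a job that uses the fragile_setup profile would be limited by the application-level value (3), a job in the night queue without that profile by the queue-level value (5), and any other job by the cluster-wide value (10).

```
# lsb.params (fragment): cluster-wide default
Begin Parameters
LOCAL_MAX_PREEXEC_RETRY = 10
End Parameters

# lsb.queues (fragment): overrides lsb.params for this queue
Begin Queue
QUEUE_NAME = night
LOCAL_MAX_PREEXEC_RETRY = 5
End Queue

# lsb.applications (fragment): overrides lsb.queues for this profile
Begin Application
NAME = fragile_setup
LOCAL_MAX_PREEXEC_RETRY = 3
End Application
```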
When pre-execution retry is configured, if a job's pre-execution fails and exits with a non-zero value, the number of pre-exec retries is set to 1. When the pre-exec retry limit is reached, the job is suspended with PSUSP status.
The number of times that pre-execution is retried includes queue-level, application-level, and job-level pre-execution command specifications. When pre-execution retry is configured, a job is suspended when the sum of its queue-level and application-level pre-exec retry times is greater than the value of the pre-execution retry parameter, or when the sum of its queue-level and job-level pre-exec retry times is greater than the value of the pre-execution retry parameter.
The pre-execution retry limit is recovered when LSF is restarted and reconfigured. LSF replays the pre-execution retry limit in the PRE_EXEC_START or JOB_STATUS events in lsb.events.
Configuration to define default behavior of a job after it reaches the pre-execution retry limit
By default, if LSF retries the pre-execution command of a job on the local cluster and reaches the pre-execution retry threshold (LOCAL_MAX_PREEXEC_RETRY in lsb.params, lsb.queues, or lsb.applications), LSF suspends the job.
This default behavior of a job that has reached the pre-execution retry limit is configured cluster-wide (lsb.params), at the queue level (lsb.queues), and at the application level (lsb.applications). The behavior specified in lsb.applications overrides lsb.queues, and lsb.queues overrides the lsb.params configuration.
Configuration file: lsb.params
Parameter and syntax: LOCAL_MAX_PREEXEC_RETRY_ACTION = SUSPEND | EXIT
Behavior:
- Specifies the default behavior of a job (on the local cluster) that has reached the maximum pre-execution retry limit.
- If set to SUSPEND, the job is suspended and its status is set to PSUSP.
- If set to EXIT, the job status is set to EXIT and the exit code is the same as the last pre-execution fail exit code.
- By default, the job is suspended.
Configuration file: lsb.queues
Parameter and syntax: LOCAL_MAX_PREEXEC_RETRY_ACTION = SUSPEND | EXIT
Behavior:
- Specifies the default behavior of a job (on the local cluster) that has reached the maximum pre-execution retry limit.
- If set to SUSPEND, the job is suspended and its status is set to PSUSP.
- If set to EXIT, the job status is set to EXIT and the exit code is the same as the last pre-execution fail exit code.
- By default, this is not defined.
Configuration file: lsb.applications
Parameter and syntax: LOCAL_MAX_PREEXEC_RETRY_ACTION = SUSPEND | EXIT
Behavior:
- Specifies the default behavior of a job (on the local cluster) that has reached the maximum pre-execution retry limit.
- If set to SUSPEND, the job is suspended and its status is set to PSUSP.
- If set to EXIT, the job status is set to EXIT and the exit code is the same as the last pre-execution fail exit code.
- By default, this is not defined.
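As a sketch of combining the retry limit with a non-default action, the following lsb.queues fragment makes jobs in one queue exit rather than suspend once the local limit is reached. The queue name and the limit of 3 are hypothetical.

```
# lsb.queues (fragment; queue name and limit are hypothetical examples)
Begin Queue
QUEUE_NAME = short
LOCAL_MAX_PREEXEC_RETRY = 3
# Exit instead of suspending once the local retry limit is reached.
LOCAL_MAX_PREEXEC_RETRY_ACTION = EXIT
End Queue
```

With EXIT, the job ends with the exit code of the last failed pre-execution attempt, so a submitting script can detect the failure instead of finding the job parked in PSUSP.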
