Before you begin
You must know which exit values your pre-execution
script returns to indicate failure.
About this task
Any non-zero exit code from a pre-execution script indicates
a failure. For jobs that are designated as rerunnable on failure,
LSF checks the exit value of the failed pre-execution script to
determine whether to exclude the host where the script failed.
An excluded host is no longer a candidate to run the job.
Procedure
- Create a pre-execution script that exits with a specific
value if it is unsuccessful.
Example:
#!/bin/sh
# Usually, when pre_exec fails for a host-specific reason, such as
# /tmp being full, we should exit immediately to let LSF
# re-dispatch the job to a different host.
# For example:
# define PREEXEC_EXCLUDE_HOST_EXIT_VALUES = 10 in lsb.params
# exit 10 when pre_exec detects that /tmp is full.
# LSF will re-dispatch this job to a different host under
# such conditions.
DISC=/tmp
# In "df -P" output, column 5 is the percentage of space used
# and column 6 is the mount point.
USED=`df -P | awk -v m="$DISC" '$6 == m {print $5}' | awk -F% '{print $1}'`
if [ "${USED}" != "" ]
then
    FREE=`expr 100 - $USED`
    echo "$FREE"
    if [ "${FREE}" -le "2" ] # When only 2% or less of the space on
                             # /tmp is free on this host, we can let LSF
                             # re-dispatch the job to a different host.
    then
        exit 10
    fi
fi
# Sometimes, when pre_exec fails because the NFS server is busy,
# it can succeed if we retry it several times. Retrying within
# this script affects LSF performance less.
RETRY=10
while [ $RETRY -gt 0 ]
do
    # Replace the commented line with the command to retry;
    # EXIT must capture that command's exit status.
    #mount host_name:/export/home/bill /home/bill
    EXIT=$?
    if [ $EXIT -eq 0 ]
    then
        RETRY=0
    else
        RETRY=`expr $RETRY - 1`
        if [ $RETRY -eq 0 ]
        then
            exit 99 # We have tried 10 times.
                    # Something is wrong with the NFS server. We need
                    # to fail the job and fix the NFS problem first,
                    # then submit the job again after the problem
                    # is resolved.
        fi
    fi
done
- In lsb.params, use PREEXEC_EXCLUDE_HOST_EXIT_VALUES to set the
exit values that indicate the pre-execution script failed to run.
Values from 1 to 255 are allowed, except 99 (reserved value).
Separate multiple values with a space.
For the example script above,
set PREEXEC_EXCLUDE_HOST_EXIT_VALUES=10.
- (Optional) Define MAX_PREEXEC_RETRY to
limit the total number of times LSF retries the pre-execution script
on hosts.
- Run badmin reconfig.
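The configuration steps above can be sketched as follows. This is a minimal illustration, not an LSF tool: it writes a sample Parameters fragment to a temporary file (not the real lsb.params) and uses two hypothetical helpers, get_param and excludes_host, to confirm which exit values would exclude a host. The MAX_PREEXEC_RETRY value of 5 is an assumed figure chosen only for the example.

```sh
#!/bin/sh
# Sample fragment written to a temporary file; the real file is lsb.params.
SAMPLE=$(mktemp)
trap 'rm -f "$SAMPLE"' EXIT
cat > "$SAMPLE" <<'EOF'
Begin Parameters
PREEXEC_EXCLUDE_HOST_EXIT_VALUES = 10
MAX_PREEXEC_RETRY = 5
End Parameters
EOF

# get_param FILE NAME: print the value assigned to NAME (hypothetical helper).
get_param() {
    awk -F= -v name="$2" '
        /^[ \t]*#/ { next }              # skip comment lines
        {
            key = $1
            gsub(/[ \t]/, "", key)       # strip whitespace around the name
            if (key == name) {
                val = $2
                sub(/#.*/, "", val)      # drop a trailing comment
                gsub(/^[ \t]+|[ \t]+$/, "", val)
                print val
            }
        }' "$1"
}

# excludes_host FILE CODE: succeed if CODE is one of the configured values.
excludes_host() {
    for v in $(get_param "$1" PREEXEC_EXCLUDE_HOST_EXIT_VALUES); do
        [ "$v" -eq "$2" ] && return 0
    done
    return 1
}

excludes_host "$SAMPLE" 10 && echo "exit 10 excludes the host"
excludes_host "$SAMPLE" 99 || echo "exit 99 does not exclude the host"
```

Because the values are space-separated, the same loop handles a setting that lists several exit values.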
Results
If a pre-execution script exits with value 10 (as in
the example above), LSF adds the host to an exclusion list and
attempts to reschedule the job on another host.
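Before relying on this behavior, it can help to check the disk-space decision outside LSF. The sketch below feeds canned df -P output (made-up numbers) to two hypothetical helpers, free_pct and preexec_status, and assumes that column 5 of df -P is the percentage of space used and column 6 is the mount point.

```sh
#!/bin/sh
# free_pct MOUNT: read "df -P" output on stdin and print the percentage
# of free space for the file system mounted at MOUNT.
free_pct() {
    awk -v m="$1" 'NR > 1 && $6 == m { sub(/%/, "", $5); print 100 - $5 }'
}

# preexec_status FREE: print the exit value the pre-execution check would
# use: 10 when 2% or less is free, otherwise 0.
preexec_status() {
    if [ -n "$1" ] && [ "$1" -le 2 ]; then
        echo 10
    else
        echo 0
    fi
}

# Canned "df -P" output with made-up numbers, so the check is reproducible.
SAMPLE_DF='Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sda1 1000 990 10 99% /tmp
/dev/sda2 1000 400 600 40% /home'

FREE=$(printf '%s\n' "$SAMPLE_DF" | free_pct /tmp)
echo "free on /tmp: ${FREE}%"
echo "pre-execution exit value: $(preexec_status "$FREE")"
```

On a real host, replace the canned text with the output of df -P to see which exit value the check would produce there.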