Set host exclusion based on job-based pre-execution scripts

Before you begin

You must know which exit values your pre-execution script uses to indicate failure.

About this task

Any non-zero exit code from a pre-execution script indicates a failure. For jobs that are designated as rerunnable on failure, LSF checks the exit value of the failed pre-execution script to decide whether to exclude the host where the script failed. An excluded host is no longer a candidate to run the job.

Procedure

  1. Create a pre-execution script that exits with a specific value if it is unsuccessful.
    Example:
    #!/bin/sh
    
    # When the pre-execution script fails for a host-specific reason,
    # such as /tmp being full, exit immediately so that LSF can
    # re-dispatch the job to a different host.
    # For example:
    #     define PREEXEC_EXCLUDE_HOST_EXIT_VALUES = 10 in lsb.params
    #     exit 10 when the script detects that /tmp is full.
    # LSF then re-dispatches the job to a different host.
    DISC=/tmp
    # Column 5 of "df -P" is the used capacity as a percentage;
    # subtract it from 100 to get the free percentage.
    USED=`df -P "$DISC" | awk 'NR==2 {sub(/%/, "", $5); print $5}'`
    
    if [ "${USED}" != "" ]
    then
        FREE=`expr 100 - "$USED"`
        echo "$FREE"
        # When only 2% or less of /tmp is free on this host,
        # let LSF re-dispatch the job to a different host.
        if [ "${FREE}" -le 2 ]
        then
            exit 10
        fi
    fi
    
    # When the pre-execution script fails for a transient reason,
    # such as a busy NFS server, retrying inside the script can
    # succeed and costs LSF less than re-dispatching the job.
    RETRY=10
    while [ $RETRY -gt 0 ]
    do
        # Uncomment and adapt the command to retry, for example:
        #mount host_name:/export/home/bill /home/bill
        EXIT=$?
        if [ $EXIT -eq 0 ]
        then
            RETRY=0
        else
            RETRY=`expr $RETRY - 1`
            if [ $RETRY -eq 0 ]
            then
                exit 99 # All 10 attempts failed.
                        # Something is wrong with the NFS server. Fail
                        # the job, fix the NFS problem, and then
                        # submit the job again.
            fi
        fi
    done
  2. In lsb.params, use PREEXEC_EXCLUDE_HOST_EXIT_VALUES to set the exit values that indicate the pre-execution script failed to run.

    Values from 1 to 255 are allowed, except 99, which is a reserved value. Separate multiple values with spaces.

    For the example script above, set PREEXEC_EXCLUDE_HOST_EXIT_VALUES=10.

  3. (Optional) Define MAX_PREEXEC_RETRY to limit the total number of times LSF retries the pre-execution script on hosts.
  4. Run badmin reconfig.
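
The value rules in step 2 can be checked before you edit lsb.params. The following is a minimal sketch, not part of LSF: validate_exclude_values is a hypothetical helper name, and it assumes only the rules stated above (whole numbers from 1 to 255, 99 not allowed, values separated by spaces).

```shell
#!/bin/sh
# Hypothetical helper (not an LSF command): check candidate exit values
# against the rules in step 2 and print the resulting parameter line.
validate_exclude_values() {
    if [ $# -eq 0 ]; then
        echo "no values given" >&2
        return 1
    fi
    for v in "$@"; do
        # Reject empty strings and anything that is not all digits.
        case "$v" in
            ''|*[!0-9]*)
                echo "invalid (not a number): $v" >&2
                return 1 ;;
        esac
        # Allowed range is 1-255; 99 is reserved.
        if [ "$v" -lt 1 ] || [ "$v" -gt 255 ] || [ "$v" -eq 99 ]; then
            echo "invalid (must be 1-255, not 99): $v" >&2
            return 1
        fi
    done
    # "$*" joins the values with single spaces.
    echo "PREEXEC_EXCLUDE_HOST_EXIT_VALUES=$*"
}
```

For example, validate_exclude_values 10 12 prints a line that can be placed in the Parameters section of lsb.params, while validate_exclude_values 99 reports an error and returns a non-zero status.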

Results

With the example configuration, if the pre-execution script exits with value 10, LSF adds the host where the script ran to an exclusion list and attempts to reschedule the job on another host.
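
Before deploying a pre-execution script, it can help to confirm locally which exit value it returns on a given host. The sketch below only mimics the comparison described above and does not call LSF. The function name check_preexec and its two arguments (the script path and a space-separated value list) are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical local check (not an LSF command): run a pre-execution
# script and report whether its exit value is in the exclusion list.
check_preexec() {
    script=$1
    exclude_values=$2   # space-separated, as in lsb.params

    sh "$script" >/dev/null 2>&1
    rc=$?

    if [ "$rc" -eq 0 ]; then
        echo "exit 0: pre-execution succeeded"
        return 0
    fi
    # Unquoted on purpose: split the list on whitespace.
    for v in $exclude_values; do
        if [ "$v" -eq "$rc" ]; then
            echo "exit $rc: host would be excluded"
            return 0
        fi
    done
    echo "exit $rc: failure, but not an exclusion value"
}
```

For example, check_preexec ./preexec.sh "10" reports whether the script at the hypothetical path ./preexec.sh currently exits with a value that matches PREEXEC_EXCLUDE_HOST_EXIT_VALUES=10.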