Improve the speed of host status updates

LSF improves the speed of host status updates as follows:

  • Fast host status discovery after cluster startup
  • Multi-threaded UDP communications
  • Fast response to static or dynamic host status change
  • Simultaneous acceptance of new host registrations

LSF includes the following performance enhancements to achieve this improvement in speed:
  • LSB_SYNC_HOST_STAT_LIM (in lsb.params) is now enabled by default (it was previously disabled by default), so you no longer need to set it in the configuration file. This parameter improves the speed with which mbatchd obtains host status, and therefore the speed with which LSF reschedules rerunnable jobs: the sooner LSF knows that a host has become unavailable, the sooner it reschedules any rerunnable jobs executing on that host. For example, during maintenance operations, the cluster administrator might need to shut down half of the hosts at once. LSF can then quickly update the host status and reschedule any rerunnable jobs that were running on the unavailable hosts.
    Note: If you previously specified LSB_SYNC_HOST_STAT_LIM=N (to disable this parameter), change the parameter value to Y to improve performance.
  • The default setting for LSB_MAX_PROBE_SBD (in lsf.conf) has been increased from 2 to 20. This parameter specifies the maximum number of sbatchd instances that mbatchd polls in each interval of MBD_SLEEP_TIME/10. Use this parameter in large clusters to reduce the time it takes for mbatchd to probe all sbatchd instances.
    Note: If you previously specified a value for LSB_MAX_PROBE_SBD that is less than 20, remove your custom definition to use the default value of 20.
  • MAX_SBD_FAIL (in lsb.params) sets the maximum number of retries for reaching a non-responding slave batch daemon (sbatchd). If mbatchd fails to reach a host after the defined number of retries, the host is considered unavailable or unreachable.
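
The following fragment is a minimal sketch of how the three parameters described above might appear in their configuration files. The values are illustrative assumptions: Y and 20 are the defaults named above and normally do not need to be set explicitly, and the MAX_SBD_FAIL value of 3 is only an example, not a recommendation for any particular cluster.

```
# lsb.params (illustrative fragment)
Begin Parameters
# Enabled by default; shown here only for clarity.
LSB_SYNC_HOST_STAT_LIM = Y
# Example value: treat a host as unavailable or unreachable after
# this many failed retries to reach its sbatchd.
MAX_SBD_FAIL = 3
End Parameters

# lsf.conf (illustrative fragment)
# Default is 20; an explicit line like this is needed only for a
# different value.
LSB_MAX_PROBE_SBD=20
```

Changes to these files generally take effect only after the batch system rereads its configuration (for example, with badmin reconfig for lsb.params); check the parameter reference for your LSF version for the exact procedure.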