LSF improves the speed of host status updates as follows:
- Fast host status discovery after cluster startup
- Multi-threaded UDP communications
- Fast response to static or dynamic host status change
- Simultaneous acceptance of new host registrations
LSF features the following performance enhancements to achieve this
improvement in speed:
- LSB_SYNC_HOST_STAT_LIM (in lsb.params)
is now enabled by default (previously, this was disabled by default),
so there is no need to configure it in the configuration file. This
parameter improves the speed with which mbatchd obtains
host status, and therefore the speed with which LSF reschedules rerunnable
jobs: the sooner LSF knows that a host has become unavailable, the
sooner LSF reschedules any rerunnable jobs executing on that host.
For example, during maintenance operations, the cluster administrator
might need to shut down half of the hosts at once. LSF can quickly
update the host status and reschedule any rerunnable jobs that were
running on the unavailable hosts.
Note: If you previously specified LSB_SYNC_HOST_STAT_LIM=N (to disable
this parameter), change the parameter value to Y to improve performance.
- The default setting for LSB_MAX_PROBE_SBD
(in lsf.conf) was increased from 2 to 20. This
parameter specifies the maximum number of sbatchd instances
polled by mbatchd in the interval MBD_SLEEP_TIME/10.
Use this parameter in large clusters to reduce the time it takes for mbatchd to
probe all sbatchds.
Note: If you previously specified a value for LSB_MAX_PROBE_SBD that is
less than 20, remove your custom definition to use the default value of 20.
- MAX_SBD_FAIL (in lsb.params) sets the maximum number of retries
for reaching a non-responding slave batch daemon (sbatchd). If mbatchd
fails to reach a host after the defined number of tries, the host is
considered unavailable or unreachable.
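
To illustrate why raising LSB_MAX_PROBE_SBD from 2 to 20 matters in large clusters, the following Python sketch estimates how long one full probing pass might take. It is a simplified model, not LSF's actual scheduling logic: it assumes mbatchd polls exactly LSB_MAX_PROBE_SBD sbatchd instances in every MBD_SLEEP_TIME/10 interval and that nothing else delays the pass. The function name, the host count of 5000, and the MBD_SLEEP_TIME value of 10 seconds are hypothetical values chosen for the example.

```python
import math


def estimated_sweep_seconds(num_hosts: int, max_probe_sbd: int,
                            mbd_sleep_time: float) -> float:
    """Rough estimate of the time needed to probe every sbatchd once.

    Simplifying assumption (not taken from LSF itself): mbatchd polls
    exactly max_probe_sbd hosts per interval of mbd_sleep_time / 10.
    """
    if num_hosts < 0 or max_probe_sbd <= 0 or mbd_sleep_time <= 0:
        raise ValueError("expected num_hosts >= 0 and positive settings")
    interval = mbd_sleep_time / 10                    # seconds per polling interval
    intervals_needed = math.ceil(num_hosts / max_probe_sbd)
    return intervals_needed * interval


if __name__ == "__main__":
    hosts = 5000           # hypothetical cluster size
    sleep_time = 10        # hypothetical MBD_SLEEP_TIME, in seconds
    for limit in (2, 20):  # previous and current LSB_MAX_PROBE_SBD defaults
        seconds = estimated_sweep_seconds(hosts, limit, sleep_time)
        print(f"LSB_MAX_PROBE_SBD={limit}: about {seconds:.0f} s per full pass")
```

Under these assumptions, the estimate for a full pass drops by a factor of ten when the limit goes from 2 to 20, which is why a custom value below 20 is worth removing.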