Control daemons

Permissions required

To control all daemons in the cluster, you must:

  • Be logged on as root or as a user listed in the /etc/lsf.sudoers file. See the LSF Configuration Reference for configuration details of lsf.sudoers.

  • Be able to run the rsh or ssh commands across all LSF hosts without having to enter a password. See your operating system documentation for information about configuring the rsh and ssh commands. The shell command specified by LSF_RSH in lsf.conf is used before rsh is tried.
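
Before running cluster-wide commands, the passwordless requirement can be checked by attempting a non-interactive connection to each host. The following is a minimal sketch, assuming OpenSSH (for the BatchMode and ConnectTimeout options). The function name check_passwordless and the REMOTE_SHELL variable are hypothetical names introduced for illustration and are not LSF parameters; the host names in the usage line are placeholders.

```shell
#!/bin/sh
# Sketch: report hosts that cannot be reached without a password prompt.
# REMOTE_SHELL is a hypothetical override (not an LSF parameter) so the
# check can be pointed at a different remote shell command.
REMOTE_SHELL="${REMOTE_SHELL:-ssh -o BatchMode=yes -o ConnectTimeout=5}"

check_passwordless() {
    failed=0
    for host in "$@"; do
        # BatchMode=yes makes OpenSSH fail instead of prompting for a password.
        if $REMOTE_SHELL "$host" true </dev/null >/dev/null 2>&1; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    # Non-zero when at least one host could not be reached.
    return "$failed"
}
```

For example, `check_passwordless hostA hostB` prints one line per host and returns a non-zero status if any host fails.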

Daemon commands

The following is an overview of commands you use to control LSF daemons.

Daemon            Action                  Command                                   Permissions
----------------  ----------------------  ----------------------------------------  ------------------------------------
All in cluster    Start                   lsfstartup                                root or a user listed in lsf.sudoers
                  Shut down               lsfshutdown                               root or a user listed in lsf.sudoers

sbatchd           Start                   badmin hstartup [host_name ...|all]       root or a user listed in lsf.sudoers
                  Restart                 badmin hrestart [host_name ...|all]       root or the LSF administrator
                  Shut down               badmin hshutdown [host_name ...|all]      root or the LSF administrator

mbatchd, mbschd   Restart                 badmin mbdrestart                         root or the LSF administrator
                  Shut down               1. badmin hshutdown                       root or the LSF administrator
                                          2. badmin mbdrestart
                  Reconfigure             badmin reconfig                           root or the LSF administrator

RES               Start                   lsadmin resstartup [host_name ...|all]    root or a user listed in lsf.sudoers
                  Shut down               lsadmin resshutdown [host_name ...|all]   the LSF administrator
                  Restart                 lsadmin resrestart [host_name ...|all]    the LSF administrator

LIM               Start                   lsadmin limstartup [host_name ...|all]    root or a user listed in lsf.sudoers
                  Shut down               lsadmin limshutdown [host_name ...|all]   the LSF administrator
                  Restart                 lsadmin limrestart [host_name ...|all]    the LSF administrator
                  Restart all in cluster  lsadmin reconfig                          the LSF administrator
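
The daemon, action, and command combinations listed in this overview can also be written as a small lookup, which may be convenient in site scripts. This is only a sketch: lsf_daemon_cmd and its keyword arguments (cluster, sbatchd, mbatchd, res, lim; start, restart, shutdown, reconfig, restart-all) are hypothetical names chosen for illustration. The function prints the command line rather than running it.

```shell
#!/bin/sh
# Sketch: print the control command for a daemon/action pair, following the
# overview of daemon commands. lsf_daemon_cmd is a hypothetical helper, not
# an LSF command; it only prints the command line and never runs it.
lsf_daemon_cmd() {
    if [ "$#" -lt 2 ]; then
        echo "usage: lsf_daemon_cmd daemon action [host_name ...|all]" >&2
        return 1
    fi
    daemon=$1; action=$2; shift 2
    hosts=$*    # optional host names, or "all"
    case "$daemon:$action" in
        cluster:start)    echo "lsfstartup" ;;
        cluster:shutdown) echo "lsfshutdown" ;;
        sbatchd:start)    echo "badmin hstartup${hosts:+ $hosts}" ;;
        sbatchd:restart)  echo "badmin hrestart${hosts:+ $hosts}" ;;
        sbatchd:shutdown) echo "badmin hshutdown${hosts:+ $hosts}" ;;
        mbatchd:restart)  echo "badmin mbdrestart" ;;
        mbatchd:reconfig) echo "badmin reconfig" ;;
        # Shutting down mbatchd is the two-step sequence from the overview.
        mbatchd:shutdown) echo "badmin hshutdown"; echo "badmin mbdrestart" ;;
        res:start)        echo "lsadmin resstartup${hosts:+ $hosts}" ;;
        res:shutdown)     echo "lsadmin resshutdown${hosts:+ $hosts}" ;;
        res:restart)      echo "lsadmin resrestart${hosts:+ $hosts}" ;;
        lim:start)        echo "lsadmin limstartup${hosts:+ $hosts}" ;;
        lim:shutdown)     echo "lsadmin limshutdown${hosts:+ $hosts}" ;;
        lim:restart)      echo "lsadmin limrestart${hosts:+ $hosts}" ;;
        lim:restart-all)  echo "lsadmin reconfig" ;;
        *) echo "unknown daemon/action: $daemon $action" >&2; return 1 ;;
    esac
}
```

For example, `lsf_daemon_cmd sbatchd restart hostA hostB` prints `badmin hrestart hostA hostB`, where hostA and hostB are placeholder host names.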

sbatchd

Restarting sbatchd on a host does not affect jobs that are running on that host.

If sbatchd is shut down, the host is not available to run new jobs. Existing jobs running on that host continue, but the results are not sent to the user until sbatchd is restarted.

LIM and RES

Jobs running on the host are not affected by restarting the daemons.

If a daemon is not responding to network connections, lsadmin displays an error message with the host name. In this case, you must kill and restart the daemon manually.

If the LIM and the other daemons on the current master host shut down, another host automatically takes over as master.

If the RES is shut down while remote interactive tasks are running on the host, the running tasks continue but no new tasks are accepted.

LSF daemons / binaries protected from OS OOM Killer

On systems that support the out-of-memory (OOM) killer, the following LSF daemons are protected from being killed by it:

  • root RES
  • root LIM
  • root SBATCHD
  • pim
  • melim
  • mbatchd
  • rla
  • mbschd
  • krbrenewd
  • elim
  • lim -2(root)
  • mbatchd -2(root)

For these daemons, oom_adj is automatically set to -17, or oom_score_adj is set to -1000, when the daemon starts or restarts. This ensures that the LSF daemons survive the OOM killer, while user jobs are not protected and can still be killed.
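
On Linux, these values can be read back from the proc file system to confirm that a running daemon is protected. The following is a minimal sketch, assuming the standard /proc/PID/oom_score_adj and /proc/PID/oom_adj files. The function name is_oom_protected and the PROC_ROOT variable are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Sketch: check whether a process carries the OOM protection values
# described above. PROC_ROOT is a hypothetical override (default /proc)
# so the function can be pointed at another directory tree.
PROC_ROOT="${PROC_ROOT:-/proc}"

is_oom_protected() {
    pid=$1
    if [ -r "$PROC_ROOT/$pid/oom_score_adj" ]; then
        # oom_score_adj of -1000 exempts the process from the OOM killer.
        [ "$(cat "$PROC_ROOT/$pid/oom_score_adj")" = "-1000" ]
    elif [ -r "$PROC_ROOT/$pid/oom_adj" ]; then
        # Older interface: oom_adj of -17 exempts the process.
        [ "$(cat "$PROC_ROOT/$pid/oom_adj")" = "-17" ]
    else
        return 2    # process not found, or values not readable
    fi
}
```

The function returns 0 when the process is protected, 1 when it is not, and 2 when neither file can be read. It takes the process ID of the daemon to check as its only argument.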

When a daemon sets oom_adj or oom_score_adj, it logs a message at the DEBUG level: “Set oom_adj to -17.” or “Set oom_score_adj to -1000.”

The root RES, root LIM, root sbatchd, pim, melim, and mbatchd processes set these values themselves and log the corresponding message.

For these messages to appear in the logs, LSF_LOG_MASK must be set to LOG_DEBUG.

In addition, the following must be set for each daemon:
  • For res, set LSF_DEBUG_RES="LC_TRACE"
  • For lim, set LSF_DEBUG_LIM="LC_TRACE". When EGO is enabled, you must also set EGO_LOG_MASK=LOG_DEBUG in ego.conf.
  • For sbatchd, set LSB_DEBUG_SBD="LC_TRACE"
  • For pim, set LSF_DEBUG_PIM="LC_TRACE"
  • For mbatchd, set LSB_DEBUG_MBD="LC_TRACE"
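
Taken together, the logging settings in this section might look like the following fragment. This is a sketch under one assumption: that all of the LSF_ and LSB_ parameters are placed in lsf.conf. This section names a file only for EGO_LOG_MASK (ego.conf), so check the LSF Configuration Reference for where the other parameters belong.

```shell
# lsf.conf (sketch): make the OOM protection messages visible in the logs.
# Assumption: all LSF_* and LSB_* parameters below belong in lsf.conf.
LSF_LOG_MASK=LOG_DEBUG
LSF_DEBUG_RES="LC_TRACE"
LSF_DEBUG_LIM="LC_TRACE"
LSB_DEBUG_SBD="LC_TRACE"
LSF_DEBUG_PIM="LC_TRACE"
LSB_DEBUG_MBD="LC_TRACE"

# ego.conf, only when EGO is enabled:
# EGO_LOG_MASK=LOG_DEBUG
```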