Common LSF problems

About this task

Most problems are due to incorrect installation or configuration.

Procedure

Check the error log files first.

Often the log message points directly to the problem.

LIM dies quietly

Procedure

Run the following command to check for errors in the LIM configuration files.

lsadmin ckconfig -v

This displays most configuration errors. If this does not report any errors, check in the LIM error log.

LIM unavailable

About this task

Sometimes the LIM is up, but executing the lsload command prints the following error message:

Communication time out.

If the LIM has just been started, this is normal, because the LIM needs time to get initialized by reading configuration files and contacting other LIMs. If the LIM does not become available within one or two minutes, check the LIM error log for the host you are working on.

To prevent communication timeouts when starting or restarting the local LIM, define the parameter LSF_SERVER_HOSTS in the lsf.conf file. The client will contact the LIM on one of the LSF_SERVER_HOSTS and execute the command, provided that at least one of the hosts defined in the list has a LIM that is up and running.
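As a quick sanity check, the host list can be read back from lsf.conf. The following is a minimal sketch, not an LSF tool: the function name read_host_list and the sample host names are illustrative assumptions, and it assumes the parameter is written as a quoted, space-separated list on one line.

```python
import shlex


def read_host_list(conf_text, name):
    """Return the hosts listed for NAME in lsf.conf-style text, or None if unset."""
    hosts = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, rest = line.partition("=")
        if sep and key.strip() == name:
            # shlex removes the surrounding quotes; split() separates the hosts
            hosts = " ".join(shlex.split(rest, comments=True)).split()
    return hosts


# Hypothetical lsf.conf fragment; hostA, hostB, and hostC are placeholder names.
sample = '''
# hosts that clients may contact when the local LIM is unavailable
LSF_SERVER_HOSTS="hostA hostB hostC"
'''

server_hosts = read_host_list(sample, "LSF_SERVER_HOSTS")
print(server_hosts)
```

If the function returns None, the parameter is not defined in the text that was read.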

When the local LIM is running but there is no master LIM in the cluster, LSF applications display the following message:

Cannot locate master LIM now, try later.

Procedure

Check the LIM error logs on the first few hosts listed in the Host section of the lsf.cluster.cluster_name file. If LSF_MASTER_LIST is defined in lsf.conf, check the LIM error logs on the hosts listed in this parameter instead.
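The choice of hosts can be sketched as follows. This is an illustration under stated assumptions: the Host section is taken to be delimited by Begin Host and End Host lines with a column header row first, the sample rows are invented, and first_n=3 is an arbitrary reading of "the first few hosts". The function names are hypothetical.

```python
def cluster_hosts(cluster_text):
    """Return host names from the Begin Host ... End Host section, in file order."""
    hosts, in_section, seen_header = [], False, False
    for raw in cluster_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        words = line.lower().split()
        if words == ["begin", "host"]:
            in_section, seen_header = True, False
        elif words == ["end", "host"]:
            in_section = False
        elif in_section:
            if not seen_header:
                seen_header = True  # first row is the column header
            else:
                hosts.append(line.split()[0])
    return hosts


def hosts_to_check(cluster_text, master_list=None, first_n=3):
    """Prefer LSF_MASTER_LIST when it is defined, else the first few cluster hosts."""
    if master_list:
        return list(master_list)
    return cluster_hosts(cluster_text)[:first_n]


# Invented sample; the column layout is an assumption for illustration.
sample = """
Begin Host
HOSTNAME  model  type  server  RESOURCES
hostA     !      !     1       ()
hostB     !      !     1       ()
hostC     !      !     1       ()
hostD     !      !     0       ()
End Host
"""

print(hosts_to_check(sample))
print(hosts_to_check(sample, master_list=["hostX", "hostY"]))
```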

Master LIM is down

About this task

Sometimes the master LIM is up, but executing the lsload or lshosts command prints the following error message:

Master LIM is down; try later

If the /etc/hosts file on the master LIM host maps the host name to the loopback IP address (127.0.0.1), LSF client LIMs cannot contact the master LIM. When the master LIM starts, it sets its official host name and IP address to the loopback address. Client requests then receive 127.0.0.1 as the master LIM address, so each client tries to connect to itself instead of to the master host.

Procedure

Check the IP configuration of your master LIM in /etc/hosts. The following example incorrectly sets the master LIM IP address to the loopback address:

127.0.0.1         localhost      myhostname

The following example correctly sets the master LIM IP address:

127.0.0.1         localhost
192.168.123.123   myhostname

For a master LIM running on a host that uses an IPv6 address, the loopback address is ::1. The following example correctly sets the master LIM IP address using an IPv6 address:

::1         localhost ipv6-localhost ipv6-loopback
fe00::0     ipv6-localnet
ff00::0     ipv6-mcastprefix
ff02::1     ipv6-allnodes
ff02::2     ipv6-allrouters
ff02::3     ipv6-allhosts
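
The loopback check can also be scripted. The following is a minimal sketch using Python's standard ipaddress module; the function name loopback_names is a hypothetical helper, and the sample text reuses the /etc/hosts entries shown in this procedure.

```python
import ipaddress


def loopback_names(hosts_text):
    """Map each host name bound to a loopback address to that address."""
    found = {}
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # first field is not an IP address
        if addr.is_loopback:  # true for 127.0.0.0/8 and ::1
            for name in fields[1:]:
                found[name] = str(addr)
    return found


incorrect = "127.0.0.1         localhost      myhostname\n"
correct = "127.0.0.1         localhost\n192.168.123.123   myhostname\n"

print("myhostname" in loopback_names(incorrect))
print("myhostname" in loopback_names(correct))
```

Any name other than localhost (and its IPv6 aliases) that appears in the result for the master host is worth reviewing.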

RES does not start

Procedure

Check the RES error log.

User permission denied

About this task

If remote execution fails with the following error message, the remote host could not securely determine the user ID of the user requesting remote execution.

User permission denied.

Procedure

  1. Check the RES error log on the remote host; this usually contains a more detailed error message.
  2. If you are not using an identification daemon (LSF_AUTH is not defined in the lsf.conf file), then all applications that do remote executions must be owned by root with the setuid bit set. This can be done as follows.

    chmod 4755 filename

  3. If the binaries are on an NFS-mounted file system, make sure that the file system is not mounted with the nosuid flag.
  4. If you are using an identification daemon (defined in the lsf.conf file by LSF_AUTH), inetd must be configured to run the daemon. The identification daemon must not be run directly.
  5. If LSF_USE_HOSTEQUIV is defined in the lsf.conf file, check that /etc/hosts.equiv or $HOME/.rhosts on the destination host lists the client host name. Host names that are inconsistent between the name server, /etc/hosts, and /etc/hosts.equiv can also cause this problem.
  6. For Windows hosts, users must register and update their Windows passwords using the lspasswd command. Passwords must be between 3 and 31 characters long.

    For Windows password authentication in a non-shared file system environment, you must define the parameter LSF_MASTER_LIST in lsf.conf so that jobs will run with correct permissions. If you do not define this parameter, LSF assumes that the cluster uses a shared file system environment.
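
The ownership and setuid requirement in step 2 can be verified with a short script. This is a minimal sketch using the standard os and stat modules; the function names are hypothetical, and the script only inspects the file, it does not change permissions.

```python
import os
import stat


def is_setuid_root(mode, uid):
    """True if the file is owned by root (uid 0) and has the setuid bit set."""
    return uid == 0 and bool(mode & stat.S_ISUID)


def check_binary(path):
    """Apply is_setuid_root to a file on disk."""
    st = os.stat(path)
    return is_setuid_root(st.st_mode, st.st_uid)


# 0o104755 is a regular file with mode 4755, the result of "chmod 4755 filename".
print(is_setuid_root(0o104755, 0))
# Same owner without the setuid bit.
print(is_setuid_root(0o100755, 0))
```

Note that this check does not detect a nosuid mount (step 3); a file can carry the setuid bit and still have it ignored by the file system.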

Non-uniform file name space

About this task

A command may fail with the following error message due to a non-uniform file name space.

chdir(...) failed: no such file or directory

You are trying to execute a command remotely, where either your current working directory does not exist on the remote host, or your current working directory is mapped to a different name on the remote host.

Procedure

If your current working directory does not exist on a remote host, you should not execute commands remotely on that host.

On UNIX

Procedure

  • If the directory exists, but is mapped to a different name on the remote host, you have to create symbolic links to make them consistent.
  • LSF can resolve most, but not all, problems using automount. The automount maps must be managed through NIS.

    Follow the instructions in your Release Notes for obtaining technical support if you are running automount and LSF is not able to locate directories on remote hosts.

Batch daemons die quietly

Procedure

First, check the sbatchd and mbatchd error logs. Try running the following command to check the configuration.

badmin ckconfig

This reports most errors. You should also check if there is any email in the LSF administrator’s mailbox. If the mbatchd is running but the sbatchd dies on some hosts, it may be because mbatchd has not been configured to use those hosts.

sbatchd starts but mbatchd does not

Procedure

  1. Check whether LIM is running. You can test this by running the lsid command. If LIM is not running properly, follow the suggestions in this chapter to fix the LIM first. It is possible that mbatchd is temporarily unavailable because the master LIM is temporarily unknown, causing the following error message.

    sbatchd: unknown service

  2. Check whether services are registered properly.

Detached processes

About this task

LSF uses process groups to keep track of all the processes of a job. See Process tracking through cgroups for more details.

Procedure

  1. When a job is launched, the application runs under the job-RES (or root) process group.
  2. If an application creates a new process group, and its PPID still belongs to the job, the PIM can track this new process group as part of the job.

    However, if the application forks a child, the child creates a new process group, and the parent exits immediately, the child process group is orphaned and cannot be tracked.

    Any process that daemonizes itself is almost certainly lost, because it changes its process group right after being detached and so orphans its child processes. The only reliable way to not lose track of a process is to prevent it from using a new process group.
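
The effect of a new process group can be observed outside LSF. The following sketch runs on POSIX systems only and does not involve PIM or RES; it starts one child normally and one in a new session (the setsid call that daemonizing programs typically make), then compares process group IDs. The helper name child_ids is hypothetical.

```python
import os
import subprocess
import sys

# The child prints its own PID and its process group ID.
CHILD = "import os; print(os.getpid(), os.getpgid(0))"


def child_ids(new_session):
    """Run a child interpreter and return (pid, pgid) as the child sees them."""
    out = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,
        text=True,
        check=True,
        start_new_session=new_session,  # True makes the child call setsid()
    ).stdout.split()
    return int(out[0]), int(out[1])


parent_pgid = os.getpgid(0)
same_pid, same_pgid = child_ids(new_session=False)
new_pid, new_pgid = child_ids(new_session=True)

print("inherits parent's group:", same_pgid == parent_pgid)
print("leads its own group:", new_pgid == new_pid and new_pgid != parent_pgid)
```

The second child no longer shares a process group with its launcher, which is the situation in which a job's processes can escape tracking.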

Host not used by LSF

About this task

mbatchd allows sbatchd to run only on the hosts that are listed in the Host section of the lsb.hosts file. If you try to configure an unknown host in the HostGroup or HostPartition sections of the lsb.hosts file, or as a HOSTS definition for a queue in the lsb.queues file, mbatchd logs the following message.

mbatchd on host: LSB_CONFDIR/cluster1/configdir/file(line #): Host hostname is not used by lsbatch; ignored

If you start sbatchd on a host that is not known by mbatchd, mbatchd rejects the sbatchd. The sbatchd logs the following message and exits.

This host is not used by lsbatch system.

Procedure

Run the following commands, in order, after adding a host to the configuration and before starting the daemons on the new host:

lsadmin reconfig

badmin reconfig

View UNKNOWN host type or model

Procedure

Run lshosts. A model or type UNKNOWN indicates that the host is down or the LIM on the host is down. You need to take immediate action. For example:

lshosts
HOST_NAME  type     model   cpuf  ncpus  maxmem  maxswp  server  RESOURCES
hostA      UNKNOWN  Ultra2  20.2      2    256M    710M     Yes  ()
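
On a large cluster it can help to filter the lshosts output rather than read it by eye. The following is a minimal sketch that assumes the column layout shown in the example above; the function name hosts_with_value is hypothetical, and the hostB row is invented for contrast.

```python
def hosts_with_value(lshosts_text, value="UNKNOWN"):
    """Return host names whose type or model column equals the given value."""
    lines = [ln for ln in lshosts_text.splitlines() if ln.strip()]
    header = lines[0].split()
    type_col, model_col = header.index("type"), header.index("model")
    flagged = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) <= max(type_col, model_col):
            continue  # skip truncated rows
        if value in (fields[type_col], fields[model_col]):
            flagged.append(fields[0])
    return flagged


# hostA is taken from the example above; hostB is an invented healthy host.
sample = """\
HOST_NAME  type     model   cpuf  ncpus  maxmem  maxswp  server  RESOURCES
hostA      UNKNOWN  Ultra2  20.2      2    256M    710M     Yes  ()
hostB      SOL64    Ultra2  20.2      2    256M    710M     Yes  ()
"""

print(hosts_with_value(sample))
```

In practice the text would come from the output of the lshosts command rather than from a literal string.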

Fix UNKNOWN matched host type or matched model

Procedure

  1. Start the host.
  2. Run lsadmin limstartup to start LIM on the host.
    For example:
    lsadmin limstartup hostA
    Starting up LIM on <hostA> .... done
    or, if EGO is enabled in the LSF cluster, you can also run:
    egosh ego start lim hostA
    Starting up LIM on <hostA> .... done

    You can specify more than one host name to start up LIM on multiple hosts. If you do not specify a host name, LIM is started up on the host from which the command is submitted.

    On UNIX, in order to start up LIM remotely, you must be root or listed in lsf.sudoers (or ego.sudoers if EGO is enabled in the LSF cluster) and be able to run the rsh command across all hosts without entering a password.

  3. Wait a few seconds, then run lshosts again. You should now be able to see a specific model or type for the host or DEFAULT. If you see DEFAULT, it means that automatic detection of host type or model has failed, and the host type configured in lsf.shared cannot be found. LSF will work on the host, but a DEFAULT model may be inefficient because of incorrect CPU factors. A DEFAULT type may also cause binary incompatibility because a job from a DEFAULT host type can be migrated to another DEFAULT host type.

View DEFAULT host type or model

About this task

If you see DEFAULT in lim -t, it means that automatic detection of host type or model has failed, and the host type configured in lsf.shared cannot be found. LSF will work on the host, but a DEFAULT model may be inefficient because of incorrect CPU factors. A DEFAULT type may also cause binary incompatibility because a job from a DEFAULT host type can be migrated to another DEFAULT host type.

Procedure

Run lshosts. If model or type is displayed as DEFAULT and automatic host model and type detection is enabled, you can leave it as is or change it. For example:

lshosts
HOST_NAME  type     model    cpuf  ncpus  maxmem  maxswp  server  RESOURCES
hostA      DEFAULT  DEFAULT     1      2    256M    710M     Yes  ()

If model is DEFAULT, LSF will work correctly but the host will have a CPU factor of 1, which may not make efficient use of the host model.

If type is DEFAULT, there may be binary incompatibility. For example, suppose there are two hosts, one Solaris and one HP. If both hosts are set to type DEFAULT, jobs running on the Solaris host can be migrated to the HP host and vice versa.

Fix DEFAULT matched host type or matched model

Procedure

  1. Run lim -t on the host whose type is DEFAULT:
    lim -t
    Host Type             : NTX64
    Host Architecture     : EM64T_1596
    Total NUMA Nodes      : 1
    Total Processors      : 2
    Total Cores           : 4
    Total Threads         : 2
    Matched Type          : NTX64
    Matched Architecture  : EM64T_3000
    Matched Model         : Intel_EM64T
    CPU Factor            : 60.0

    Note the value of Host Type and Host Architecture.

  2. Edit lsf.shared.
    1. In the HostType section, enter a new host type. Use the host type name detected with lim -t. For example:
      Begin HostType
      TYPENAME 
      DEFAULT 
      CRAYJ
      LINUX86
      ...
      End HostType
    2. In the HostModel section, enter the new host model with architecture and CPU factor. Use the architecture detected with lim -t. Add the host model to the end of the host model list. The limit for host model entries is 127. Lines commented out with # are not counted in the 127-line limit. For example:
      Begin HostModel
      MODELNAME   CPUFACTOR     ARCHITECTURE # keyword
      Ultra2      20             SUNWUltra2_200_sparcv9
      End HostModel
  3. Save changes to lsf.shared.
  4. Run lsadmin reconfig to reconfigure LIM.
  5. Wait a few seconds, and run lim -t again to check the type and model of the host.
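
The values noted in step 1 can be extracted from the lim -t output with a small parser. This is a minimal sketch that assumes the "name : value" layout shown in step 1; the function name parse_lim_t is hypothetical, and the sample is an abridged copy of that output.

```python
def parse_lim_t(output):
    """Turn 'name : value' lines from lim -t into a dictionary."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


# Abridged from the lim -t example in step 1.
sample = """\
Host Type             : NTX64
Host Architecture     : EM64T_1596
Matched Type          : NTX64
Matched Architecture  : EM64T_3000
Matched Model         : Intel_EM64T
CPU Factor            : 60.0
"""

info = parse_lim_t(sample)

# A DEFAULT matched type or model means lsf.shared needs a new entry.
needs_edit = "DEFAULT" in (info.get("Matched Type"), info.get("Matched Model"))

print(info["Host Type"], info["Host Architecture"], needs_edit)
```

The Host Type and Host Architecture values are the ones to enter in the HostType and HostModel sections of lsf.shared.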