Most problems are due to incorrect installation or configuration.
Often the log message points directly to the problem.
To check the LIM configuration, run the following command:
lsadmin ckconfig -v
This command displays most configuration errors. If it reports no errors, check the LIM error log.
Sometimes the LIM is up, but executing the lsload command prints the following error message:
Communication time out.
If the LIM has just been started, this is normal: the LIM needs time to initialize by reading its configuration files and contacting other LIMs. If the LIM does not become available within one or two minutes, check the LIM error log for the host you are working on.
To prevent communication timeouts when starting or restarting the local LIM, define the parameter LSF_SERVER_HOSTS in the lsf.conf file. The client will contact the LIM on one of the LSF_SERVER_HOSTS and execute the command, provided that at least one of the hosts defined in the list has a LIM that is up and running.
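As an illustration, an lsf.conf entry for this parameter might look like the following fragment. The host names hostA, hostB, and hostC are placeholders, not values from this guide; substitute server hosts from your own cluster.

```shell
# lsf.conf (fragment)
# hostA, hostB, and hostC are placeholder names for LSF server hosts
# whose LIMs are expected to be up and running.
LSF_SERVER_HOSTS="hostA hostB hostC"
```

Listing more than one host means a client command can still be served while one of the listed LIMs is starting or restarting.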
When the local LIM is running but there is no master LIM in the cluster, LSF applications display the following message:
Cannot locate master LIM now, try later.
Sometimes the master LIM is up, but executing the lsload or lshosts command prints the following error message:
Master LIM is down; try later
If the /etc/hosts file on the host where the master LIM is running assigns the host name to the loopback IP address (127.0.0.1), LSF client LIMs cannot contact the master LIM. When the master LIM starts up, it sets its official host name and IP address to the loopback address. Client requests then receive 127.0.0.1 as the master LIM address, so each client tries to connect to itself instead of to the master host.
The following example incorrectly assigns the host name to the loopback address:
127.0.0.1 localhost myhostname
The following example correctly sets the master LIM IP address:
127.0.0.1 localhost
192.168.123.123 myhostname
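For a host that uses IPv6, the loopback address is ::1. The following example shows the corresponding IPv6 entries: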
::1 localhost ipv6-localhost ipv6-loopback
fe00::0 ipv6-localnet
ff00::0 ipv6-mcastprefix
ff02::1 ipv6-allnodes
ff02::2 ipv6-allrouters
ff02::3 ipv6-allhosts
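The check described above can be sketched as a small shell function. This is only an illustrative helper, not part of LSF: the function name loopback_mapped is hypothetical, and it assumes a hosts file in the usual "address name [alias ...]" layout.

```shell
# Hypothetical helper (not an LSF command).
# Usage: loopback_mapped HOSTS_FILE HOST_NAME
# Returns 0 if HOST_NAME appears on a loopback line (127.x.x.x or ::1)
# of HOSTS_FILE, and 1 otherwise.
loopback_mapped() {
    awk -v name="$2" '
        { sub(/#.*/, "") }                    # drop trailing comments
        $1 ~ /^127\./ || $1 == "::1" {        # loopback address lines only
            for (i = 2; i <= NF; i++)
                if ($i == name) found = 1     # name or alias matches
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Example call on the master host:
#   loopback_mapped /etc/hosts "$(uname -n)" && echo "host name maps to loopback"
```

If the function reports a match on the master host, move the host name to a line with the host's real IP address, as in the corrected example above.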
If remote execution fails with the following error message, the remote host could not securely determine the user ID of the user requesting remote execution.
User permission denied.
A command may fail with the following error message due to a non-uniform file name space.
chdir(...) failed: no such file or directory
You are trying to execute a command remotely, where either your current working directory does not exist on the remote host, or your current working directory is mapped to a different name on the remote host.
To check the batch configuration, run the following command:
badmin ckconfig
This command reports most errors. Also check whether there is any email in the LSF administrator’s mailbox. If mbatchd is running but sbatchd dies on some hosts, mbatchd may not have been configured to use those hosts.
LSF uses process groups to keep track of all the processes of a job. See Process tracking through cgroups for more details.
mbatchd allows sbatchd to run only on the hosts that are listed in the Host section of the lsb.hosts file. If you try to configure an unknown host in the HostGroup or HostPartition sections of the lsb.hosts file, or as a HOSTS definition for a queue in the lsb.queues file, mbatchd logs the following message.
mbatchd on host: LSB_CONFDIR/cluster1/configdir/file(line #): Host hostname is not used by lsbatch; ignored
If you start sbatchd on a host that is not known by mbatchd, mbatchd rejects the sbatchd. The sbatchd logs the following message and exits.
This host is not used by lsbatch system.
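After you add the host to the Host section of the lsb.hosts file, run the following commands to reconfigure the cluster:
lsadmin reconfig
badmin reconfig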
lshosts
HOST_NAME   type     model    cpuf  ncpus  maxmem  maxswp  server  RESOURCES
hostA       UNKNOWN  Ultra2   20.2  2      256M    710M    Yes     ()
If lim -t reports DEFAULT, automatic detection of the host type or model has failed, and the host type configured in lsf.shared cannot be found. LSF still works on the host, but a DEFAULT model may be inefficient because of incorrect CPU factors. A DEFAULT type may also cause binary incompatibility, because a job from one DEFAULT host type can be migrated to another DEFAULT host type.
lshosts
HOST_NAME   type     model    cpuf  ncpus  maxmem  maxswp  server  RESOURCES
hostA       DEFAULT  DEFAULT  1     2      256M    710M    Yes     ()
If model is DEFAULT, LSF works correctly, but the host has a CPU factor of 1, which may not make efficient use of the host model.
If type is DEFAULT, there may be binary incompatibility. For example, suppose there are two hosts, one Solaris and the other HP. If both hosts are set to type DEFAULT, jobs running on the Solaris host can be migrated to the HP host and vice versa.
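To spot such hosts across a larger cluster, the lshosts output can be filtered with a short script. The following is a minimal sketch, assuming the column layout shown in the examples above (host name, type, and model in the first three columns); the function name flag_undetected_hosts is hypothetical and not part of LSF.

```shell
# Hypothetical helper (not an LSF command).
# Reads lshosts output on standard input and prints every host whose
# type or model column is UNKNOWN or DEFAULT.
flag_undetected_hosts() {
    awk '
        $1 == "HOST_NAME" { next }            # skip the header line
        $2 == "UNKNOWN" || $2 == "DEFAULT" ||
        $3 == "UNKNOWN" || $3 == "DEFAULT" {
            printf "%s type=%s model=%s\n", $1, $2, $3
        }
    '
}

# Example call:
#   lshosts | flag_undetected_hosts
```

The sketch only matches on the three leading columns, so it may need adjusting if your lshosts output is formatted differently, for example when long host names are wrapped.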