Error messages

The error messages in this section are logged by the LSF daemons or displayed by the following commands:

lsadmin ckconfig
badmin ckconfig

General errors

The messages listed in this section may be generated by any LSF daemon.

can’t open file: error

The daemon could not open the named file for the reason given by error. This error is usually caused by incorrect file permissions or missing files. All directories in the path to the configuration files must have execute (x) permission for the LSF administrator, and the actual files must have read (r) permission. Missing files could be caused by incorrect path names in the lsf.conf file, running LSF daemons on a host where the configuration files have not been installed, or having a symbolic link pointing to a nonexistent file or directory.
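The permission requirements above can be checked with a short script. The following is a minimal sketch, not an LSF tool: the function name check_config_path is hypothetical, and os.access tests the permissions of the user who runs the script, so the result is only meaningful when you run it as the LSF administrator.

```python
import os


def check_config_path(path):
    """Return a list of problems that would prevent `path` from being opened."""
    problems = []
    path = os.path.abspath(path)
    parts = path.split(os.sep)

    # Every directory leading to the file needs execute (x) permission.
    current = os.sep
    for part in parts[1:-1]:
        current = os.path.join(current, part)
        if not os.path.isdir(current):
            problems.append(f"{current}: missing or not a directory")
            return problems
        if not os.access(current, os.X_OK):
            problems.append(f"{current}: no execute (x) permission")

    # The file itself must exist and be readable.
    if os.path.islink(path) and not os.path.exists(path):
        problems.append(f"{path}: symbolic link points to a nonexistent target")
    elif not os.path.exists(path):
        problems.append(f"{path}: file is missing")
    elif not os.access(path, os.R_OK):
        problems.append(f"{path}: no read (r) permission")
    return problems
```

An empty list means that none of the causes described above was detected for that path.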

file(line): malloc failed

Memory allocation failed. Either the host does not have enough available memory or swap space, or there is an internal error in the daemon. Check the program load and available swap space on the host; if the swap space is full, you must add more swap space or run fewer (or smaller) programs on that host.

auth_user: getservbyname(ident/tcp) failed: error; ident must be registered in services

LSF_AUTH=ident is defined in the lsf.conf file, but the ident/tcp service is not defined in the services database. Add ident/tcp to the services database, or remove LSF_AUTH from the lsf.conf file and make the LSF binaries that require authentication setuid root.

auth_user: operation(<host>/<port>) failed: error

LSF_AUTH=ident is defined in the lsf.conf file, but the LSF daemon failed to contact the identd daemon on host. Check that identd is defined in inetd.conf and the identd daemon is running on host.

auth_user: Authentication data format error (rbuf=<data>) from <host>/<port>

auth_user: Authentication port mismatch (...) from <host>/<port>

LSF_AUTH=ident is defined in the lsf.conf file, but there is a protocol error between LSF and the ident daemon on host. Make sure that the ident daemon on the host is configured correctly.

userok: Request from bad port (<port_number>), denied

LSF_AUTH is not defined, and the LSF daemon received a request that originates from a non-privileged port. The request is not serviced.

Set the LSF binaries to be owned by root with the setuid bit set, or define LSF_AUTH=ident and set up an ident server on all hosts in the cluster. If the binaries are on an NFS-mounted file system, make sure that the file system is not mounted with the nosuid flag.
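The ownership, setuid, and nosuid conditions described above can be inspected from a script. The following is a minimal sketch under stated assumptions: the function name setuid_root_problems is hypothetical, and you must supply the path of each binary yourself because this section does not list which LSF binaries require authentication.

```python
import os
import stat


def setuid_root_problems(path):
    """Return the reasons why `path` cannot run as a setuid root binary."""
    problems = []
    st = os.stat(path)
    if st.st_uid != 0:
        problems.append("not owned by root")
    if not st.st_mode & stat.S_ISUID:
        problems.append("setuid bit is not set")
    # ST_NOSUID in the mount flags means that the file system ignores
    # setuid bits, which is the nosuid case for NFS-mounted binaries.
    if os.statvfs(path).f_flag & os.ST_NOSUID:
        problems.append("file system is mounted with nosuid")
    return problems
```

An empty list means that the binary is owned by root, has the setuid bit set, and is on a file system that honors it.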

userok: Forged username suspected from <host>/<port>: <claimed_user>/<actual_user>

The service request claimed to come from user claimed_user but ident authentication returned that the user was actually actual_user. The request was not serviced.

userok: ruserok(<host>,<uid>) failed

LSF_USE_HOSTEQUIV is defined in the lsf.conf file, but host has not been set up as an equivalent host (see /etc/hosts.equiv), and user uid has not set up a .rhosts file.

init_AcceptSock: RES service(res) not registered, exiting

init_AcceptSock: res/tcp: unknown service, exiting

initSock: LIM service not registered.

initSock: Service lim/udp is unknown. Read LSF Guide for help

get_ports: <serv> service not registered

The LSF services are not registered.
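To see which services the system can resolve, you can perform the same kind of lookup that the messages above refer to. The following is a minimal sketch: the service names lim/udp and res/tcp are taken from the messages, while the function names are hypothetical. It only consults the services database and does not read any LSF configuration file.

```python
import socket

# Service names and protocols as they appear in the messages above.
LSF_SERVICES = [("lim", "udp"), ("res", "tcp")]


def lookup_service(name, proto):
    """Return the registered port number, or None if the service is unknown."""
    try:
        return socket.getservbyname(name, proto)
    except OSError:
        return None


def unregistered(services=LSF_SERVICES):
    """Return the name/protocol pairs that have no entry in the services database."""
    return [f"{name}/{proto}" for name, proto in services
            if lookup_service(name, proto) is None]
```

Calling unregistered() on a host returns the entries that still need to be registered. The same lookup works for ident/tcp by calling lookup_service("ident", "tcp").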

init_AcceptSock: Can’t bind daemon socket to port <port>: error, exiting

init_ServSock: Could not bind socket to port <port>: error

These error messages can occur if you try to start a second LSF daemon (for example, RES is already running, and you execute RES again). If this is the case, and you want to start the new daemon, kill the running daemon or use the lsadmin or badmin commands to shut down or restart the daemon.
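Before you start a daemon, you can test whether its port is already bound by another process. The following is a minimal sketch that covers TCP ports only; the function name port_in_use is hypothetical, and the port number to test is whatever your cluster has registered for the daemon.

```python
import errno
import socket


def port_in_use(port, host="0.0.0.0"):
    """Return True if a TCP bind to (host, port) fails because the address is in use."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError as e:
            if e.errno == errno.EADDRINUSE:
                return True
            # Other failures (for example, EACCES on a privileged port
            # when not running as root) are not "already running" cases.
            raise
    return False
```

A result of True is consistent with a daemon that is already running on that port, in which case shut it down or restart it as described above instead of starting a second copy.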

Configuration errors

The messages listed in this section are caused by problems in the LSF configuration files. General errors are listed first, and then errors from specific files.

file(line): Section name expected after Begin; ignoring section

file(line): Invalid section name name; ignoring section

The keyword Begin at the specified line is not followed by a section name, or is followed by an unrecognized section name.

file(line): section section: Premature EOF

The end of the file was reached before the End section line for the named section was read.

file(line): keyword line format error for section section; Ignore this section

The first line of the section should contain a list of keywords. This error is printed when the keyword line is incorrect or contains an unrecognized keyword.

file(line): values do not match keys for section section; Ignoring line

The number of fields on a line in a configuration section does not match the number of keywords. This may be caused by not putting () in a column to represent the default value.
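The structural errors described so far (missing section name, premature end of file, and a field count that does not match the keyword line) can be illustrated with a small checker. The following is a simplified sketch and not the parser that LSF uses: it assumes that # starts a comment, that a parenthesized group such as () counts as one field, and that the first line after Begin is the keyword line. The function name check_sections and the sample section content in the usage note are hypothetical.

```python
import re

# A field is either a parenthesized group, such as (), or a run of non-blanks.
FIELD = re.compile(r"\([^)]*\)|\S+")


def check_sections(text):
    """Return (line_number, message) pairs for structural problems in `text`."""
    problems = []
    section = None   # name of the section currently being read
    keywords = None  # keyword line of that section
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = FIELD.findall(line)
        first = fields[0].lower()
        if section is None:
            if first == "begin":
                if len(fields) < 2:
                    problems.append((lineno, "section name expected after Begin"))
                else:
                    section, keywords = fields[1], None
            continue
        if first == "end":
            section = None
        elif keywords is None:
            keywords = fields
        elif len(fields) != len(keywords):
            problems.append(
                (lineno, f"values do not match keys for section {section}"))
    if section is not None:
        problems.append((lineno, f"section {section}: premature EOF"))
    return problems
```

For example, a value line that leaves out the () placeholder for a default value has one field fewer than the keyword line, so the checker reports it with the line number in the same way that the daemon message does.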

file: HostModel section missing or invalid

file: Resource section missing or invalid

file: HostType section missing or invalid

The HostModel, Resource, or HostType section in the lsf.shared file is either missing or contains an unrecoverable error.

file(line): Name name reserved or previously defined. Ignoring index

The name assigned to an external load index must not be the same as any built-in or previously defined resource or load index.

file(line): Duplicate clustername name in section cluster. Ignoring current line

A cluster name is defined twice in the same lsf.shared file. The second definition is ignored.

file(line): Bad cpuFactor for host model model. Ignoring line

The CPU factor declared for the named host model in the lsf.shared file is not a valid number.

file(line): Too many host models, ignoring model name

You can declare a maximum of 127 host models in the lsf.shared file.

file(line): Resource name name too long in section resource. Should be less than 40 characters. Ignoring line

The maximum length of a resource name is 39 characters. Choose a shorter name for the resource.

file(line): Resource name name reserved or previously defined. Ignoring line.

You have attempted to define a resource name that is reserved by LSF or already defined in the lsf.shared file. Choose another name for the resource.

file(line): illegal character in resource name: name, section resource. Line ignored.

Resource names must begin with a letter in the set [a-zA-Z], followed by letters, digits, or underscores [a-zA-Z0-9_].
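The length and character rules for resource names can be checked before you edit the lsf.shared file. The following is a minimal sketch of those two rules only: the function name resource_name_problems is hypothetical, and the check for reserved or previously defined names is not covered because it depends on your cluster configuration.

```python
import re

MAX_RESOURCE_NAME_LENGTH = 39
# A letter first, then letters, digits, or underscores.
RESOURCE_NAME = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")


def resource_name_problems(name):
    """Return the rule violations for a proposed resource name."""
    problems = []
    if len(name) > MAX_RESOURCE_NAME_LENGTH:
        problems.append("name too long")
    if not RESOURCE_NAME.fullmatch(name):
        problems.append("illegal character in resource name")
    return problems
```

A name such as bigmem_2 passes both rules, while 2bigmem fails the character rule because it begins with a digit.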

LIM messages

The following messages are logged by the LIM:

findHostbyAddr/<proc>: Host <host>/<port> is unknown by <myhostname>

function: Gethostbyaddr_(<host>/<port>) failed: error

main: Request from unknown host <host>/<port>: error

function: Received request from non-LSF host <host>/<port>

The daemon does not recognize host. The request is not serviced. These messages can occur if host was added to the configuration files, but not all the daemons have been reconfigured to read the new information. If the problem still occurs after reconfiguring all the daemons, check whether the host is a multi-addressed host.
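One way to check whether a host is multi-addressed is to list the addresses that its name resolves to. The following is a minimal sketch that uses the standard resolver on the host where you run it; the function names are hypothetical, and the resolver argument exists only so that the lookup can be replaced in tests.

```python
import socket


def host_addresses(host, resolver=socket.gethostbyname_ex):
    """Return the sorted IPv4 addresses that `host` resolves to, or [] on failure."""
    try:
        _name, _aliases, addresses = resolver(host)
    except OSError:  # includes socket.gaierror and socket.herror
        return []
    return sorted(set(addresses))


def is_multi_addressed(host, resolver=socket.gethostbyname_ex):
    """Return True if `host` resolves to more than one address."""
    return len(host_addresses(host, resolver)) > 1
```

Run the check on the host that logged the message, because name resolution can differ from host to host.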

rcvLoadVector: Sender (<host>/<port>) may have different config?

MasterRegister: Sender (host) may have different config?

LIM detected inconsistent configuration information with the sending LIM. Run the following command so that all the LIMs have the same configuration information.

lsadmin reconfig

Note any hosts that failed to be contacted.

rcvLoadVector: Got load from client-only host <host>/<port>. Kill LIM on <host>/<port>

A LIM is running on a client host. Run the following command, or go to the client host and kill the LIM daemon.

lsadmin limshutdown host

saveIndx: Unknown index name <name> from ELIM

LIM received an external load index name that is not defined in the lsf.shared file. If name is defined in lsf.shared, reconfigure the LIM. Otherwise, add name to the lsf.shared file and reconfigure all the LIMs.

saveIndx: ELIM over-riding value of index <name>

This is a warning message. The ELIM sent a value for one of the built-in index names. LIM uses the value from ELIM in place of the value obtained from the kernel.

getusr: Protocol error numIndx not read (cc=num): error

getusr: Protocol error on index number (cc=num): error

Protocol error between ELIM and LIM.

RES messages

These messages are logged by the RES.

doacceptconn: getpwnam(<username>@<host>/<port>) failed: error

doacceptconn: User <username> has uid <uid1> on client host <host>/<port>, uid <uid2> on RES host; assume bad user

authRequest: username/uid <userName>/<uid>@<host>/<port> does not exist

authRequest: Submitter’s name <clname>@<clhost> is different from name <lname> on this host

RES assumes that a user has the same userID and username on all the LSF hosts. These messages occur if this assumption is violated. If the user is allowed to use LSF for interactive remote execution, make sure the user’s account has the same userID and username on all LSF hosts.
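To confirm that an account is consistent, you can compare the user ID that each host reports for the same username. The following is a minimal sketch of the check for a single host: the function names are hypothetical, the expected user ID is a value that you obtain from another LSF host, and the lookup argument exists only so that the password database can be replaced in tests.

```python
import pwd


def local_uid(username, lookup=pwd.getpwnam):
    """Return the user ID of `username` on this host, or None if it does not exist."""
    try:
        return lookup(username).pw_uid
    except KeyError:
        return None


def account_matches(username, expected_uid, lookup=pwd.getpwnam):
    """Return True if `username` exists on this host with user ID `expected_uid`."""
    uid = local_uid(username, lookup)
    return uid is not None and uid == expected_uid
```

Running the same check on every LSF host with the same username and expected user ID shows which hosts violate the assumption that RES makes.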

doacceptconn: root remote execution permission denied

authRequest: root job submission rejected

Root tried to execute or submit a job but LSF_ROOT_REX is not defined in the lsf.conf file.

resControl: operation permission denied, uid = <uid>

The user with user ID uid is not allowed to make RES control requests. Only the LSF manager, or root if LSF_ROOT_REX is defined in lsf.conf, can make RES control requests.

resControl: access(respath, X_OK): error

The RES received a reboot request, but failed to find the file respath to re-execute itself. Make sure respath contains the RES binary, and it has execution permission.

mbatchd and sbatchd messages

The following messages are logged by the mbatchd and sbatchd daemons:

renewJob: Job <jobId>: rename(<from>,<to>) failed: error

mbatchd failed to resubmit a rerunnable job. Check that the file from exists and that the LSF administrator can rename the file. If from is in an AFS directory, check that the LSF administrator’s token processing is set up properly.

logJobInfo_: fopen(<logdir/info/jobfile>) failed: error

logJobInfo_: write <logdir/info/jobfile> <data> failed: error

logJobInfo_: seek <logdir/info/jobfile> failed: error

logJobInfo_: write <logdir/info/jobfile> xdrpos <pos> failed: error

logJobInfo_: write <logdir/info/jobfile> xdr buf len <len> failed: error

logJobInfo_: close(<logdir/info/jobfile>) failed: error

rmLogJobInfo: Job <jobId>: can’t unlink(<logdir/info/jobfile>): error

rmLogJobInfo_: Job <jobId>: can’t stat(<logdir/info/jobfile>): error

readLogJobInfo: Job <jobId> can’t open(<logdir/info/jobfile>): error

start_job: Job <jobId>: readLogJobInfo failed: error

readLogJobInfo: Job <jobId>: can’t read(<logdir/info/jobfile>) size size: error

initLog: mkdir(<logdir/info>) failed: error

<fname>: fopen(<logdir/file> failed: error

getElogLock: Can’t open existing lock file <logdir/file>: error

getElogLock: Error in opening lock file <logdir/file>: error

releaseElogLock: unlink(<logdir/lockfile>) failed: error

touchElogLock: Failed to open lock file <logdir/file>: error

touchElogLock: close <logdir/file> failed: error

mbatchd failed to create, remove, read, or write the log directory or a file in the log directory, for the reason given in error. Check that the LSF administrator has read, write, and execute permissions on the logdir directory.

replay_newjob: File <logfile> at line <line>: Queue <queue> not found, saving to queue <lost_and_found>

replay_switchjob: File <logfile> at line <line>: Destination queue <queue> not found, switching to queue <lost_and_found>

When mbatchd was reconfigured, jobs were found in queue but that queue is no longer in the configuration.

replay_startjob: JobId <jobId>: exec host <host> not found, saving to host <lost_and_found>

When mbatchd was reconfigured, the event log contained jobs dispatched to host, but that host is no longer configured to be used by LSF.

do_restartReq: Failed to get hData of host <host_name>/<host_addr>

mbatchd received a request from sbatchd on host host_name, but that host is not known to mbatchd. Either the configuration file has been changed but mbatchd has not been reconfigured to pick up the new configuration, or host_name is a client host but the sbatchd daemon is running on that host. Run the following command to reconfigure mbatchd, or kill the sbatchd daemon on host_name.

badmin reconfig

LSF command messages

LSF daemon (LIM) not responding ... still trying

During a LIM restart, LSF commands fail and display this error message. User programs that are linked to the LIM API fail for the same reason. This message is displayed when a LIM that runs on a host in the master host list or server host list is restarted after a configuration change, such as adding new resources or upgrading binaries.

Use LSF_LIM_API_NTRIES in lsf.conf or as an environment variable to define how many times LSF commands will retry to communicate with the LIM API while LIM is not available. LSF_LIM_API_NTRIES is ignored by LSF and EGO daemons and all EGO commands.

When LSB_API_VERBOSE=Y in lsf.conf, LSF batch commands will display the not responding retry error message to stderr when LIM is not available.

When LSB_API_VERBOSE=N in lsf.conf, LSF batch commands will not display the retry error message when LIM is not available.
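The interaction between the retry count and the verbose setting can be pictured as a retry loop. The following sketch only illustrates the behavior described above and is not how LSF implements it: the names call_lim and LimNotResponding are hypothetical, the delay between attempts is an invented parameter, and ntries is assumed to count the total number of attempts.

```python
import sys
import time


class LimNotResponding(Exception):
    """Raised by a request function when LIM cannot be reached."""


def call_lim(request, ntries=1, verbose=True, delay=0.0, err=sys.stderr):
    """Call request() up to `ntries` times, reporting retries when verbose."""
    attempts = max(1, ntries)
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except LimNotResponding:
            if attempt == attempts:
                raise  # out of attempts: the command fails
            if verbose:
                # Plays the role of LSB_API_VERBOSE=Y: report each retry.
                print("LSF daemon (LIM) not responding ... still trying",
                      file=err)
            time.sleep(delay)
```

With verbose set to False, the loop retries silently, which corresponds to the LSB_API_VERBOSE=N case.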

Batch command client messages

LSF displays error messages when a batch command cannot communicate with mbatchd. The following table provides a list of possible error reasons and the associated error message output.

| Point of failure | Possible reason | Error message output |
| --- | --- | --- |
| Establishing a connection with mbatchd | mbatchd is too busy to accept new connections. The connect() system call times out. | LSF is processing your request. Please wait… |
| Establishing a connection with mbatchd | mbatchd is down or there is no process listening at either the LSB_MBD_PORT or the LSB_QUERY_PORT | LSF is down. Please wait… |
| Establishing a connection with mbatchd | mbatchd is down and the LSB_QUERY_PORT is busy | bhosts displays "LSF is down. Please wait…"; bjobs displays "Cannot connect to LSF. Please wait…" |
| Establishing a connection with mbatchd | Socket error on the client side | Cannot connect to LSF. Please wait… |
| Establishing a connection with mbatchd | connect() system call fails | Cannot connect to LSF. Please wait… |
| Establishing a connection with mbatchd | Internal library error | Cannot connect to LSF. Please wait… |
| Send/receive handshake message to/from mbatchd | mbatchd is busy. Client times out when waiting to receive a message from mbatchd. | LSF is processing your request. Please wait… |
| Send/receive handshake message to/from mbatchd | Socket read()/write() fails | Cannot connect to LSF. Please wait… |
| Send/receive handshake message to/from mbatchd | Internal library error | Cannot connect to LSF. Please wait… |

EGO command messages

You cannot run the egosh command because the administrator has chosen not to enable EGO in lsf.conf: LSF_ENABLE_EGO=N.

If EGO is disabled, the egosh command cannot find the ego.conf file or cannot contact vemkd because vemkd is not started.