Duplicate logging of event logs

To recover from server failures, host reboots, or mbatchd restarts, LSF uses information that is stored in lsb.events. To improve the reliability of LSF, you can configure LSF to maintain a copy of lsb.events to use as a backup.

If the host that contains the primary copy of the logs fails, LSF will continue to operate using the duplicate logs. When the host recovers, LSF uses the duplicate logs to update the primary copies.

How duplicate logging works

By default, the event log is located in LSB_SHAREDIR. Typically, LSB_SHAREDIR resides on a reliable file server that also contains other critical applications necessary for running jobs, so if that host becomes unavailable, the subsequent failure of LSF is a secondary issue. LSB_SHAREDIR must be accessible from all potential LSF master hosts.

When you configure duplicate logging, the duplicates are kept on the file server, and the primary event logs are stored on the first master host. In other words, LSB_LOCALDIR is used to store the primary copy of the batch state information, and the contents of LSB_LOCALDIR are copied to a replica in LSB_SHAREDIR, which resides on a central file server. This has the following effects:
  • Creates backup copies of lsb.events
  • Reduces the load on the central file server
  • Increases the load on the LSF master host

Failure of file server

If the file server containing LSB_SHAREDIR goes down, LSF continues to process jobs. Client commands such as bhist, which read LSB_SHAREDIR directly, will not work.

When the file server recovers, the current log files are replicated to LSB_SHAREDIR.

Failure of first master host

If the first master host fails, the primary copies of the files (in LSB_LOCALDIR) become unavailable. Then, a new master host is selected. The new master host uses the duplicate files (in LSB_SHAREDIR) to restore its state and to log future events. There is no duplication by the second or any subsequent LSF master hosts.

When the first master host becomes available after a failure, it will update the primary copies of the files (in LSB_LOCALDIR) from the duplicates (in LSB_SHAREDIR) and continue operations as before.

If the first master host does not recover, LSF will continue to use the files in LSB_SHAREDIR, but there is no more duplication of the log files.

Simultaneous failure of both hosts

If the master host containing LSB_LOCALDIR and the file server containing LSB_SHAREDIR both fail simultaneously, LSF will be unavailable.
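
The failure cases described above can be summarized in a small lookup. The following Python sketch is only an illustrative model of the documented behavior, not LSF code; the names LoggingState and logging_state are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LoggingState:
    """Illustrative summary of duplicate-logging behavior (hypothetical type)."""
    lsf_available: bool             # can LSF keep operating?
    event_log_dir: Optional[str]    # where the active master logs events
    duplicating: bool               # is a second copy being maintained?
    shared_dir_readable: bool       # can commands such as bhist read LSB_SHAREDIR?


def logging_state(first_master_up: bool, file_server_up: bool) -> LoggingState:
    """Map host availability to the behavior described in this section."""
    if first_master_up and file_server_up:
        # Normal operation: primary in LSB_LOCALDIR, replica in LSB_SHAREDIR.
        return LoggingState(True, "LSB_LOCALDIR", True, True)
    if first_master_up:
        # File server down: jobs are still processed, but LSB_SHAREDIR
        # cannot be read until the file server recovers.
        return LoggingState(True, "LSB_LOCALDIR", False, False)
    if file_server_up:
        # First master down: a new master logs to LSB_SHAREDIR only.
        return LoggingState(True, "LSB_SHAREDIR", False, True)
    # Both hosts down: LSF is unavailable.
    return LoggingState(False, None, False, False)


if __name__ == "__main__":
    for master_up in (True, False):
        for server_up in (True, False):
            print(master_up, server_up, logging_state(master_up, server_up))
```

The model treats the two hosts as the only variables; it does not cover recovery, when the copies are brought back in sync.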

Network partitioning

We assume that network partitioning does not cause a cluster to split into two independent clusters, each simultaneously running mbatchd.

Such a split can nevertheless occur with certain network topologies and failure modes. For example, suppose that connectivity is lost between the first master, M1, and both the file server and the secondary master, M2. Both M1 and M2 then run mbatchd, with M1 logging events to LSB_LOCALDIR and M2 logging events to LSB_SHAREDIR. When connectivity is restored, the changes that M2 made to LSB_SHAREDIR are lost when M1 updates LSB_SHAREDIR from its copy in LSB_LOCALDIR.
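
The loss in this scenario can be illustrated with a toy model. The Python sketch below represents each directory as a plain list of event strings; the function name simulate_partition and the event strings are made up for illustration, and the model says nothing about how LSF stores or copies events internally.

```python
def simulate_partition():
    """Toy model of the M1/M2 split described above (not LSF code)."""
    # Before the partition, the replica matches the primary.
    local_dir = ["event 1", "event 2"]    # LSB_LOCALDIR on M1
    shared_dir = list(local_dir)          # LSB_SHAREDIR on the file server

    # During the partition, both masters run mbatchd and log independently.
    local_dir.append("event 3 (logged by M1)")
    shared_dir.append("event 3 (logged by M2)")

    # Events that exist only in LSB_SHAREDIR are the ones at risk.
    lost = [event for event in shared_dir if event not in local_dir]

    # After connectivity returns, M1 updates LSB_SHAREDIR from LSB_LOCALDIR,
    # which replaces whatever M2 wrote there.
    shared_dir = list(local_dir)
    return shared_dir, lost


if __name__ == "__main__":
    final_shared, lost_events = simulate_partition()
    print("LSB_SHAREDIR after recovery:", final_shared)
    print("Lost events:", lost_events)
```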

The archived event files are available only in LSB_LOCALDIR, so if the network is partitioned, commands such as bhist cannot access them. As a precaution, periodically copy the archived files from LSB_LOCALDIR to LSB_SHAREDIR.
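
One way to automate this precaution is a small script run periodically on the first master host. The following Python sketch is a minimal example under stated assumptions: the function name copy_archived_events is hypothetical, the two directory arguments must be the directories that actually hold the event files at your site, and archived files are assumed to be named lsb.events.n with a numeric suffix. It overwrites same-named files in the destination rather than trying to detect changes.

```python
import re
import shutil
from pathlib import Path

# Matches archived logs such as lsb.events.1, but not the live lsb.events file.
ARCHIVE_NAME = re.compile(r"^lsb\.events\.\d+$")


def copy_archived_events(local_dir, shared_dir):
    """Copy archived event logs from local_dir to shared_dir.

    Both arguments are directory paths chosen by the caller (assumptions of
    this sketch). Returns the sorted names of the files that were copied.
    """
    local_dir = Path(local_dir)
    shared_dir = Path(shared_dir)
    copied = []
    for src in local_dir.iterdir():
        if src.is_file() and ARCHIVE_NAME.match(src.name):
            # copy2 also preserves timestamps, which helps when comparing copies.
            shutil.copy2(src, shared_dir / src.name)
            copied.append(src.name)
    return sorted(copied)


if __name__ == "__main__":
    import sys

    if len(sys.argv) != 3:
        sys.exit("usage: copy_archives.py LOCAL_EVENT_DIR SHARED_EVENT_DIR")
    for name in copy_archived_events(sys.argv[1], sys.argv[2]):
        print("copied", name)
```

The script leaves the live lsb.events file alone, because it copies only the archived files.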

Automatic archives

Archived event logs, lsb.events.n, are not replicated to LSB_SHAREDIR. If LSF starts a new event log while the file server containing LSB_SHAREDIR is down, you might notice a gap in the historical data in LSB_SHAREDIR.