Release date: July 2014
Last modified: 18 July 2014
Information about IBM Platform LSF (Platform LSF or LSF) is available from the following sources:
Platform LSF documentation is available through a variety of channels and in a variety of formats.
The IBM Knowledge Center is the home for IBM product documentation. Find Platform LSF documentation in the IBM Knowledge Center on the IBM Web site: www.ibm.com/support/knowledgecenter/SSETD4/.
Search all the content in IBM Knowledge Center for subjects that interest you, or search within a product, or restrict your search to one version of a product. Sign in with your IBM ID to take full advantage of the personalization features available in IBM Knowledge Center. Create and print custom collections of documents you use regularly, and communicate with colleagues and IBM by adding comments to topics.
Documentation available through the IBM Knowledge Center may be updated and regenerated following the original release of Platform LSF 9.1.3.
You can download, extract and install these packages to any server on your system to have a local version of the full LSF documentation set. Navigate to the location where you extracted the files and open index.html in any browser. Easy access to each document in PDF and HTML format is provided, as well as full search capabilities within the full documentation set or within a specific document type.
If you have installed IBM Platform Application Center (PAC), you can access and search the LSF documentation through the Help link in the user interface.
Platform LSF documentation is also available in PDF format on the IBM Publications Center: www.ibm.com/e-business/linkweb/publications/servlet/pbi.wss.
LSF documentation in PDF format is also available for Version 9.1.2 and earlier releases on the IBM Support Portal: http://www.ibm.com/support/customercare/sas/f/plcomp/platformlsf.html.
Connect. Learn. Share. Collaborate and network with the IBM Platform Computing experts at the IBM Technical Computing community. Access the Technical Computing community on IBM Service Management Connect at http://www.ibm.com/developerworks/servicemanagement/tc/. Join today!
Contact IBM or your LSF vendor for technical support.
Or go to the IBM Support Portal: www.ibm.com/support
If you find an error in any Platform Computing documentation, or you have a suggestion for improving it, please let us know.
In the IBM Knowledge Center, add your comments and feedback to any topic.
You can also send your suggestions, comments and questions to the following email address:
Be sure to include the publication title and order number, and, if applicable, the specific location of the information about which you have comments (for example, a page number or a browser URL). When you send information to IBM, you grant IBM a nonexclusive right to use or distribute the information in any way it believes appropriate without incurring any obligation to you.
The following sections detail requirements and compatibility for version 9.1.3 of Platform LSF.
For detailed LSF system support information, refer to the Compatibility Table at:
www.ibm.com/systems/technicalcomputing/platformcomputing/products/lsf/
To achieve the highest degree of performance and scalability, use a powerful master host.
There is no minimum CPU requirement. For the platforms on which LSF is supported, any host with sufficient physical memory can run LSF as master host. Swap space is normally configured as twice the physical memory. LSF daemons use about 40 MB of memory when no jobs are running. Active jobs consume most of the memory LSF requires.
| Cluster size | Active jobs | Minimum required memory (typical) | Recommended server CPU (Intel, AMD, OpenPower or equivalent) |
| --- | --- | --- | --- |
| Small (<100 hosts) | 1,000 | 1 GB (32 GB) | any server CPU |
| | 10,000 | 2 GB (32 GB) | recent server CPU |
| Medium (100-1000 hosts) | 10,000 | 4 GB (64 GB) | multi-core CPU (2 cores) |
| | 50,000 | 8 GB (64 GB) | multi-core CPU (4 cores) |
| Large (>1000 hosts) | 50,000 | 16 GB (128 GB) | multi-core CPU (4 cores) |
| | 500,000 | 32 GB (256 GB) | multi-core CPU (8 cores) |
Platform LSF 7.x, 8.0.x, 8.3, and 9.1.x servers are compatible with Platform LSF 9.1.3 master hosts. All LSF 7.x, 8.0.x, 8.3, and 9.1.x features are supported by Platform LSF 9.1.3 master hosts.
Customers can use IBM Platform RTM (Platform RTM) 8.3 or 9.1.x to collect data from Platform LSF 9.1.3 clusters. When adding the cluster, select 'Poller for LSF 8' or 'Poller for LSF 9.1'.
IBM Platform License Scheduler (License Scheduler) 8.3 and 9.1.x are compatible with Platform LSF 9.1.3.
IBM Platform Analytics (Analytics) 8.3 and 9.1.x are compatible with Platform LSF 9.1.3 after the following manual configuration:
cp ANALYTICS_TOP/elim/os_type/elim.coreutil $LSF_SERVERDIR
Begin Resource
RESOURCENAME TYPE INTERVAL INCREASING DESCRIPTION
CORE_UTIL String 300 () (Core Utilization)
End Resource
Begin ResourceMap
RESOURCENAME LOCATION
CORE_UTIL [default]
End ResourceMap
IBM Platform Application Center (PAC) 8.3 and higher versions are compatible with Platform LSF 9.1.x after the following manual configuration.
If you are using PAC 8.3 with LSF 9.1.x, $PAC_TOP/perf/lsf/8.3 must be renamed to $PAC_TOP/perf/lsf/9.1
For example:
mv /opt/pac/perf/lsf/8.3 /opt/pac/perf/lsf/9.1
To take full advantage of new Platform LSF 9.1.3 features, recompile your existing Platform LSF applications with Platform LSF 9.1.3.
Applications need to be rebuilt if they use APIs that have changed in Platform LSF 9.1.3.
For detailed information about APIs changed or created for LSF 9.1.3, refer to the IBM Platform LSF 9.1.3 API Reference.
Packages are available at www.github.com.
For more information on using third party APIs with LSF 9.1.3 see the Technical Computing community on IBM Service Management Connect at www.ibm.com/developerworks/servicemanagement/tc/plsf/index.html.
Consult the following note on installing and migrating from a previous version of LSF.
To migrate an existing LSF 7 Windows cluster to Platform LSF 9.1.3 on Windows, follow the steps in Migrating IBM Platform LSF Version 7 to IBM Platform LSF Version 9.1.3 on Windows.
LSF Express Edition is a solution for Linux customers with simple scheduling requirements and simple fairshare setup. Smaller clusters typically have a mix of sequential and parallel work as opposed to huge volumes of jobs. For this reason, several performance enhancements and complex scheduling policies designed for large-scale clusters are not applicable to LSF Express Edition clusters. Session Scheduler is available as an add-on component.
The following IBM Platform products are supported in LSF Express Edition:
The following IBM Platform products are not supported in LSF Express Edition:
The following table lists the configuration enforced in LSF Express Edition:
| Parameter | Setting | Description |
| --- | --- | --- |
| RESIZABLE_JOBS in lsb.applications | N | If enabled, all jobs belonging to the application will be auto resizable. |
| EXIT_RATE in lsb.hosts | Not defined | Specifies a threshold for exited jobs. |
| BJOBS_RES_REQ_DISPLAY in lsb.params | None | Controls how many levels of resource requirements bjobs -l will display. |
| CONDENSE_PENDING_REASONS in lsb.params | N | Condenses all host-based pending reasons into one generic pending reason. |
| DEFAULT_JOBGROUP in lsb.params | Disabled | The name of the default job group. |
| EADMIN_TRIGGER_DURATION in lsb.params | 1 minute | Defines how often LSF_SERVERDIR/eadmin is invoked once a job exception is detected. Used in conjunction with the job exception handling parameters JOB_IDLE, JOB_OVERRUN, and JOB_UNDERRUN in lsb.queues. |
| ENABLE_DEFAULT_EGO_SLA in lsb.params | Not defined | The name of the default service class or EGO consumer name for EGO-enabled SLA scheduling. |
| EVALUATE_JOB_DEPENDENCY in lsb.params | Unlimited | Sets the maximum number of job dependencies mbatchd evaluates in one scheduling cycle. |
| GLOBAL_EXIT_RATE in lsb.params | 2147483647 | Specifies a cluster-wide threshold for exited jobs. |
| JOB_POSITION_CONTROL_BY_ADMIN in lsb.params | Disabled | Allows LSF administrators to control whether users can use btop and bbot to move jobs to the top and bottom of queues. |
| LSB_SYNC_HOST_STAT_FROM_LIM in lsb.params | N | Improves the speed with which mbatchd obtains host status, and therefore the speed with which LSF reschedules rerunnable jobs. This parameter is most useful for large clusters, so it is disabled for LSF Express Edition. |
| MAX_CONCURRENT_QUERY in lsb.params | 100 | Controls the maximum number of concurrent query commands. |
| MAX_INFO_DIRS in lsb.params | Disabled | The number of subdirectories under the LSB_SHAREDIR/cluster_name/logdir/info directory. |
| MAX_JOBID in lsb.params | 999999 | The job ID limit. The job ID limit is the highest job ID that LSF will ever assign, and also the maximum number of jobs in the system. |
| MAX_JOB_NUM in lsb.params | 1000 | The maximum number of finished jobs whose events are to be stored in lsb.events. |
| MIN_SWITCH_PERIOD in lsb.params | Disabled | The minimum period in seconds between event log switches. |
| MBD_QUERY_CPUS in lsb.params | Disabled | Specifies the master host CPUs on which mbatchd child query processes can run (hard CPU affinity). |
| NO_PREEMPT_INTERVAL in lsb.params | 0 | Prevents preemption of jobs for the specified number of minutes of uninterrupted run time, where minutes is wall-clock time, not normalized time. |
| NO_PREEMPT_RUN_TIME in lsb.params | -1 (not defined) | Prevents preemption of jobs that have been running for the specified number of minutes or the specified percentage of the estimated run time or run limit. |
| PREEMPTABLE_RESOURCES in lsb.params | Not defined | Enables preemption for resources (in addition to slots) when preemptive scheduling is enabled (has no effect if queue preemption is not enabled) and specifies the resources that will be preemptable. |
| PREEMPT_FOR in lsb.params | 0 | If preemptive scheduling is enabled, this parameter is used to disregard suspended jobs when determining if a job slot limit is exceeded, to preempt jobs with the shortest running time, and to optimize preemption of parallel jobs. |
| SCHED_METRIC_ENABLE in lsb.params | N | Enables scheduler performance metric collection. |
| SCHED_METRIC_SAMPLE_PERIOD in lsb.params | Disabled | Performance metric sampling period. |
| SCHEDULER_THREADS in lsb.params | 0 | Sets the number of threads the scheduler uses to evaluate resource requirements. |
| DISPATCH_BY_QUEUE in lsb.queues | N | Increases queue responsiveness. The scheduling decision for the specified queue is published without waiting for the whole scheduling session to finish. The scheduling decision for the jobs in the specified queue is final and these jobs cannot be preempted within the same scheduling cycle. |
| LSB_JOBID_DISP_LENGTH in lsf.conf | Not defined | By default, the LSF commands bjobs and bhist display job IDs with a maximum length of 7 characters. Job IDs greater than 9999999 are truncated on the left. When LSB_JOBID_DISP_LENGTH=10, the width of the JOBID column in bjobs and bhist increases to 10 characters. |
| LSB_FORK_JOB_REQUEST in lsf.conf | N | Improves mbatchd response time after mbatchd is restarted (including parallel restart) and has finished replaying events. |
| LSB_MAX_JOB_DISPATCH_PER_SESSION in lsf.conf | 300 | Defines the maximum number of jobs that mbatchd can dispatch during one job scheduling session. |
| LSF_PROCESS_TRACKING in lsf.conf | N | Tracks processes based on job control functions such as termination, suspension, resume, and other signaling, on Linux systems that support the cgroups freezer subsystem. |
| LSB_QUERY_ENH in lsf.conf | N | Extends multithreaded query support to batch query requests (in addition to bjobs query requests). In addition, the mbatchd system query monitoring mechanism starts automatically instead of being triggered by a query request. This ensures a consistent query response time within the system. Enables a new default setting for min_refresh_time in MBD_REFRESH_TIME (lsb.params). |
| LSB_QUERY_PORT in lsf.conf | Disabled | Increases mbatchd performance when using the bjobs command on busy clusters with many jobs and frequent query requests. |
| LSF_LINUX_CGROUP_ACCT in lsf.conf | N | Tracks processes based on CPU and memory accounting for Linux systems that support the cgroup memory and cpuacct subsystems. |
The entitlement file for the edition you use must be installed as LSF_TOP/conf/lsf.entitlement.
If you have installed LSF Express Edition, you can upgrade later to LSF Standard Edition or LSF Advanced Edition to take advantage of the additional functionality. Simply reinstall the cluster with the LSF Standard entitlement file (platform_lsf_std_entitlement.dat) or the LSF Advanced entitlement file (platform_lsf_adv_entitlement.dat).
You can also manually upgrade from LSF Express Edition to Standard Edition or Advanced Edition. Get the LSF Standard or Advanced Edition entitlement file, copy it to LSF_TOP/conf/lsf.entitlement, and restart your cluster. The new entitlement enables the additional functionality of LSF Standard Edition, but you may need to manually change some of the default LSF Express configuration parameters to use the LSF Standard or Advanced features.
To take advantage of LSF SLA features in LSF Standard Edition, copy LSF_TOP/LSF_VERSION/install/conf_tmpl/lsf_standard/lsb.serviceclasses into LSF_TOP/conf/lsbatch/LSF_CLUSTERNAME/configdir/.
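For example, a copy command with illustrative paths (substitute your own LSF_TOP, LSF version, and cluster name):

cp /usr/share/lsf/9.1/install/conf_tmpl/lsf_standard/lsb.serviceclasses /usr/share/lsf/conf/lsbatch/cluster1/configdir/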
Once LSF is installed and running, run the lsid command to see which edition of LSF is enabled.
The following topics detail new and changed behavior, new and changed commands, options, output, configuration parameters, environment variables, accounting and job event fields.
The following details changes to default LSF behavior.
To keep up with the increasing density of hosts (cores/threads per node) and the growth in threaded applications (for example, a job may request 4 slots, then run 4 threads per slot, so in reality it is using more than 4 cores) there is greater disparity between what a user requests and what needs to be allocated to satisfy the request. This is particularly true in HPC environments where exclusive allocation of nodes is more prevalent.
For consistency, the “slot” concept in LSF has been superseded by “task”. In the first example above, a job running 4 processes each with 4 threads is 16 tasks, and with one task per core, it requires 16 cores to run.
A new parameter, LSB_ENABLE_HPC_ALLOCATION in lsf.conf is introduced. For new installations, this parameter will be enabled automatically (set to Y). For upgrades, it will be set to N and must be enabled manually.
When set to Y|y, this parameter changes the concept of the required number of slots for a job to the required number of tasks for a job. The specified numbers of tasks (using bsub), will be the number of tasks to launch on execution hosts. The allocated slots will change to all slots on the allocated execution hosts for an exclusive job in order to reflect the actual slot allocation.
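For example, a minimal sketch of enabling the new behavior after an upgrade (the job command and task count are illustrative):

# lsf.conf: treat the bsub -n value as the number of tasks, not slots
LSB_ENABLE_HPC_ALLOCATION=Y

# Submit an exclusive parallel job requesting 16 tasks; the allocated slots
# become all slots on the allocated execution hosts
bsub -x -n 16 ./my_mpi_app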
The following details new and changed behavior for LSF 9.1.3.
Specifying a list of allowed job sizes (number of tasks) in queues or application profiles enables LSF to check the requested job sizes when submitting, modifying, or switching jobs.
Certain applications may yield better performance with specific job sizes (for example, the power of two, so that the job sizes are x^2). The JOB_SIZE_LIST parameter in lsb.queues or lsb.applications defines a discrete list of allowed job sizes for the specified queues or application profiles. LSF will reject jobs requesting job sizes that are not in this list, or jobs requesting multiple job sizes.
The first job size in the JOB_SIZE_LIST is the default job size, which is assigned to jobs that do not explicitly request a job size. The rest of the list can be defined in any order:
JOB_SIZE_LIST=default_size [size ...]
For example, the following defines a job size list for the queue1 queue:
Begin Queue
QUEUE_NAME = queue1
...
JOB_SIZE_LIST=4 2 8 16
...
End Queue
This job size list allows 2, 4, 8, and 16 tasks. If you submit a parallel job requesting 10 tasks in this queue (bsub -q queue1 -n 10 ...), that job is rejected because the job size of 10 is not explicitly allowed in the list. The default job size is 4 tasks, and job submissions that do not request a job size are automatically assigned a job size of 4.
When using resource requirements to specify job size, the request must specify a single fixed job size, not multiple values or a range of values.
When defined in both a queue (lsb.queues) and an application profile (lsb.applications), the job size request must satisfy both requirements. In addition, JOB_SIZE_LIST overrides any TASKLIMIT (TASKLIMIT replaces PROCLIMIT in LSF 9.1.3) parameters defined at the same level.
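For instance, a hedged sketch of an application profile in lsb.applications (the profile name and values are illustrative):

Begin Application
NAME = app1
TASKLIMIT = 32
JOB_SIZE_LIST = 8 4 16
End Application

Here JOB_SIZE_LIST overrides the TASKLIMIT defined in the same profile: bsub -app app1 -n 12 is rejected because 12 is not in the list, while a submission with no requested job size is assigned the default size of 8.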
Often, complex workflows are required with job dependencies for proper job sequencing as well as job failure handling. For a given job, called the parent job, there can be child jobs which depend on its state before they can start. If one or more conditions are not satisfied, a child job remains pending. However, if the parent job is in a state such that the event on which the child depends will never occur, the child becomes an orphan job. For example, if a child job has a DONE dependency on the parent job but the parent ends abnormally, the child will never run as a result of the parent’s completion and it becomes an orphan job.
Keeping orphan jobs in the system can cause performance degradation. The pending orphan jobs consume unnecessary system resources and add unnecessary loads to the daemons which can impact their ability to do useful work.
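As a sketch using the ORPHAN_JOB_TERM_GRACE_PERIOD parameter described later in these notes (the value is illustrative), automatic orphan job termination can be enabled cluster-wide in lsb.params:

# lsb.params: terminate orphan jobs 60 seconds after they are identified
ORPHAN_JOB_TERM_GRACE_PERIOD = 60

With this setting, a child job submitted with bsub -w "done(parent_job_id)" whose parent ends abnormally can be terminated automatically after the grace period instead of pending forever.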
When submitting a job, you can point the job to a file that specifies hosts and number of slots for job processing.
For example, some applications (typically when benchmarking) run best with a very specific geometry. For repeatability (again, typically when benchmarking) you may want it to always run it on the same hosts, using the same number of slots.
The user-specified host file specifies a host and number of slots to use per task, resulting in a rank file.
The -hostfile option allows a user to submit a job, specifying the path of the user-specified host file:
bsub -hostfile "spec_host_file"
Any user can create a user-specified host file. It must be accessible by the user from the submission host. It lists one host per line. The format is as follows:
# This is a user-specified host file
<host_name1> [<# slots>]
<host_name2> [<# slots>]
<host_name1> [<# slots>]
<host_name2> [<# slots>]
<host_name3> [<# slots>]
<host_name4> [<# slots>]
#first three tasks
host01 3
#fourth task
host02
#next three tasks
host03 3
The resulting rank file is made available to other applications (such as MPI).
The LSB_DJOB_RANKFILE environment variable is generated from the user-specified host file. If a job is not submitted with a user-specified host file then LSB_DJOB_RANKFILE points to the same file as LSB_DJOB_HOSTFILE.
An esub can read and modify the value of the -hostfile option through the LSB_SUB4_HOST_FILE parameter.
Use bsub -hostfile (or bmod -hostfile for a pending job) to enter the location of a user-specified host file containing a list of hosts and slots on those hosts. The job will dispatch on the specified allocation once those resources become available.
Use bmod -hostfilen to remove the hostfile option from a job.
bjobs -l and bhist -l show the host allocation for a given job.
Use -hostfile together with -l or -UF to view the user-specified host file content as well.
The following are restrictions on the usage of the -hostfile option:
The new parameter LSB_MEMLIMIT_ENF_CONTROL in lsf.conf further refines the behavior of enforcing a job memory limit for a host. If, at execution time, one or more jobs reach a specified memory limit for the host (both the host memory and swap utilization have reached a configurable threshold), the worst offending job on the host is killed. A job is selected as the worst offending job on that host if it has the greatest memory overuse (actual memory rusage minus the memory limit of the job).
You also have the choice of killing all jobs exceeding the thresholds (not just the worst).
For a description of usage and restrictions on this parameter, see LSB_MEMLIMIT_ENF_CONTROL.
LSF can now impose strict job-level host-based memory and swap limits on systems that support Linux cgroups. When LSB_RESOURCE_ENFORCE="memory" is set, memory and swap limits are calculated and enforced as a multiple of the number of tasks running on the execution host when memory and swap limits are specified for the job (at the job-level with -M and -v, or in lsb.queues or lsb.applications with MEMLIMIT and SWAPLIMIT).
The new bsub -hl option enables job-level (irrespective of the number of tasks) host-based memory and swap limit enforcement regardless of the number of tasks running on the execution host. LSB_RESOURCE_ENFORCE="memory" must be specified in lsf.conf for host-level memory and swap limit enforcement with the -hl option to take effect. If no memory or swap limit is specified for the job (the merged limit for the job, queue, and application profile, if specified), or LSB_RESOURCE_ENFORCE="memory" is not specified, a host-based memory limit is not set for the job. The -hl option only applies to memory and swap limits; it does not apply to any other resource usage limits.
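For example, a minimal sketch (the limit value and job command are illustrative):

# lsf.conf: enable memory and swap limit enforcement through Linux cgroups
LSB_RESOURCE_ENFORCE="memory"

# Enforce the memory limit per job per host, regardless of task count
bsub -hl -M 2048 ./my_job

Without -hl, the same limit would be enforced as a multiple of the number of tasks running on each execution host.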
See Administering IBM Platform LSF for more information about memory and swap resource usage limits, and memory enforcement based on Linux cgroup memory subsystem.
When a job’s pre-execution fails, the job will be requeued and tried again. When the pre-exec has failed a defined number of times (LOCAL_MAX_PREEXEC_RETRY in lsb.params, lsb.queues, or lsb.applications) LSF suspends the job and places it in the PSUSP state. If this is a common occurrence, a large number of PSUSP jobs can quickly fill the system, leading to both usability issues and system degradation.
In this release, a pre-execution retry threshold is introduced so that a job exits once the pre-execution has failed a specified number of times. You can set LOCAL_MAX_PREEXEC_RETRY_ACTION cluster-wide in lsb.params, at the queue level in lsb.queues, or at the application level in lsb.applications. The behavior specified in lsb.applications overrides lsb.queues, and lsb.queues overrides the lsb.params configuration.
Set LOCAL_MAX_PREEXEC_RETRY_ACTION=EXIT to have the job exit and have LSF set its status to EXIT. The job exits with the same exit code as the last pre-execution failure.
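For example, a hedged sketch of a queue-level configuration in lsb.queues (the queue name and retry count are illustrative):

Begin Queue
QUEUE_NAME = normal
LOCAL_MAX_PREEXEC_RETRY = 3
LOCAL_MAX_PREEXEC_RETRY_ACTION = EXIT
End Queue

After three failed pre-execution attempts, the job exits with the exit code of the last pre-execution failure instead of being held in the PSUSP state.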
In the MultiCluster job forwarding model, the local cluster now considers the application profile or receive queue's TASKLIMIT setting on remote clusters before forwarding jobs. This reduces the number of forwarded jobs that stay pending before returning to the submission cluster due to the remote cluster's TASKLIMIT settings being unable to satisfy the job's task requirements. By considering the TASKLIMIT settings in the remote clusters, jobs are no longer forwarded to remote clusters that cannot run these jobs due to task requirements.
If the receive queue's TASKLIMIT definition in the remote cluster cannot satisfy the job's task requirements, the job is not forwarded to that remote queue. Likewise, if the application profile's TASKLIMIT definition in the remote cluster cannot satisfy the job's task requirements, the job is not forwarded to that cluster.
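As an illustration (the queue definition is a hedged sketch), if a receive queue on a remote cluster defines the following in lsb.queues:

Begin Queue
QUEUE_NAME = recv_q
TASKLIMIT = 4 16
End Queue

a job submitted with bsub -n 32 is no longer forwarded to that queue, because the queue's 16-task maximum cannot satisfy the job's task requirement.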
Advance reservation requests can be made on a unit of hosts by specifying the host requirements such as the number of hosts, the candidate host list, and/or the resource requirement for the candidate hosts. LSF creates the host-based advance reservation based on these requirements. Each reserved host is reserved in its entirety and cannot be reserved again nor can it be used by other jobs outside the advance reservation during the time it is dedicated to the advance reservation. If MXJ (in lsb.hosts) is undefined for a host, a host-based reservation reserves all CPUs on that host.
The command option -unit is introduced to brsvadd to indicate either slot or host for the advance reservation:
brsvadd -unit [slot | host]
If -unit is not specified for brsvadd, the advance reservation request will use the slot unit by default.
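For example, a hedged sketch (the host names, user, and time window are illustrative):

# Reserve two entire hosts for user1 between 6:00 and 18:00
brsvadd -unit host -n 2 -m "hostA hostB" -u user1 -b 6:0 -e 18:0

Each reserved host is dedicated in its entirety to the reservation for the duration of the time window.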
With either slot-based or host-based advance reservation, the request must specify the following:
The commands brsvmod addhost and brsvmod rmhost expand to include both slots or hosts, depending on the unit originally specified for the advance reservation through the command brsvadd -unit.
An advance reservation request may specify a list of user and user group names. Each user or user group specified may run jobs for that advance reservation. Multiple users or user groups can be specified for an advance reservation using the brsvmod command:
brsvmod -u "user_name | user_group" replaces an advance reservation’s list of users and user groups.
If the advance reservation was created with the -g option, brsvmod cannot switch the advance reservation type from group to user. In this case, brsvmod -u can be used to replace the entire list of users and user groups.
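For example (the reservation ID and names are illustrative):

# Replace the reservation's user list with user2 and the group ugroup1
brsvmod -u "user2 ugroup1" user1#0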
The Resource Unit (Slot or Host) specified for an advance reservation (with the -l option).
When using bsub and tssub to submit jobs, you can use the -env option to control the propagation of job submission environment variables to the execution hosts:
-env "none" | "all [, ~var_name[, ~var_name] ...] [, var_name=var_value[, var_name=var_value] ...]" | "var_name[=var_value][, var_name[=var_value] ...]"Specify a comma-separated list of environment variables. Controls the propagation of the specified job submission environment variables to the execution hosts.
For example, -env "all, var1=value1, var2=value2" submits jobs with all the environment variables, but with the specified values for the var1 and var2 environment variables.
The environment variable names cannot be "none" or "all".
The environment variable names cannot contain the following symbols: comma (,), "~", "=", double quotation mark (") and single quotation mark (').
The variable value can contain a comma (,) and "~", but if it contains a comma, you must enclose the variable value in single quotation marks.
An esub can change the -env environment variables by writing them to the file specified by the LSB_SUB_MODIFY_FILE environment variable. If the LSB_SUB_MODIFY_ENVFILE environment variable is also specified and the file specified by this environment variable contains the same environment variables, the environment variables in LSB_SUB_MODIFY_FILE take effect.
When -env is not specified with bsub, the default value is -env "all" (that is, all environment variables are submitted with the default values).
The entire argument for the -env option may contain a maximum of 4094 characters for UNIX and Linux, or up to 255 characters for Windows.
If -env conflicts with -L, the value of -L takes effect.
The following environment variables are not propagated to execution hosts because they are only used in the submission host and are not used in the execution hosts:
The following environment variables do not take effect on the execution hosts: LSB_DEFAULTPROJECT, LSB_DEFAULT_JOBGROUP, LSB_TSJOB_ENVNAME, LSB_TSJOB_PASSWD, LSF_DISPLAY_ALL_TSC, LSF_JOB_SECURITY_LABEL, LSB_DEFAULT_USERGROUP, LSB_DEFAULT_RESREQ, LSB_DEFAULTQUEUE, BSUB_CHK_RESREQ, LSB_UNIXGROUP, LSB_JOB_CWD
When submitting jobs with specified input, output, and error file names (using bsub -i, -is, -o, -oo, -e, and -eo options), you can use the special characters %J and %I in the name of the files. %J is replaced by the job ID. %I is replaced by the index of the job in the array, if the job is a member of an array, or by 0 (zero) if the job is not a member of an array. When viewing job information, bjobs -o, -l, or -UF now replaces %J with the job ID and %I with the array index when displaying job file names. Previously, bjobs -o, -l, or -UF displayed these file names with %J and %I without resolving the job ID and array index values.
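For example (the array name and file names are illustrative):

# Element 3 of job array 12345 writes its output to out.12345.3
bsub -J "myArray[1-10]" -o "out.%J.%I" ./my_task

bjobs -l on such a job now displays the resolved file name (for example, out.12345.3) rather than the literal out.%J.%I.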
The documentation and online help for bjobs and bsub are now reorganized and expanded. The bjobs and bsub command options are grouped into categories, which describe the general goal or function of each command option.
In the IBM Platform LSF Command Reference documentation, the bjobs and bsub sections now list the categories, followed by the options, listed in alphabetical order. Each option lists the categories to which it belongs and includes a detailed synopsis of the command. Any conflicts that the option has with other options are also listed (that is, options that cannot be used together).
The online help in the command line for bjobs and bsub is organized by categories and allows you to view help topics for specific options in addition to viewing the entire man page for the command. To view the online help, run bjobs or bsub with the -h (or -help) option. This provides a brief description of the command and lists the categories and options that belong to the command. To view a brief description of all options, run -h all (or -help all). To view more details on the command, run -h description (or -help description). To view more information on the categories and options (in increasing detail), run -h (or -help) with the name of the category or the option:
bjobs -h[elp] [all] [description] [category_name ...] [-option_name ...]
bsub -h[elp] [all] [description] [category_name ...] [-option_name ...]
If you list multiple categories and options, the online help displays each entry in the order in which you specified the categories and options.
For example (the option shown is illustrative of the syntax):
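# Brief description of every bjobs option
bjobs -h all

# Detailed help for the bsub -m option
bsub -h -m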
Gold v2.2 (or newer) is supported on Linux and UNIX. Complete the steps in LSF_INSTALLDIR/9.1/misc/examples/gold/readme.txt to install and configure the Gold integration in LSF.
The following command options and output are new or changed for LSF 9.1.3.
Release allocation on <num_hosts> Hosts/Processors <host_list> by user or
administrator <user_name>
Resize notification accepted;
bacct -l -aff 6
Accounting information about jobs that are:
- submitted by all users.
- accounted on all projects.
- completed normally or exited
- executed on all hosts.
- submitted to all queues.
- accounted on all service classes.
------------------------------------------------------------------------------
Job <6>, User <user1>, Project <default>, Status <DONE>, Queue <normal>, Comma
nd <myjob>
Thu Feb 14 14:13:46: Submitted from host <hostA>, CWD <$HOME>;
Thu Feb 14 14:15:07: Dispatched <num_tasks> Task(s) on Host(s) <host_list>,
Allocated <num_slots> Slot(s) on Host(s) <host_list>;
Effective RES_REQ <select[type == local] order[r15s:pg]
rusage[mem=100.00] span[hosts=1] affinity[core(1,same=
socket,exclusive=(socket,injob))*1:cpubind=socket:membind
=localonly:distribute=pack] >;
Thu Feb 14 14:16:47: Completed <done>.
Added <num_tasks> tasks on host <host_list>, <num_slots> additional slots
allocated on <host_list>
Release allocation on <num_hosts> Hosts/Processors <host_list> by user or
administrator <user_name>
Resize notification accepted;
bhist -l 749
Job <749>, User <user1>, Project <default>, Command <my_pe_job>
Mon Jun 4 04:36:12: Submitted from host <hostB>, to Queue <priority>,
CWD <$HOME>, 2 Task(s), Requested
Network <type=sn_all:protocol=mpi:mode=US:usage=
shared:instance=1>
Mon Jun 4 04:36:15: Dispatched <num_tasks> Task(s) on Host(s) <host_list>,
Allocated <num_slots> Slot(s) on Host(s) <host_list>;
Effective RES_REQ <select[type == local] rusage[nt1=1.00] >,
PE Network ID <1111111> <2222222> used <1> window(s)
Mon Jun 4 04:36:17: Starting (Pid 21006);
Behavior change for bjobs -l: the predicted start time for a PEND reserve job is no longer shown with bjobs -l. LSF does not calculate a predicted start time for a PEND reserve job if no backfill queue is configured in the system. In that case, resource reservation for PEND jobs works as normal, and no predicted start time is calculated.
Run bjobs -h (or bjobs -help) without a command option or category name to display the bjobs command description.
bjobs -l 6
Job <6>, User <user1>, Project <default>, Status <RUN>, Queue <normal>, Comman
d <myjob1>
Thu Feb 14 14:13:46: Submitted from host <hostA>, CWD <$HOME>, 6 Tasks;
Thu Feb 14 14:15:07: Started 6 Task(s) on Host(s) <hostA> <hostA> <hostA> <hostA>
<hostA> <hostA>, Allocated 6 Slots on Hosts <hostA>
<hostA> <hostA> <hostA> <hostA> <hostA>, Execution Home
</home/user1>, Execution CWD </home/user1>;
bmod -hostfile "host_alloc_file" <job_id>
bmod -hostfilen <job_id>
The new -unit [slot | host] option specifies whether an advance reservation is for a number of slots or hosts. If -unit is not specified for brsvadd, the advance reservation request uses the slot unit by default.
The -u "user_name... | user_group ..." option has been changed so that it replaces the list of users or groups who are able to submit jobs to a reservation.
adduser -u "user_name ... | user_group ..." reservation_ID
rmuser -u "user_name ... | user_group ..." reservation_ID
With the -l option, the Resource Unit (Slot or Host) specified for an advance reservation is displayed.
Behavior change for bslots: LSF does not calculate predicted start times for PEND reserve jobs if no backfill queue is configured in the system. In that case, the resource reservation for PEND jobs works as normal, but no predicted start time is calculated, and bslots does not show the backfill window.
bsub -hostfile "host_alloc_file" ./a.out
LSB_RESOURCE_ENFORCE="memory" must be specified in lsf.conf for host-based memory and swap limit enforcement with the -hl option to take effect. If no memory or swap limit is specified for the job (the merged limit for the job, queue, and application profile, if specified), or LSB_RESOURCE_ENFORCE="memory" is not specified, a host-based memory limit is not set for the job.
When LSB_RESOURCE_ENFORCE="memory" is configured in lsf.conf, and memory and swap limits are specified for the job, but -hl is not specified, memory and swap limits are calculated and enforced as a multiple of the number of tasks running on the execution host.
Submits a parallel job and specifies the number of tasks in the job. The number of tasks is used to allocate a number of slots for the job. Usually, the number of slots assigned to a job will equal the number of tasks specified. For example, one task will be allocated with one slot. (Some slots/processors may be on the same multiprocessor host).
Run bsub -h (or bsub -help) without a command option or category name to display the bsub command description.
The following configuration parameters and environment variables are new or changed for LSF 9.1.3.
JOB_SIZE_LIST=default_size [size ...]
LOCAL_MAX_PREEXEC_RETRY_ACTION=SUSPEND | EXIT
ORPHAN_JOB_TERM_GRACE_PERIOD = 0: Automatic orphan job termination is enabled in the cluster but no termination grace period is defined. A dependent job can be terminated as soon as it is found to be an orphan.
ORPHAN_JOB_TERM_GRACE_PERIOD > 0: Automatic orphan job termination is enabled and the termination grace period is set to the specified number of seconds. This is the minimum time LSF will wait before terminating an orphan job. In a multi-level job dependency tree, the grace period is not repeated at each level, and all direct and indirect orphans of the parent job can be terminated by LSF automatically after the grace period has expired.
ORPHAN_JOB_TERM_GRACE_PERIOD=seconds
LOCAL_MAX_PREEXEC_RETRY_ACTION=SUSPEND | EXIT
After changing this parameter, running jobs using the allocation may be re-queued.
EGO_RESOURCE_GROUP=mygroup1 mygroup4 mygroup5
JOB_SIZE_LIST=default_size [size ...]
When MEMLIMIT is defined and the job is submitted with -hl, memory limits are enforced on systems that support Linux cgroups on a per-job and per-host basis, regardless of the number of tasks running on the execution host. LSB_RESOURCE_ENFORCE="memory" must be specified in lsf.conf for host-based memory limit enforcement with the -hl option to take effect.
LOCAL_MAX_PREEXEC_RETRY_ACTION=SUSPEND | EXIT
For new installations of LSF, LSB_ENABLE_HPC_ALLOCATION is set to Y automatically.
LSB_ENABLE_HPC_ALLOCATION=Y|y|N|n
LSB_MEMLIMIT_ENF_CONTROL=<Memory Threshold>:<Swap Threshold>:<Check Interval>:[all]
The following describes usage and restrictions on this parameter.
<Memory Threshold>: (Used memory size/maximum memory size)
A threshold indicating the maximum limit for the ratio of used memory size to maximum memory size on the host. The threshold represents a percentage and must be an integer between 1 and 100.
<Swap Threshold>: (Used swap size/maximum swap size)
A threshold indicating the maximum limit for the ratio of used swap memory size to maximum swap memory size on the host. The threshold represents a percentage and must be an integer between 1 and 100.
<Check Interval>: The time, in seconds, between two consecutive checks of host memory and swap usage. The value must be an integer greater than or equal to the value of SBD_SLEEP_TIME.
The keyword :all can be used to terminate all single host jobs that exceed the memory limit when the host threshold is reached. If not used, only the worst offending job is killed.
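For example, a hedged sketch (the threshold and interval values are illustrative):

# Kill all over-limit single-host jobs once host memory usage reaches 90%
# and swap usage reaches 80%, checking at most once every 60 seconds
LSB_MEMLIMIT_ENF_CONTROL=90:80:60:all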
If the cgroup memory enforcement feature is enabled (LSB_RESOURCE_ENFORCE includes the keyword "memory"), LSB_MEMLIMIT_ENF_CONTROL is ignored.
The host will be considered to reach the threshold when both Memory Threshold and Swap Threshold are reached.
LSB_MEMLIMIT_ENF_CONTROL does not have any effect on jobs running across multiple hosts. They will be terminated if they are over the memory limit regardless of usage on the execution host.
On some operating systems, when the used memory equals the total memory, the OS may kill some processes. In this case, a job exceeding the memory limit may be killed by the OS, not by an LSF memory enforcement policy.
In this case, the exit reason of the job will indicate “killed by external signal”.
LSB_DJOB_RANKFILE=file_path
LSB_PROJECT_NAME=project_name
The following job event fields are added or changed for LSF 9.1.3.
egosh: error while loading shared libraries: libstdc++.so.6:
cannot open shared object file: No such file or directory
After the 9.1.1 release of LSF, logic was introduced to handle the case where tasks exit slowly on other execution nodes when LSF crashes on the first execution node. The LSF_RES_ALIVE_TIMEOUT parameter controls whether those tasks exit directly on nodes other than the first node. LSF res reports task usage to the first node and waits for the first node to reply. If the wait exceeds the LSF_RES_ALIVE_TIMEOUT setting, LSF res on an execution node other than the first node concludes that the LSF daemons on the first node have crashed, and exits directly.
If LSF daemons on the first execution node are version 9.1.1, they do not include the LSF_RES_ALIVE_TIMEOUT parameter. Therefore, if 9.1.3 is on a subsequent execution node, it cannot always receive a reply. If LSF daemons on the first execution node detect that some tasks exited, they also exit and the entire job fails to run.
Solution: To run a parallel job in a mixed LSF 9.1.1 and 9.1.3 environment, set LSF_RES_ALIVE_TIMEOUT=0 in job environment variables when submitting the job. The logic will be disabled.
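For example, one way to set the variable in the job environment is the -env option described earlier in these notes (the job command and task count are illustrative):

bsub -env "all, LSF_RES_ALIVE_TIMEOUT=0" -n 32 ./my_parallel_job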
This is a known issue in RHEL NFSv4 (see https://access.redhat.com/site/solutions/130783).
On an NFS client with NFSv4 mount, an error may occur when attempting to chown a file in the mount directory: chown: changing ownership of `filename': Invalid argument
To work around the issue, disable NFSv4 ID mapping (nfs.nfs4_disable_idmapping=1) on the client:

echo "options nfs nfs4_disable_idmapping=1" > /etc/modprobe.d/99-nfs.conf
echo 1 > /sys/module/nfs/parameters/nfs4_disable_idmapping

and remount the NFSv4 entry point.
Also, when a Compute Unit member is a host group, that host group cannot contain a wildcard. If you configure this case, LSF logs a warning and ignores the Compute Unit.
This occurs when /tmp/PNSD is deleted. In this case, nrt_command() leaves an open socket. This is a PNSD problem and occurs when PE integration is enabled but the node does not have PE installed or configured.
The July 2014 release (LSF 9.1.3) contains all bugs fixed before 30 May 2014. Bugs fixed between 8 October 2013 and 30 May 2014 are listed in the document Fixed Bugs for Platform LSF 9.1.3.
Fixed bugs list documents are available on Platform LSF’s IBM Service Management Connect at www.ibm.com/developerworks/servicemanagement/tc/plsf/index.html. Search for the specific Fixed bugs list document, or go to the LSF Wiki page.
| Operating system | Product package |
| --- | --- |
| IBM AIX 6 and 7 on IBM Power 6, 7, and 8 | lsf9.1.3_aix-64.tar.Z |
| HP UX B.11.31 on PA-RISC | lsf9.1.3_hppa11i-64.tar.Z |
| HP UX B.11.31 on IA64 | lsf9.1.3_hpuxia64.tar.Z |
| Solaris 10 and 11 on Sparc | lsf9.1.3_sparc-sol10-64.tar.Z |
| Solaris 10 and 11 on x86-64 | lsf9.1.3_x86-64-sol10.tar.Z |
| Linux on x86-64 Kernel 2.6 and 3.x | lsf9.1.3_linux2.6-glibc2.3-x86_64.tar.Z |
| Linux on IBM Power 6, 7, and 8 Kernel 2.6 and 3.x | lsf9.1.3_linux2.6-glibc2.3-ppc64.tar.Z |
| Windows 2003/2008/7/8/8.1 32-bit | lsf9.1.3_win32.msi |
| Windows 2003/2008/7/8.1/HPC Server 2008/2012 64-bit | lsf9.1.3_win-x64.msi |
| Apple Mac OS 10.x | lsf9.1.3_macosx.tar.Z |
| Cray Linux XE6, XT6, XC-30 | lsf9.1.3_lnx26-lib23-x64-cray.tar.Z |
| ARMv8 Kernel 3.12 glibc 2.17 | lsf9.1.3_lnx312-lib217-armv8.tar.Z |
| ARMv7 Kernel 3.6 glibc 2.15 | lsf9.1.3_lnx36-lib215-armv7.tar.Z |
This is the standard installer package. Use this package in a heterogeneous cluster with a mix of systems other than x86-64 (except zLinux). Requires approximately 1 GB free space.
Use this smaller installer package in a homogeneous x86-64 cluster. If you add other non x86-64 hosts you must use the standard installer package. Requires approximately 100 MB free space.
The same installer packages are used for LSF Express Edition, LSF Standard Edition, and LSF Advanced Edition.
Download the LSF installer package, product distribution packages, and documentation packages from IBM Passport Advantage:
www.ibm.com/software/howtobuy/passportadvantage.