Managing Your Cluster
Working with Your Cluster
Learn about LSF
View cluster information
Example directory structures
UNIX and Linux
Microsoft Windows
Add cluster administrators
Control daemons
Control mbatchd
Customize batch command messages
Reconfigure your cluster
Reconfigure the cluster with lsadmin and badmin
Reconfigure the cluster by restarting mbatchd
View configuration errors
Live reconfiguration
bconf authentication
Enable live reconfiguration
Add a host to the cluster using bconf
Create a user group using bconf
Create a limit using bconf
Update a limit using bconf
Add a user share to a fairshare queue
Add consumers to a guaranteed resource pool
View bconf records
Merge configuration files
LSF Daemon Startup Control
About LSF daemon startup control
Configuration to enable LSF daemon startup control
LSF daemon startup control behavior
Configuration to modify LSF daemon startup control
LSF daemon startup control commands
Working with Hosts
Host status
How LIM determines host models and types
View host information
Control hosts
Add a host
Remove a host
Remove a host from master candidate list
Add hosts dynamically
Configure LSF to run batch jobs on dynamic hosts
Change a dynamic host to a static host
Add a dynamic host in a shared file system environment
Add a dynamic host in a non-shared file system environment
Remove dynamic hosts
Automatically detect operating system types and versions
Add a custom host type or model
Register service ports
Host names
Hosts with multiple addresses
Use IPv6 addresses
Specify host names with condensed notation
Host groups
Configure host groups
Wildcards and special characters to define host names
Define condensed host groups
Compute units
Configure compute units
Use wildcards and special characters to define names in compute units
Define condensed compute units
Import external host groups (egroup)
Use compute units with advance reservation
Tune CPU factors
View normalized ratings
Tune CPU factors
Handle host-level job exceptions
Managing Jobs
About job states
View job information
View all jobs for all users
View job IDs
View jobs for specific users
View running jobs
View done jobs
View pending job information
View suspension reasons
View chunk job wait status and wait reason
View post-execution states
View exception status for jobs (bjobs)
View unfinished job summary information
Customize job information output
Change job order within queues
Switch jobs from one queue to another
Switch a single job to a different queue
Switch all jobs to a different queue
Force job execution
Force a pending job to run
Suspend and resume jobs
Suspend a job
Resume a job
Kill jobs
Kill a job
Kill multiple jobs
Force removal of a job from LSF
Remove hung jobs from LSF
Terminate orphan jobs
Send a signal to a job
Signals on different platforms
Send a signal to a job
Job groups
Job group limits
Create a job group
Submit jobs under a job group
View information about job groups (bjgroup)
View jobs for a specific job group (bjobs)
Control jobs in job groups
Suspend jobs (bstop)
Resume suspended jobs (bresume)
Move jobs to a different job group (bmod)
Terminate jobs (bkill)
Delete a job group manually (bgdel)
Modify a job group limit (bgmod)
Automatic job group cleanup
Handle job exceptions
Email job exception details
Default eadmin actions
Handle job initialization failures
Set clean period for DONE jobs
Job information access control
Setting job information access control
Working with Queues
Queue states
View queue information
View available queues and queue status
View detailed queue information
View the state change history of a queue
View queue administrators
View exception status for queues (bqueues)
Understand successful application exit values
Specify successful application exit values
Control queues
Handle job exceptions in queues
LSF Resources
About LSF resources
View cluster resources (lsinfo)
View host resources (lshosts)
View host load by resource
Resource categories
View shared resources for hosts
How LSF uses resources
View job resource usage
View load on a host
Load indices
Batch built-in resources
Static resources
How LIM detects cores, threads, and processors
Define ncpus—processors, cores, or threads
Define computation of ncpus on dynamic hosts
Define computation of ncpus on static hosts
Automatic detection of hardware reconfiguration
Set the external static LIM
Portable hardware locality
About configured resources
Add new resources to your cluster
Configure the lsf.shared resource section
Configure lsf.cluster.cluster_name Host section
Configure lsf.cluster.cluster_name ResourceMap section
Reserve a static shared resource
External load indices
Modify a built-in load index
Define GPU or MIC resources
External Load Indices
About external load indices
Configuration to enable external load indices
External load indices behavior
Configuration to modify external load indices
External load indices commands
Managing Users and User Groups
View user and user group information
View user information
View user pending job threshold information
View user group information
View user share information
View user group admin information
About user groups
Existing user groups as LSF user groups
LSF user groups
Configure user groups
Configure user group administrators
Configure user group administrator rights
Import external user groups (egroup)
External Host and User Groups
About external host and user groups
Configuration to enable external host and user groups
External host and user groups behavior
Between-Host User Account Mapping
About between-host user account mapping
Configuration to enable between-host user account mapping
Between-host user account mapping behavior
Between-host user account mapping commands
Cross-Cluster User Account Mapping
About cross-cluster user account mapping
Configuration to enable cross-cluster user account mapping
Cross-cluster user account mapping behavior
Cross-cluster user account mapping commands
UNIX/Windows User Account Mapping
About UNIX/Windows user account mapping
Configuration to enable UNIX/Windows user account mapping
UNIX/Windows user account mapping behavior
Configuration to modify UNIX/Windows user account mapping behavior
UNIX/Windows user account mapping commands
Cluster Version Management and Patching on UNIX and Linux
Scope
Patch installation interaction diagram
Patch rollback interaction diagram
Version management components
Patches and distributions
Version command pversions
Patch installer
Patch history and backups
Cluster patch behavior
Cluster rollback behavior
Version management log files
Version management commands
Install update releases on UNIX and Linux
Install fixes on UNIX and Linux
Roll back patches on UNIX and Linux
Monitoring Your Cluster
Achieving Performance and Scalability
Optimize performance in large sites
Tune UNIX for large clusters
Increase the file descriptor limit
Tune LSF for large clusters
Manage scheduling performance
Enable fast job dispatch
Enable continuous scheduling
Use scheduler threads to evaluate resource requirement matching
Limit job dependency evaluation
Limit the number of batch queries
Improve the speed of host status updates
Limit your users’ ability to move jobs in a queue
Manage the number of pending reasons
Achieve efficient event switching
Automatic load updates
Manage I/O performance of the info directory
Job ID limit
Monitor performance metrics in real time
Event Generation
Event generation
Enable event generation for custom programs
Events list
Arguments passed to the LSF event program
Tuning the Cluster
Tune LIM
Load thresholds
Compare LIM load thresholds
LIM reports a host as busy
Interactive jobs
Multiprocessor systems
How LSF works with LSF_MASTER_LIST
Improve mbatchd response time after mbatchd restart
Improve performance of mbatchd query requests on UNIX
Configure mbatchd to use multithreading
Diagnose query requests
Logging mbatchd performance metrics
Improve performance of mbatchd for job array switching events
Increase queue responsiveness
Authentication and Authorization
Change authentication method
Authentication options
Operating system authorization
LSF authorization
Authorization failure
Submitting Jobs with SSH
About SSH
Configuration to enable SSH
Configuration to modify SSH (X11 forwarding)
SSH commands
Troubleshoot SSH X11 forwarding (-XF)
Troubleshoot SSH (-IX)
External Authentication
About external authentication (eauth)
Configuration to enable external authentication
External authentication behavior
Configuration to modify external authentication
External authentication commands
Job Email and Job File Spooling
Email notification
Disable job email
Size of job email
Directory for job output
Specify a directory for job output
File spooling for job input, output, and command files
Specify job input file
Change job input file
Job spooling directory (JOB_SPOOL_DIR)
Specify a job command file (bsub -Zs)
Non-Shared File Systems
About directories and files
Use LSF with non-shared file systems
Remote file access with non-shared file space
Copy files from the submission host to execution host
Specify input file
Copy output files back to the submission host
File transfer mechanism (lsrcp)
Error and Event Logging
System directories and log files
Log levels and descriptions
Manage error logs
Set the log files owner
View the number of file descriptors remaining
Locate error logs
System event log
Duplicate logging of event logs
Configure duplicate logging
LSF job termination reason logging
View logged job exit information (bacct -l)
View recent job exit information (bjobs -l)
Termination reasons displayed by bacct, bhist and bjobs
LSF job exit codes
Troubleshooting and Error Messages
Shared file access
Shared files on Windows
Common LSF problems
Error messages
Set daemon message log to debug level
Set daemon timing levels
Time-Based Configuration
Time Configuration
Time windows
Time expressions
Automatic time-based configuration
Verify configuration
Dispatch and run windows
Run windows
Configure run windows
View information about run windows
Dispatch windows
Configure host dispatch windows
Configure queue dispatch windows
Display host dispatch windows
Display queue dispatch windows
Deadline constraint scheduling
Disable deadline constraint scheduling
Advance Reservation
About advance reservations
Enable advance reservation
Allow users to create advance reservations
Use advance reservation
Add reservations
Use brsvmod to modify advance reservations
Remove an advance reservation
View reservations
Submit and modify jobs using advance reservations
Advance reservation behavior
Job Scheduling Policies
Preemptive Scheduling
About preemptive scheduling
Configuration to enable preemptive scheduling
Preemptive scheduling behavior
Configuration to modify preemptive scheduling behavior
Preemptive scheduling commands
Specifying Resource Requirements
About resource requirements
Queue-level resource requirements
View queue-level resource requirements
Job-level resource requirements
View job-level resource requirements
About resource requirement strings
Selection string
Order string
Usage string
Span string
Same string
Compute unit string
Affinity string
Fairshare Scheduling
Understand fairshare scheduling
User share assignments
Dynamic user priority
Use time decay and committed run time
Historical run time decay
Configure historical run time
How mbatchd reconfiguration and restart affect historical run time
Run time decay
Configure run time decay
Committed run time weighting factor
Configure committed run time
How fairshare affects job dispatch order
Host partition user-based fairshare
View host partition information
Configure host partition fairshare scheduling
Queue-level user-based fairshare
View queue-level fairshare information
Configure queue-level fairshare
Cross-queue user-based fairshare
View cross-queue fairshare information
Configure cross-queue fairshare
Control job dispatch order in cross-queue fairshare
Hierarchical user-based fairshare
View hierarchical share information for a group
View hierarchical share information for a host partition
Configure hierarchical fairshare
Configure a share tree
Queue-based fairshare
Slot allocation per queue
Configure slot allocation per queue
View configured job slot share
View slot allocation of running jobs
Typical slot allocation scenarios
Users affected by multiple fairshare policies
Submit a job and specify a user group
Ways to configure fairshare
Host partition fairshare
Configure host partition fairshare
Chargeback fairshare
Configure chargeback fairshare
Equal share
Configure equal share
Priority user and static priority fairshare
Configure priority user fairshare
Configure static priority fairshare
Resizable jobs and fairshare
Resource Preemption
About resource preemption
Requirements for resource preemption
Custom job controls for resource preemption
Resource preemption steps
Configure resource preemption
Memory preemption
Guaranteed Resource Pools
About guaranteed resources
Configuration overview of guaranteed resource pools
Submitting jobs to use guarantees
Package guarantees
Viewing guarantee policy information
Goal-Oriented SLA-Driven Scheduling
Using goal-oriented SLA scheduling
Configuring service classes for SLA scheduling
Viewing information about SLAs and service classes
Time-based service classes
Configure time-based service classes
Time-based SLA examples
Job groups and time-based SLAs
View job groups attached to a time-based SLA (bjgroup)
SLA CONTROL_ACTION parameter (lsb.serviceclasses)
Submit jobs to a service class
Modify SLA jobs (bmod)
View configured guaranteed resource pools
Monitor the progress of an SLA (bsla)
Exclusive Scheduling
Use exclusive scheduling
Configure an exclusive queue
Configure a host to run one job at a time
Submit an exclusive job
Configure a compute unit exclusive queue
Submit a compute unit exclusive job
Job Scheduling and Dispatch
Working with Application Profiles
Manage application profiles
Add an application profile
Understand successful application exit values
Specify successful application exit values
Submit jobs to application profiles
View application profile information
View available application profiles
How application profiles interact with queue and job parameters
Application profile settings that override queue settings
Application profile limits and queue limits
Define application-specific environment variables
Task limits
Absolute run limits
Pre-execution
Post-execution
Chunk job scheduling
Rerunnable jobs
Resource requirements
Estimated runtime and runtime limits
Job Directories and Data
Temporary job directories
About flexible job CWD
About flexible job output directory
Resource Allocation Limits
Resource allocation limits
Configure resource allocation limits
Enable resource allocation limits
Configure cluster-wide limits
How resource allocation limits map to pre-version 7 job slot limits
Limit conflicts
How job limits work
View information about resource allocation limits
Reserving Resources
About resource reservation
Use resource reservation
Configure resource reservation at the queue level
Specify job-level resource reservation
Configure per-resource reservation
Memory reservation for pending jobs
Reserve host memory for pending jobs
Enable memory reservation for sequential jobs
Configure lsb.queues
Use memory reservation for pending jobs
How memory reservation for pending jobs works
Time-based slot reservation
Configure time-based slot reservation
Assumptions and limitations
Reservation scenarios
Examples
View resource reservation information
View host-level resource information (bhosts)
View queue-level resource information (bqueues)
View reserved memory for pending jobs (bjobs)
View per-resource reservation (bresources)
Job Dependency and Job Priority
Job dependency terminology
Job dependency scheduling
Specify a job dependency
Dependency conditions
View job dependencies
Job priorities
User-assigned job priority
Configure job priority
Specify job priority
View job priority information
Automatic job priority escalation
Configure job priority escalation
Absolute job priority scheduling
Enable absolute priority scheduling
Modify the system APS value (bmod)
Configure APS across multiple queues
Job priority behavior
Job Requeue and Job Rerun
About job requeue
Automatic job requeue
Configure automatic job requeue
Job-level automatic requeue
Configure reverse requeue
Exclusive job requeue
Configure exclusive job requeue
Requeue a job
Automatic job rerun
Configure queue-level job rerun
Submit a rerunnable job
Submit a job as not rerunnable
Disable post-execution for rerunnable jobs
Job Migration
About job migration
Configuration to enable job migration
Job migration behavior
Configuration to modify job migration
Job migration commands
Job Checkpoint and Restart
About job checkpoint and restart
Configuration to enable job checkpoint and restart
Job checkpoint and restart behavior
Configuration to modify job checkpoint and restart
Job checkpoint and restart commands
Resizable Jobs
About resizable jobs
Configuration to enable resizable jobs
Configuration to modify resizable job behavior
Resizable job commands
Autoresizable job management
Submit an autoresizable job
Check pending resize requests
Cancel an active pending request
Specify a resize notification command manually
Script for resizing
How resizable jobs work with other LSF features
Chunk Jobs and Job Arrays
Chunk job dispatch
Configure queue-level job chunking
Configure application-level job chunking
Configure limited job chunking
How LSF submits and controls chunk jobs
Enforce resource usage limits on chunk jobs
Job arrays
Create a job array
Handle input and output files
Prepare input files
Pass arguments on the command line
Set a whole array dependency
Monitor job arrays
Control job arrays
Job array chunking
Requeue jobs in DONE state
Job array job slot limit
Set a job array slot limit at submission
Job Packs
Energy Aware Scheduling
About Energy Aware Scheduling (EAS)
Managing host power states
Configuring host power state management
Power parameters in lsb.params
PowerPolicy section in lsb.resources
Controlling and monitoring host power state management
Valid host statuses for power saved mode
Disabling the power operation feature
Changing lsf.shared / lsf.cluster
Integration with Advance Reservation
Integration with provisioning systems
CPU frequency management
Configuring CPU frequency management
Specifying CPU frequency management for jobs
Job energy usage reporting
Resource usage in job summary email
Automatic CPU frequency selection
Prerequisites
Configure MySQL database
Configuring automatic CPU frequency selection
Installing and configuring benchmarking programs
Checking compute node performance
Calculating coefficient data
Setting a default CPU frequency
Creating an energy policy tag
Energy policy tag format
Generate an energy policy tag
Enable automatic CPU frequency selection
Job Execution and Interactive Jobs
Runtime Resource Usage Limits
About resource usage limits
Enforce limits on chunk jobs
Scaling the units for resource usage limits
Specify resource usage limits
Default run limits for backfill scheduling
Specify job-level resource usage limits
Supported resource usage limits and syntax
Examples
CPU time and run time normalization
Memory enforcement based on Linux cgroup memory subsystem
PAM resource limits
Configure a PAM file
Load Thresholds
Automatic job suspension
Suspending conditions
Configure suspending conditions at queue level
View host-level and queue-level suspending conditions
View job-level suspending conditions
View suspend reason
About resuming suspended jobs
Specify resume condition
View resume thresholds
Pre-Execution and Post-Execution Processing
About pre- and post-execution processing
Configuration to enable pre- and post-execution processing
Pre- and post-execution processing behavior
Check job history for a pre-execution script failure
Configuration to modify pre- and post-execution processing
Set host exclusion based on job-based pre-execution scripts
Pre- and post-execution processing commands
Job Starters
About job starters
Command-level job starters
Queue-level job starters
Configure a queue-level job starter
JOB_STARTER parameter (lsb.queues)
Control the execution environment with job starters
Job Controls
Job Controls
External Job Submission and Execution Controls
About job submission and execution controls
Configuration to enable job submission and execution controls
Job submission and execution controls behavior
Configuration to modify job submission and execution controls
Job submission and execution controls commands
Command arguments for job submission and execution controls
Interactive Jobs with bsub
About interactive jobs
Submit interactive jobs
Submit an interactive job
Submit an interactive job by using a pseudo-terminal
Submit an interactive job and redirect streams to files
Submit an interactive job, redirect streams to files, and display streams
Performance tuning for interactive batch jobs
Interactive batch job messaging
Configure interactive batch job messaging
Example messages
Run X applications with bsub
Configure SSH X11 forwarding for jobs
Write job scripts
Register utmp file entries for interactive batch jobs
Interactive and Remote Tasks
Run remote tasks
Run a task on the best available host
Run a task on a host with specific resources
Resource usage
Run a task on a specific host
Run a task by using a pseudo-terminal
Run the same task on many hosts in sequence
Run parallel tasks
Run tasks on hosts specified by a file
Interactive tasks
Redirect streams to files
Load sharing interactive sessions
Log on to the least loaded host
Log on to a host with specific resources
Load sharing X applications
Start an xterm
xterm on a PC
Set up Exceed to log on to the least loaded host
Start an xterm in Exceed
Examples
Running Parallel Jobs
How LSF runs parallel jobs
Preparing your environment to submit parallel jobs to LSF
Use a job starter
Submit a parallel job
Start parallel tasks with LSF utilities
Job slot limits for parallel jobs
Specify a minimum and maximum number of tasks
Restrict job size requested by parallel jobs
About specifying a first execution host
Specify a first execution host
Rules
Control job locality using compute units
Control processor allocation across hosts
Run parallel processes on homogeneous hosts
Limit the number of processors allocated
Limit the number of allocated hosts
Reserve processors
Configure processor reservation
View information about reserved job slots
Reserve memory for pending parallel jobs
Configure memory reservation for pending parallel jobs
Enable per-slot memory reservation
Backfill scheduling
Configure a backfill queue
Enforce run limits
View information about job start time
Use backfill on memory
Use interruptible backfill
Configure an interruptible backfill queue
View the run limits for interruptible backfill jobs (bjobs and bhist)
Display available slots for backfill jobs
Submit backfill jobs according to available slots
Parallel fairshare
Configure parallel fairshare
How deadline constraint scheduling works for parallel jobs
Optimized preemption of parallel jobs
Configure optimized preemption
Controlling CPU and memory affinity for NUMA hosts
Submitting jobs with affinity resource requirements
Managing jobs with affinity resource requirements
Affinity preemption
Affinity binding based on Linux cgroup cpuset subsystem
Processor binding for LSF job processes
Enabling processor binding for LSF job processes
Processor binding for parallel jobs
Running Parallel Jobs with blaunch
blaunch Distributed Application Framework
SGI Vendor MPI Support
Running Jobs with Task Geometry
Enforcing Resource Usage Limits for Parallel Tasks
Running MPI workload through IBM Parallel Environment Runtime Edition
Enabling IBM PE Runtime Edition for LSF
Network-aware scheduling
Submitting IBM Parallel Environment jobs through LSF
Managing IBM Parallel Environment jobs through LSF
Using LSF with the Etnus TotalView Debugger
How LSF Works with TotalView
Running Jobs for TotalView Debugging
Controlling and Monitoring Jobs Being Debugged in TotalView
Appendices
Submitting Jobs Using JSDL
Use JSDL files with LSF
Submit a job using a JSDL file
Collect resource values using elim.jsdl
Enable JSDL resource collection
Using lstcsh
About lstcsh
Change task list membership
Local and remote modes
Differences from other shells
Limitations
Start lstcsh
Use lstcsh as your login shell
Use chsh
Use a standard system shell
Host redirection
Task control
Bring a remote background task to the foreground
Built-in commands
Shell scripts in lstcsh
Run a script with load sharing enabled
Using Session Scheduler
About IBM Platform Session Scheduler
How Session Scheduler Runs Tasks
Running and monitoring Session Scheduler jobs
Troubleshooting
Using lsmake
About IBM Platform Make
How IBM Platform Make works
lsmake performance
Managing LSF on EGO
About LSF on IBM EGO
LSF and EGO directory structure
Configure LSF and EGO
LSF and EGO corresponding parameters
Parameters that have changed in LSF 9
Special resource groups for LSF master hosts
Manage LSF daemons through EGO
Bypass EGO login at startup (lsf.sudoers)
Administrative basics
Set the command-line environment
Logging and troubleshooting
Frequently asked questions
LSF Integrations
Using LSF with SGI Cpusets
About SGI cpusets
Configuring LSF with SGI Cpusets
Using LSF with SGI Cpusets
Using SGI Comprehensive System Accounting facility (CSA)
Using SGI Cpusets with ULDB
SGI Job Container and Process Aggregate Support
Using LSF Parallel Application Integrations
Using LSF with ANSYS
Using LSF with NCBI BLAST
Using LSF with FLUENT
Using LSF with Gaussian
Using LSF with Lion Bioscience SRS
Using LSF with LSTC LS-DYNA
Using LSF with MSC Nastran
LSF Integration with Cray Linux
Launching ANSYS Jobs
PVM Jobs