bhosts

displays hosts and their static and dynamic resources

Synopsis

bhosts [-a] [-alloc] [-cname] [-e | -l | -w] [-x] [-X] [-R "res_req"] [host_name | host_group | compute_unit] ...
bhosts [-a] [-cname] [-e | -l | -w] [-X] [-R "res_req"] [cluster_name]
bhosts [-a] [-cname] [-e] -s [resource_name ...]
bhosts [-aff] [-l] [[host_name | host_group | compute_unit] ... | cluster_name]
bhosts [-h | -V]

Description

By default, bhosts displays the following information about all hosts: host name, host status, job state statistics, and job slot limits.

bhosts displays output for condensed host groups and compute units. These host groups and compute units are defined by CONDENSE in the HostGroup and ComputeUnit sections of lsb.hosts. Condensed host groups and compute units are displayed as a single entry with the name as defined by GROUP_NAME or NAME in lsb.hosts.

When LSF adds more resources to a running resizable job, bhosts displays the added resources. When LSF removes resources from a running resizable job, bhosts displays the updated resources.

The -l and -X options display uncondensed output.

The -s option displays information about the numeric shared resources and their associated hosts.

With MultiCluster, bhosts displays information about hosts available to the local cluster. Use the -e option to view information about exported hosts.

Options

-a

Dynamic Cluster only. Shows information about all hosts, including Dynamic Cluster virtual machine hosts configured with the jobvm resource. Default output includes only standard LSF hosts and Dynamic Cluster hosts configured with the dchost resource.

-aff

Displays host topology information for CPU and memory affinity scheduling.

-alloc

Shows counters for slots in the RUN, SSUSP, USUSP, and RSV states. The slot allocation differs depending on whether or not the job is an exclusive job.

-cname

In LSF Advanced Edition, includes the cluster name for execution cluster hosts in output. The output displayed is sorted by cluster and then by host name.

-e

MultiCluster only. Displays information about resources that have been exported to another cluster.

-l

Displays host information in a (long) multi-line format. In addition to the default fields, displays information about the CPU factor, the current load, and the load thresholds. Also displays the value of slots for each host. slots is the greatest number of unused slots on a host.

bhosts -l also displays information about the dispatch windows.

When PowerPolicy is enabled (in lsb.threshold), bhosts -l also displays host power states. Final power states are on or suspend. Intermediate power states are restarting, resuming, and suspending. The final power state under administrator control is closed_Power. The final power state under policy control is ok_Power. If the host status becomes unknown (for example, because a power operation failed), the power state is shown as a dash (-).

If you specified an administrator comment with the -C option of the host control commands hclose or hopen, -l displays the comment text.
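For example, an administrator might close a host with a comment and then view it; the host name and comment text here are hypothetical:

```
badmin hclose -C "disk maintenance" hostA
bhosts -l hostA
```

The comment text appears in the ADMIN ACTION COMMENT section of the bhosts -l output.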

-w

Displays host information in wide format. Fields are displayed without truncation.

For condensed host groups and compute units, the -w option displays the overall status and the number of hosts with the ok, unavail, unreach, and busy status in the following format:

host_group_status num_ok/num_unavail/num_unreach/num_busy

where

  • host_group_status is the overall status of the host group or compute unit. If at least one host in the group or unit is ok, the overall status is also ok.

  • num_ok, num_unavail, num_unreach, and num_busy are the number of hosts that are ok, unavail, unreach, and busy, respectively.

For example, if there are five ok, two unavail, one unreach, and three busy hosts in a condensed host group hg1, its status is displayed as the following:

hg1 ok 5/2/1/3

If any hosts in the host group or compute unit are closed, the status for the host group is displayed as closed, with no status for the other states:

hg1 closed
-x

Displays hosts whose job exit rate has exceeded the threshold configured by EXIT_RATE in lsb.hosts for longer than JOB_EXIT_RATE_DURATION configured in lsb.params, and whose exit rate is still high. By default, these hosts are closed the next time LSF checks host exceptions and invokes eadmin.

Use with the -l option to show detailed information about host exceptions.

If no hosts exceed the job exit rate, bhosts -x displays:

There is no exceptional host found
-X

Displays uncondensed output for host groups and compute units.

-R "res_req"

Only displays information about hosts that satisfy the resource requirement expression. For more information about resource requirements, see Administering IBM Platform LSF. The size of the resource requirement string is limited to 512 bytes.

Note: Do not use the rusage keyword in the resource requirement string to select hosts; LSF ignores rusage criteria for this option.

LSF supports ordering of resource requirements on all load indices, including external load indices, either static or dynamic.
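For example, the following command displays only hosts with more than 512 MB of available memory, ordered by CPU utilization (the select and order sections are standard resource requirement syntax; the threshold value is illustrative):

```
bhosts -R "select[mem>512] order[ut]"
```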

-s [resource_name ...]

Specify numeric shared resources only. Displays information about the specified resources: the resource names, the total and reserved amounts, and the resource locations.

bhosts -s only shows consumable resources.

When LOCAL_TO is configured for a license feature in lsf.licensescheduler, bhosts -s shows different resource information depending on the cluster locality of the features. For example:

From clusterA:
bhosts -s 
RESOURCE                 TOTAL       RESERVED       LOCATION
hspice                   36.0        0.0            host1
From clusterB in siteB:
bhosts -s 
RESOURCE                 TOTAL       RESERVED       LOCATION
hspice                   76.0        0.0            host2

When License Scheduler is configured to work with LSF AE submission and execution clusters, LSF AE considers License Scheduler cluster mode and fast dispatch project mode features to be shared features. When run from a host in the submission cluster, bhosts -s shows no TOTAL and RESERVED tokens for the local hosts in the submission cluster, but shows the number of available tokens as TOTAL and the number of used tokens as RESERVED for the execution clusters.

host_name ... | host_group ... | compute_unit ...

Only displays information about the specified hosts. Do not use quotes when specifying multiple hosts.

For host groups and compute units, the names of the member hosts are displayed instead of the name of the host group or compute unit. Do not use quotes when specifying multiple host groups or compute units.

cluster_name

MultiCluster only. Displays information about hosts in the specified cluster.

-h

Prints command usage to stderr and exits.

-V

Prints LSF release version to stderr and exits.

Output: Host-based default

Displays the following fields:

HOST_NAME

The name of the host. If a host has batch jobs running and the host is removed from the configuration, the host name is displayed as lost_and_found.

For condensed host groups, this is the name of the host group.

STATUS

With MultiCluster, not shown for fully exported hosts.

The current status of the host and the sbatchd daemon. Batch jobs can only be dispatched to hosts with an ok status. The possible values for host status are as follows:

ok

The host is available to accept batch jobs.

For condensed host groups, if at least one host in the host group is ok, the overall status is also shown as ok.

If any host in the host group or compute unit is not ok, bhosts displays the first host status it encounters as the overall status for the condensed host group. Use bhosts -X to see the status of individual hosts in the host group or compute unit.

unavail

The host is down, or LIM and sbatchd on the host are unreachable.

unreach

LIM on the host is running but sbatchd is unreachable.

closed

The host is not allowed to accept any remote batch jobs. There are several reasons for the host to be closed (see the Output: Host-based -l option section).

closed_Cu_excl

This host is a member of a compute unit running an exclusive compute unit job.

JL/U

With MultiCluster, not shown for fully exported hosts.

The maximum number of job slots that the host can process on a per user basis. If a dash (-) is displayed, there is no limit.

For condensed host groups or compute units, this is the total number of job slots that all hosts in the group or unit can process on a per user basis.

The host does not allocate more than JL/U job slots for one user at the same time. These job slots are used by running jobs, as well as by suspended or pending jobs that have slots reserved for them.

For preemptive scheduling, the accounting is different. These job slots are used by running jobs and by pending jobs that have slots reserved for them (see the description of PREEMPTIVE in lsb.queues(5) and JL/U in lsb.hosts(5)).

MAX

The maximum number of job slots available. If a dash (-) is displayed, there is no limit.

For condensed host groups and compute units, this is the total maximum number of job slots available in all hosts in the host group or compute unit.

These job slots are used by running jobs, as well as by suspended or pending jobs that have slots reserved for them.

If preemptive scheduling is used, suspended jobs are not counted (see the description of PREEMPTIVE in lsb.queues(5) and MXJ in lsb.hosts(5)).

A host does not always have to allocate this many job slots if there are waiting jobs; the host must also satisfy its configured load conditions to accept more jobs.

NJOBS

The number of tasks for all jobs dispatched to the host. This includes running, suspended, and chunk jobs.

For condensed host groups and compute units, this is the total number of tasks used by jobs dispatched to any host in the host group or compute unit.

If the -alloc option is used, the total is the sum of the RUN, SSUSP, USUSP, and RSV counters.

RUN

The number of tasks for all running jobs on the host.

For condensed host groups and compute units, this is the total number of tasks for running jobs on any host in the host group or compute unit. If the -alloc option is used, the total is the number of slots allocated to jobs on the host.

SSUSP

The number of tasks for all system suspended jobs on the host.

For condensed host groups and compute units, this is the total number of tasks for all system suspended jobs on any host in the host group or compute unit. If the -alloc option is used, the total is the number of slots allocated to jobs on the host.

USUSP

The number of tasks for all user suspended jobs on the host. Jobs can be suspended by the user or by the LSF administrator.

For condensed host groups and compute units, this is the total number of tasks for all user suspended jobs on any host in the host group or compute unit. If the -alloc option is used, the total is the number of slots allocated to jobs on the host.

RSV

The number of tasks for all pending jobs that have slots reserved on the host.

For condensed host groups and compute units, this is the total number of tasks for all pending jobs that have slots reserved on any host in the host group or compute unit. If the -alloc option is used, the total is the number of slots allocated to jobs on the host.

Output: Host-based -l option

In addition to the above fields, the -l option also displays the following:

loadSched, loadStop

The scheduling and suspending thresholds for the host. If a threshold is not defined, the threshold from the queue definition applies. If both the host and the queue define a threshold for a load index, the most restrictive threshold is used.

The migration threshold is the time that a job dispatched to this host can remain suspended by the system before LSF attempts to migrate the job to another host.

If the host's operating system supports checkpoint copy, this is indicated here. With checkpoint copy, the operating system automatically copies all open files to the checkpoint directory when a process is checkpointed. Checkpoint copy is currently supported only on Cray systems.

STATUS

The long format shown by the -l option gives the possible reasons for a host to be closed. If PowerPolicy is enabled in lsb.threshold, the host power state is also shown:

closed_Adm

The host is closed by the LSF administrator or root (see badmin(8)) using badmin hclose. No job can be dispatched to the host, but jobs that are running on the host are not affected.

closed_Busy

The host is overloaded. At least one load index exceeds the configured threshold (see lsb.hosts(5)). Indices that exceed their threshold are identified by an asterisk (*). No job can be dispatched to the host, but jobs that are running on the host are not affected.

closed_Cu_Excl

This host is a member of a compute unit running an exclusive compute unit job (bsub -R "cu[excl]").

closed_EGO

For EGO-enabled SLA scheduling, the host is closed because it has not been allocated by EGO to run LSF jobs. Hosts allocated from EGO display the status ok.

closed_Excl

The host is running an exclusive job (bsub -x).

closed_Full

The maximum number of job slots on the host has been reached. No job can be dispatched to the host, but jobs that are running on the host are not affected.

closed_LIM

LIM on the host is unreachable, but sbatchd is running.

closed_Lock

The host is locked by the LSF administrator or root (see lsadmin(8)) using lsadmin limlock. Running jobs on the host are suspended by LSF (SSUSP). Use lsadmin limunlock to unlock LIM on the local host.

closed_Wind

The host is closed by a dispatch window defined in the configuration file lsb.hosts(5). No job can be dispatched to the host, but jobs that are running on the host are not affected.

on

The host power state is on. Note that a power state of on does not mean that the batch host state is ok; the batch host state depends on whether the master host can connect to LIM and sbatchd on the host.

off

The host is powered off, either by policy or manually.

suspend

The host is suspended, either by policy or manually with badmin hpower.

restarting

The host is restarting because a resume operation failed.

resuming

The host is being resumed from the standby state, triggered either by policy or by the cluster administrator.

suspending

The host is being suspended, triggered either by policy or by the cluster administrator.

closed_Power

The host is put into the power-saving (suspend) state by the cluster administrator.

ok_Power

The host is put into the power-saving (suspend) state by the power policy.

CPUF

Displays the CPU normalization factor of the host (see lshosts(1)).

DISPATCH_WINDOW

Displays the dispatch windows for each host. Dispatch windows are the time windows during the week when batch jobs can be run on each host. Jobs already started are not affected by the dispatch windows. When the dispatch windows close, jobs are not suspended. Jobs already running continue to run, but no new jobs are started until the windows reopen. The default for the dispatch window is no restriction or always open (that is, twenty-four hours a day and seven days a week). For the dispatch window specification, see the description for the DISPATCH_WINDOWS keyword under the -l option in bqueues(1).

CURRENT LOAD

Displays the total and reserved host load.

Reserved

You specify reserved resources by using bsub -R. These resources are reserved by jobs running on the host.

Total

The total load has different meanings depending on whether the load index is increasing or decreasing.

For increasing load indices, such as run queue lengths, CPU utilization, paging activity, logins, and disk I/O, the total load is the consumed amount plus the reserved amount, that is, the sum of the current load and the reserved load. The current load is the load seen by lsload(1).

For decreasing load indices, such as available memory, idle time, available swap space, and available space in tmp, the total load is the amount still available, that is, the current load minus the reserved load. This difference is the available amount of the resource as seen by lsload(1).
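The two calculations can be illustrated with hypothetical values (units are arbitrary; the variable names are not part of LSF):

```shell
# Increasing load index (for example, paging rate):
# Total = current load + reserved load
current=100; reserved=20
total_increasing=$((current + reserved))
echo "$total_increasing"   # 120

# Decreasing load index (for example, available memory):
# Total = current load - reserved load
current=1024; reserved=512
total_decreasing=$((current - reserved))
echo "$total_decreasing"   # 512
```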

LOAD THRESHOLD

Displays the scheduling threshold loadSched and the suspending threshold loadStop. Also displays the migration threshold if defined and the checkpoint support if the host supports checkpointing.

The format for the thresholds is the same as for batch job queues (see bqueues(1)) and lsb.queues(5)). For an explanation of the thresholds and load indices, see the description for the "QUEUE SCHEDULING PARAMETERS" keyword under the -l option in bqueues(1).

THRESHOLD AND LOAD USED FOR EXCEPTIONS

Displays the configured threshold of EXIT_RATE for the host and its current load value for host exceptions.

ADMIN ACTION COMMENT

If the LSF administrator specified an administrator comment with the -C option of the badmin host control commands hclose or hopen, the comment text is displayed.

PE NETWORK INFORMATION

Displays network resource information for IBM Parallel Edition (PE) jobs submitted with the bsub -network option, or to a queue (defined in lsb.queues) or an application profile (defined in lsb.applications) with the NETWORK_REQ parameter defined.

For example:
bhosts -l

...
PE NETWORK INFORMATION
NetworkID                      Status                 rsv_windows/total_windows
1111111                        ok                           4/64 
2222222                        closed_Dedicated             4/64 
...

NetworkID is the physical network ID returned by PE.

Network Status is one of the following:
  • ok - normal status

  • closed_Full - all network windows are reserved

  • closed_Dedicated - a dedicated PE job is running on the network (usage=dedicated specified in the network resource requirement string)

  • unavail - network information is not available

CONFIGURED AFFINITY CPU LIST

The host is configured in lsb.hosts to accept jobs for CPU and memory affinity scheduling. If AFFINITY is configured as Y, the keyword all is displayed. If a CPU list is specified under the AFFINITY column, the configured CPU list for affinity scheduling is displayed.

Output: Resource-based -s option

The -s option displays the following: the amounts used for scheduling, the amounts reserved, and the associated hosts for the resources. Only resources (shared or host-based) with numeric values are displayed. See lim and lsf.cluster for information about how to configure shared resources.

The following fields are displayed:

RESOURCE

The name of the resource.

TOTAL

The total free amount of the resource that is available for scheduling.

RESERVED

The amount reserved by jobs. You specify the reserved resource using bsub -R.

LOCATION

The hosts that are associated with the resource.

Output: Host-based -aff option

The -aff option displays host topology information for CPU and memory affinity scheduling. Only the topology nodes containing CPUs in the CPULIST defined in lsb.hosts are displayed.

The following fields are displayed:

AFFINITY

If the host is configured in lsb.hosts to accept jobs for CPU and memory affinity scheduling, and the host supports affinity scheduling, AFFINITY: Enabled is displayed. If the host is configured in lsb.hosts to accept jobs for CPU and memory affinity scheduling, but the host does not support affinity scheduling, AFFINITY: Disabled (not supported) is displayed. If the host LIM is not available or sbatchd is unreachable, AFFINITY: UNKNOWN is displayed.

Host[memory] host_name

Maximum available memory on the host. If memory availability cannot be determined, a dash (-) is displayed for the host. If the -l option is specified with the -aff option, the host name is not displayed.

For hosts that do not support affinity scheduling, a dash (-) is displayed for host memory and no host topology is displayed.

NUMA[numa_node: requested_mem / max_mem]

Requested and total NUMA node memory. It is possible for requested memory for the NUMA node to be greater than the maximum available memory displayed.

A socket is a collection of cores with a direct pipe to memory. Each socket contains 1 or more cores. This does not necessarily refer to a physical socket, but rather to the memory architecture of the machine.

A core is a single entity capable of performing computations.

A node contains sockets, a socket contains cores, and a core can contain threads if the core is enabled for multithreading.

If no NUMA nodes are present, then the NUMA layer in the output is not shown. Other relevant items such as host, socket, core and thread are still shown.

If the host is not available, only the host name is displayed. A dash (-) is shown where available host memory would normally be displayed.

For example:
bhosts -l -aff hostA
HOST  hostA
STATUS           CPUF  JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV DISPATCH_WINDOW
ok              60.00     -      8      0      0      0      0      0      -

 CURRENT LOAD USED FOR SCHEDULING:
                r15s   r1m  r15m    ut    pg    io   ls    it   tmp   swp   mem  slots
 Total           0.0   0.0   0.0   30%   0.0   193   25     0 8605M  5.8G 13.2G      8
 Reserved        0.0   0.0   0.0    0%   0.0     0    0     0    0M    0M    0M      -


 LOAD THRESHOLD USED FOR SCHEDULING:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -


 CONFIGURED AFFINITY CPU LIST: all

 AFFINITY: Enabled
 Host[15.7G]
     NUMA[0: 100M / 15.7G]
         Socket0
             core0(0)
         Socket1
             core0(1)
         Socket2
             core0(2)
         Socket3
             core0(3)
         Socket4
             core0(4)
         Socket5
             core0(5)
         Socket6
             core0(6)
         Socket7
             core0(7)
  
When LSF detects missing elements in the topology, it attempts to correct the problem by adding the missing levels into the topology. For example, sockets and cores are missing on hostB below:
...
Host[1.4G] hostB
    NUMA[0: 1.4G / 1.4G] (*0 *1)
...

A job requesting 2 cores, 2 sockets, or 2 CPUs will run. A job requesting 2 cores from the same NUMA node will also run. However, a job requesting 2 cores from the same socket will remain pending.
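For example, a job that needs 2 cores with no locality constraint can be submitted with an affinity resource requirement similar to the following (the job command is hypothetical):

```
bsub -R "affinity[core(2)]" ./myjob
```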

Files

Reads lsb.hosts.

See also

lsb.hosts, bqueues, lshosts, badmin, lsadmin