Configuration to enable external load indices

To enable the use of external load indices, you must
  • Define the dynamic external resources in lsf.shared. By default, these resources are host-based (local to each host) until the LSF administrator configures a resource-to-host mapping in the ResourceMap section of lsf.cluster.cluster_name. The presence of the dynamic external resource in both lsf.shared and lsf.cluster.cluster_name triggers LSF to start the elim executables.

  • Map the external resources to hosts in your cluster in lsf.cluster.cluster_name.
    Important:

    You must run the command lsadmin reconfig followed by badmin mbdrestart to apply changes.

  • Create one or more elim executables in the directory specified by the parameter LSF_SERVERDIR. LSF does not include a default elim; you should write your own executable to meet the requirements of your site. The section Create an elim executable provides guidelines for writing an elim.

GPFS ELIM

IBM® General Parallel File System (GPFS™) is a high-performance, shared-disk file system that supports the AIX®, Linux, and Windows operating systems. Its main differentiator is that it is a parallel file system rather than only a clustered one, which allows it to scale to very large configurations. Using Platform RTM, you can monitor GPFS data.

In the RTM GUI, you can monitor GPFS per LSF host and per LSF cluster, either as a whole or at the volume level.

Host level:

  • Average MB In/Out per second

  • Maximum MB In/Out per second

  • Average file Reads/Writes per second

  • Average file Opens/Closes/Directory Reads/Node Updates per second

Cluster level:

  • MB available capacity In/Out

  • Resources can be reserved against the maximum bandwidth that is currently available. For example, to reserve 100 kbytes of inbound bandwidth at the cluster level for 20 minutes:

    bsub -q normal -R "rusage[gtotalin=100:duration=20]" ./myapplication myapplication_options

Configuring ELIM Script

Configure the following ELIMs in LSF before proceeding:

  • elim.gpfshost - Monitors GPFS performance counters at LSF host level

  • elim.gpfsglobal - Monitors available GPFS bandwidth at LSF cluster level

The ELIM scripts are available for LSF 9.1.1 and later versions.

  1. Configure the constants in elim.gpfshost:

    1. Set VOLUMES to the name of the GPFS file system to monitor.

    2. [Optional] Set CHECK_INTERVAL, FLOATING_AVG_INTERVAL, and DECIMAL_DIGITS.

  2. Configure the constants in elim.gpfsglobal:

    1. Set VOLUMES to the name of the GPFS file system to monitor.

    2. Set MAX_INBOUND to the maximum write bandwidth for each GPFS file system.

    3. Set MAX_OUTBOUND to the maximum read bandwidth for each GPFS file system.

    4. [Optional] Set CHECK_INTERVAL, FLOATING_AVG_INTERVAL, and DECIMAL_DIGITS.

Configuring LSF cluster

Procedure

  1. Add the GPFS node to the cluster as an LSF server, compute node, or master candidate.
  2. Configure the external load indices as LSF resources in the Resource section of lsf.shared. For example:
        gstatus                 String  (30)       ()          ()
        gbytesin                Numeric (30)       Y           ()
        gbytesout               Numeric (30)       Y           ()
        gopens                  Numeric (30)       Y           ()
        gcloses                 Numeric (30)       Y           ()
        greads                  Numeric (30)       Y           ()
        gwrites                 Numeric (30)       Y           ()
        grdir                   Numeric (30)       Y           ()
        giupdate                Numeric (30)       Y           ()
        gbytesin_gpfs_dev_name  Numeric (30)       Y           ()
        gbytesout_gpfs_dev_name Numeric (30)       Y           ()
        gtotalin                Numeric (30)       N           ()
        gtotalout               Numeric (30)       N           ()
  3. Map the external resources to hosts in the ResourceMap section of lsf.cluster.cluster_name. For example:

    Begin ResourceMap

    RESOURCENAME LOCATION

    #GPFS Per Host Resources
    gstatus            ([hostgpfs01] [hostgpfs02] [hostgpfs03])
    gbytesin           ([hostgpfs01] [hostgpfs02] [hostgpfs03])
    gbytesout          ([hostgpfs01] [hostgpfs02] [hostgpfs03])
    gopens             ([hostgpfs01] [hostgpfs02] [hostgpfs03])
    gcloses            ([hostgpfs01] [hostgpfs02] [hostgpfs03])
    greads             ([hostgpfs01] [hostgpfs02] [hostgpfs03])
    gwrites            ([hostgpfs01] [hostgpfs02] [hostgpfs03])
    grdir              ([hostgpfs01] [hostgpfs02] [hostgpfs03])
    giupdate           ([hostgpfs01] [hostgpfs02] [hostgpfs03])
    gbytesin_gpfs01    ([hostgpfs01] [hostgpfs02] [hostgpfs03])
    gbytesout_gpfs01   ([hostgpfs01] [hostgpfs02] [hostgpfs03])
    gbytesin_gpfs02    ([hostgpfs01] [hostgpfs02] [hostgpfs03])
    gbytesout_gpfs02   ([hostgpfs01] [hostgpfs02] [hostgpfs03])
    #GPFS shared resources
    gtotalin           ([all])
    gtotalout          ([all])
    End ResourceMap
  4. Copy the elim executables to $LSF_SERVERDIR on your cluster. For example:
    # cp elim.gpfshost elim.gpfsglobal $LSF_SERVERDIR
    By default, the ELIM executables are stored in /opt/rtm/etc.
  5. Reconfigure your cluster.
    # lsadmin reconfig
    # badmin mbdrestart

Define a dynamic external resource

To define a dynamic external resource for which elim collects an external load index value, define the following parameters in the Resource section of lsf.shared:

Configuration file: lsf.shared

Each parameter and its syntax is listed with its description.

RESOURCENAME resource_name

  • Specifies the name of the external resource.

TYPE Numeric

  • Specifies the type of external resource: Numeric resources have numeric values.

  • Specify Numeric for all dynamic resources.

INTERVAL seconds

  • Specifies the interval for data collection by an elim.

  • For numeric resources, defining an interval identifies the resource as a dynamic resource with a corresponding external load index.
    Important:

    You must specify an interval: LSF treats a numeric resource with no interval as a static resource and, therefore, does not collect load index values for that resource.

INCREASING Y | N

  • Specifies whether a larger value indicates a greater load.
    • Y: a larger value indicates a greater load. For example, if you define an external load index, the larger the value, the heavier the load.

    • N: a larger value indicates a lighter load.

RELEASE Y | N

  • For shared resources only, specifies whether LSF releases the resource when a job that uses the resource is suspended.
    • Y: Releases the resource.

    • N: Holds the resource.

DESCRIPTION description

  • Enter a brief description of the resource.

  • The lsinfo command and the ls_info() API call return the contents of the DESCRIPTION parameter.

Map an external resource

Once external resources are defined in lsf.shared, they must be mapped to hosts in the ResourceMap section of lsf.cluster.cluster_name.

Configuration file: lsf.cluster.cluster_name

Each parameter and its syntax is listed with its default behavior.

RESOURCENAME resource_name

  • Specifies the name of the external resource as defined in the Resource section of lsf.shared.

LOCATION ([all]) | ([all ~host_name])

  • Maps the resource to the master host only; all hosts share a single instance of the dynamic external resource.

  • To prevent specific hosts from accessing the resource, use the not operator (~) and specify one or more host names. All other hosts can access the resource.

LOCATION [default]

  • Maps the resource to all hosts in the cluster; every host has an instance of the dynamic external resource.

  • If you use the default keyword for any external resource, all elim executables in LSF_SERVERDIR run on all hosts in the cluster. For information about how to control which elim executables run on each host, see the section How LSF determines which hosts should run an elim executable.

LOCATION ([host_name]) | ([host_name] [host_name])

  • Maps the resource to one or more specific hosts.

  • To specify sets of hosts that share a dynamic external resource, enclose each set in square brackets ([ ]) and use a space to separate each host name.

Create an elim executable

You can write one or more elim executables. The load index names defined in your elim executables must be the same as the external resource names defined in the lsf.shared configuration file.

All elim executables must
  • Be located in LSF_SERVERDIR and follow these naming conventions:

    Operating system    Naming convention
    UNIX                LSF_SERVERDIR/elim.application
    Windows             LSF_SERVERDIR\elim.application.exe
                        or
                        LSF_SERVERDIR\elim.application.bat

    Restriction:

    The name elim.user is reserved for backward compatibility. Do not use the name elim.user for your application-specific elim.

    Note:

    LSF invokes any elim that follows this naming convention. Move backup copies out of LSF_SERVERDIR or choose a name that does not follow the convention. For example, use elim_backup instead of elim.backup.

  • Exit upon receipt of a SIGTERM signal from the load information manager (LIM).

  • Periodically output a load update string to stdout in the format number_indices index_name index_value [index_name index_value …] where

    Value             Defines
    number_indices    The number of external load indices that are collected by the elim.
    index_name        The name of the external load index.
    index_value       The external load index value that is returned by your elim.

For example, the string

3 tmp2 47.5 nio 344.0 tmp 5

reports three indices: tmp2, nio and tmp, with values 47.5, 344.0, and 5, respectively.
    • The load update string must end with exactly one \n or exactly one space. On Windows, echo adds the \n.

    • The load update string must report values between -INFINIT_LOAD and INFINIT_LOAD as defined in the lsf.h header file.

    • The elim should ensure that the entire load update string is written successfully to stdout. Program the elim to exit if it fails to write the load update string to stdout.
      • If the elim executable is a C program, check the return value of printf(3s).

      • If the elim executable is a shell script, check the return code of /bin/echo(1).

    • If the elim executable is implemented as a C program, use setbuf(3) during initialization to send unbuffered output to stdout.

    • Each LIM sends updated load information to the master LIM every 15 seconds; the elim executable should write the load update string at most once every 15 seconds. If the external load index values rarely change, program the elim to report the new values only when a change is detected.
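The requirements above can be sketched as a small shell elim. This is only an illustration under stated assumptions, not a supported implementation: the resource name myrsc, the collect_value placeholder, and the INTERVAL variable are hypothetical, and myrsc would have to match a resource defined in lsf.shared. The main loop starts only when the script is invoked under a name that matches elim.*, which is a convenience of this sketch rather than something LSF requires.

```shell
#!/bin/sh
# Sketch of an elim. "myrsc", collect_value, and INTERVAL are hypothetical.

INTERVAL=${INTERVAL:-15}   # seconds between reports; the text above suggests at most one per 15 s

# Exit when LIM sends SIGTERM.
trap 'exit 0' TERM

# Placeholder measurement; replace with a site-specific command.
collect_value() {
    echo 42
}

# Print "number_indices index_name index_value [...]" from name/value pairs.
# echo ends the string with exactly one newline.
build_update() {
    echo "$(( $# / 2 )) $*"
}

# Report until terminated; exit if the write to stdout fails.
run_elim() {
    while :; do
        build_update myrsc "$(collect_value)" || exit 1
        # Sleep in the background so that the TERM trap runs during wait
        # instead of after the sleep finishes.
        sleep "$INTERVAL" &
        wait $!
    done
}

# Start the loop only when installed under an elim.* name.
case "${0##*/}" in
    elim.*) run_elim ;;
esac
```

For example, build_update tmp2 47.5 nio 344.0 tmp 5 prints the three-index string shown earlier in this section.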

If you map any external resource as default in lsf.cluster.cluster_name, all elim executables in LSF_SERVERDIR run on all hosts in the cluster. If LSF_SERVERDIR contains more than one elim executable, you should include a header that checks whether the elim is programmed to report values for the resources expected on the host. For detailed information about using a checking header, see the section How environment variables determine elim hosts.
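A checking header might look like the following sketch. It assumes that LIM passes the resources expected on the host in the LSF_RESOURCES environment variable as a space-separated list, and an exit code in ELIM_ABORT_VALUE, as covered in the referenced section; confirm both against that section before relying on them. MY_RESOURCES and has_expected_resource are hypothetical names, and the fallback exit code of 1 is an assumption made for this example.

```shell
#!/bin/sh
# Sketch of a checking header. MY_RESOURCES is a hypothetical list of the
# indices this elim reports. LSF_RESOURCES and ELIM_ABORT_VALUE are assumed
# to be set by LIM in the environment of the elim.

MY_RESOURCES="gtotalin gtotalout"

# Succeed if any name in $1 also appears in the space-separated list $2.
has_expected_resource() {
    for mine in $1; do
        case " $2 " in
            *" $mine "*) return 0 ;;
        esac
    done
    return 1
}

# Skip the check when LSF_RESOURCES is not set, for example in a manual run.
if [ -n "${LSF_RESOURCES:-}" ]; then
    # Stop early if none of this elim's resources are expected on the host.
    has_expected_resource "$MY_RESOURCES" "$LSF_RESOURCES" ||
        exit "${ELIM_ABORT_VALUE:-1}"
fi

# The data collection loop of the elim would follow here.
```

Placing the check before the collection loop means that an elim started on a host with no matching resources exits immediately instead of reporting values that nothing uses.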

Overriding built-in load indices

An elim executable can be used to override the value of a built-in load index. For example, if your site stores temporary files in the /usr/tmp directory, you might want to monitor the amount of space available in that directory. An elim can report the space available in the /usr/tmp directory as the value for the tmp built-in load index.

To override a built-in load index value, write an elim executable that periodically measures the value of the dynamic external resource and writes the numeric value to standard output. The external load index must correspond to a numeric, dynamic external resource as defined by TYPE and INTERVAL in lsf.shared.

You can find the built-in load index type and name in the lsinfo output.

For example, an elim collects available space under /usr/tmp as 20M. Then, it can report the value as available tmp space (the built-in load index tmp) in the load update string: 1 tmp 20.
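As a rough sketch of that example, the following shell elim reports the free space under /usr/tmp as the built-in index tmp, in MB. It assumes the POSIX df -Pk output format, in which the fourth column of the second line is the available space in 1024-byte blocks. TMP_DIR, avail_mb, and report_tmp are hypothetical names, and the 15-second sleep is an illustrative choice.

```shell
#!/bin/sh
# Sketch: report free space under /usr/tmp as the built-in index tmp.
# Assumes "df -Pk" output: line 2, column 4 = available 1024-byte blocks.

TMP_DIR=${TMP_DIR:-/usr/tmp}

# Read "df -Pk" output on stdin and print the available space in whole MB.
avail_mb() {
    awk 'NR == 2 { printf "%d\n", $4 / 1024 }'
}

# Print one load update string for tmp; fail if no value could be read.
report_tmp() {
    mb=$(df -Pk "$TMP_DIR" | avail_mb)
    [ -n "$mb" ] || return 1
    echo "1 tmp $mb"
}

# Start reporting only when installed under an elim.* name.
case "${0##*/}" in
    elim.*)
        while :; do
            report_tmp || exit 1
            sleep 15
        done
        ;;
esac
```

With 20480 blocks of 1 KB available, report_tmp prints 1 tmp 20, which matches the load update string in the example above.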

The following built-in load indices cannot be overridden by an elim: logins, idle, cpu, and swap.

Setting up an ELIM to support JSDL

To support the use of Job Submission Description Language (JSDL) files at job submission, LSF collects the following load indices:

Attribute name               Attribute type   Resource name
OperatingSystemName          string           osname
OperatingSystemVersion       string           osver
CPUArchitectureName          string           cpuarch
IndividualCPUSpeed           int64            cpuspeed
IndividualNetworkBandwidth   int64            bandwidth (This is the maximum bandwidth.)

The file elim.jsdl is automatically configured to collect these resources. To enable the use of elim.jsdl, uncomment the lines for these resources in the ResourceMap section of the file lsf.cluster.cluster_name.

Example of an elim executable

See the section How environment variables determine elim hosts for an example of a simple elim script.

You can find more elim examples in the LSF_MISC/examples directory. The elim.c file is an elim written in C. You can modify this example to collect the external load indices that are required at your site.