Using LSF with SGI Cpusets

Specifying cpuset properties for jobs

To specify cpuset properties for LSF jobs, use:

  • The -extsched option of bsub.

  • DEFAULT_EXTSCHED or MANDATORY_EXTSCHED, or both, in the queue definition (lsb.queues).

If a job is submitted with the -extsched option, LSF submits jobs with hold, then resumes the job before dispatching it to give time for LSF to attach the -extsched options. The job starts on the first execution host.

The syntax for -extsched is:

-ext[sched] "[SGI_]CPUSET[cpuset_options]"

This specifies a list of CPUs and cpuset attributes used by LSF to allocate a cpuset for the job. You can abbreviate the -extsched option to -ext. Use keyword CPUSET[] to identify the external scheduler parameters, where cpuset_options are:

  • CPUSET_TYPE=static |dynamic | none: Specifies the type of cpuset to be allocated. If you specify none, no cpuset is allocated and you cannot specify any other cpuset options, and the job runs outside of any cpuset.

  • CPUSET_NAME=name: Name of a static cpuset. If you specify CPUSET_TYPE=static, you must provide a cpuset name. If you specify a cpuset name, but specify CPUSET_TYPE that is not static, the job is rejected.

The following options are only valid for dynamic cpusets:

  • MAX_RADIUS=radius: Radius is the maximum cpuset radius the job can accept. If the radius requirement cannot be satisfied the job remains pending. MAX_RADIUS implies that the job cannot span multiple hosts. LSF puts each cpuset host into its own group to enforce this when MAX_RADIUS is specified.

  • RESUME_OPTION=ORIG_CPUS: Specifies how LSF should recreate a cpuset when a job is resumed. By default, LSF tries to create the original cpuset when a job resumes. If this fails, LSF tries to create a new cpuset based on the default memory option. ORIG_CPUS specifies that the job must be run on the original cpuset when it resumes. If this fails, the job remains suspended.

  • CPU_LIST=cpu_ID_list: cpu_ID_list is a list of CPU IDs separated by commas. The CPU ID is a positive integer or a range of integers. If incorrect CPU IDs are specified, the job remains pending until the specified CPUs are available. You must specify at least as many CPU IDs as the number of CPUs the job requires (bsub -n). If you specify more CPU IDs than the job requests, LSF selects the best CPUs from the list.

  • CPUSET_OPTIONS=option_list: option_list is a list of cpuset attributes joined by a pipe (|). If incorrect cpuset attributes are specified, the job is rejected. See Cpuset attributes for supported cpuset options.

  • MAX_CPU_PER_NODE=max_num_cpus: max_num_cpus is the maximum number of CPUs on any one node that will be used by this job. Cannot be used with the NODE_EX option.

  • MEM_LIST=mem_node_list: mem_node_list is a list of memory node IDs separated by commas. The memory node ID is a positive integer or a range of integers. For example:

    "CPUSET[MEM_LIST=0,1-2]"

    Incorrect memory node IDs or unavailable memory nodes are ignored when LSF allocates the cpuset.

  • NODE_EX=Y | N: Allocates whole nodes for the cpuset job. This option cannot be used with the MAX_CPU_PER_NODE option.

When a job is submitted using -extsched, LSF creates a cpuset with the specified CPUs and cpuset attributes and attaches it to the processes of the job. The job is then scheduled and dispatched.

Running jobs on specific CPUs

The CPUs available for your jobs may have specific features you need to take advantage of (for example, some CPUs may have more memory, others have a faster processor). You can partition your machines to use specific CPUs for your jobs, but the cpusets for your jobs cannot cross hosts, and you must run multiple operating systems

You can create static cpusets with the particular CPUs your jobs need, but you cannot control the specific CPUs in the cpuset that the job actually uses.

A better solution is to use the CPU_LIST external scheduler option to request specific CPUs for your jobs. LSF can choose the best set of CPUs from the CPU list to create a cpuset for the job. The best cpuset is the one with the smallest CPU radius that meets the CPU requirements of the job. CPU radius is determined by the processor topology of the system and is expressed in terms of the number of router hops between CPUs.

To make job submission easier, you should define queues with the specific CPU_LIST requirements. Set CPU_LIST in MANDATORY_EXTSCHED or DEFAULT_EXTSCHED option in your queue definitions in lsb.queues. CPU_LIST is interpreted as a list of possible CPU selections, not a strict requirement. For example, if you subit a job with the the -R "span[ptile]" option:

bsub -R "span[ptile=1]" -ext "CPUSET[CPU_LIST=1,3]" -n2 ...

the following combination of CPUs is possible:

CPUs on host 1

CPUs on host 2

1

1

1

3.

3

1

3

3

Cpuset attributes

The following cpuset attributes are supported in the list of cpuset options specified by CPUSET_OPTIONS:

  • CPUSET_CPU_EXCLUSIVE: Defines a restricted cpuset.

  • CPUSET_MEMORY_LOCAL: Threads assigned to the cpuset attempt to assign memory only from nodes within the cpuset. Overrides the MEM_LIST cpuset option.

  • CPUSET_MEMORY_EXCLUSIVE: Threads not assigned to the cpuset do not use memory from within the cpuset unless no memory outside the cpuset is available.

  • CPUSET_MEMORY_KERNEL_AVOID: Kernel attempts to avoid allocating memory from nodes contained in this cpuset.

  • CPUSET_MEMORY_MANDATORY: Kernel limits all memory allocations to nodes contained in this cpuset.

  • CPUSET_POLICY_PAGE: Causes the kernel to page user pages to the swap file to free physical memory on the nodes contained in this cpuset. This is the default policy if no other policy is specified. Requires CPUSET_MEMORY_MANDATORY.

  • CPUSET_POLICY_KILL: The kernel attempts to free as much space as possible from kernel heaps, but will not page user pages to the swap file. Requires CPUSET_MEMORY_MANDATORY.

Restrictions on CPUSET_MEMORY_MANDATORY are:

  • CPUSET_OPTIONS=CPUSET_MEMORY_MANDATORY implies node-level allocation.

  • CPUSET_OPTIONS=CPUSET_MEMORY_MANDATORY cannot be used together with MAX_CPU_PER_NODE=max_num_cpus.

You should not use the MPI_DSM_MUSTRUN=ON environment variable. If a job is suspended through preemption, LSF can ensure that cpusets are recreated with the same CPUs, but it cannot ensure that a certain task will run on a specific CPU. Jobs running with MPI_DSM_MUSTRUN cannot migrate to a different part of the machine. MPI_DSM_MUSTRUN also interferes with job checkpointing.

Including memory nodes in the allocation

When you specify a list of memory node IDs with the cpuset external scheduler option MEM_LIST, LSF creates a cpuset for the job that includes the memory nodes specified by MEM_LIST in addition to the local memory attached to the CPUs allocated for the cpuset. For example, if "CPUSET[MEM_LIST=30-40]", and a 2-CPU parallel job is scheduled to run on CPU 0-1 (physically located on node 0), the job is able to use memory on node 0 and nodes 30-40.

Unavailable memory nodes listed in MEM_LIST are ignored when LSF allocates the cpuset. For example, a 4-CPU job across two hosts (hostA and hostB) that specifies MEM_LIST=1 allocates 2 CPUs on each host. The job is scheduled as follows:

  • CPU 0 and CPU 1 (memory=node 0, node 1) on hostA

  • CPU 0 and CPU 1 (memory=node 0, node 1) on hostB

If hostB only has 2 CPUs, only node 0 is available, and the job will only use the memory on node 0.

MEM_LIST is only available for dynamic cpuset jobs at both the queue level and the command level. When both MEM_LIST and CPUSET_OPTIONS=CPUSET_MEMORY_LOCAL are both specified for the job, the root cpuset nodes are included as the memory nodes for the cpuset. MEM_LIST is ignored, and CPUSET_MEMORY_LOCAL overrides MEM_LIST.

If LSB_CPUSET_BESTCPUS is set in lsf.conf, LSF can choose the best set of CPUs that can create a cpuset. The best cpuset is the one with the smallest CPU radius that meets the CPU requirements of the job. CPU radius is determined by the processor topology of the system and is expressed in terms of the number of router hops between CPUs. For better performance, CPUs connected by metarouters are given a relatively high weights so that they are the last to be allocated.

Best-fit and first-fit CPU list

By default, LSB_CPUSET_BESTCPUS=Y is set in lsf.conf. LSF applies a best-fit algorithm to select the best CPUs available for the cpuset. For example, the following command creates an exclusive cpuset with the 8 best CPUs if available:

bsub -n 8 -extsched "CPUSET[CPUSET_OPTIONS=CPUSET_CPU_EXCLUSIVE]" myjob

If LSB_CPUSET_BESTCPUS is not set in lsf.conf, LSF builds a CPU list on a first- fit basis; in this example, the first 8 available CPUs are used.

Use the MAX_RADIUS cpuset external scheduler option to specify the maximum radius for dynamic cpuset allocation. If LSF cannot allocate a cpuset with radius less than or equal to MAX_RADIUS, the job remains pending. MAX_RADIUS implies that the job cannot span multiple hosts. LSF puts each cpuset host into its own group to enforce this when MAX_RADIUS is specified.

The following table shows how the best CPUs are selected:

CPU_LIST

MAX_RADIUS

LSB_CPUSET_BESTCPUS

Algorithm used

Applied to

Specified

Specified or not specified

N

First fit

CPUs in CPU_LIST

Not specified

Specified or not specified

N

First fit

All cpus in system

Specified

Specified

Y

Max radius

CPUs in CPU_LIST

Not specified

Specified

Y

Max radius

All cpus in system

Specified

Not specified

Y

Best fit

CPUs in CPU_LIST

Not specified

Not specified

Y

Best fit

All cpus in system

How cpuset jobs are suspended and resumed

When a cpuset job is suspended (for example, with bstop), job processes are moved out of the cpuset and the job cpuset is destroyed. LSF keeps track of which processes belong to the cpuset, and attempts to recreate a job cpuset when a job is resumed, and binds the job processes to the cpuset.

When a job is resumed, regardless of how it was suspended, the RESUME_OPTION is honored. If RESUME_OPTION=ORIG_CPUS then LSF first tries to get the original CPUs from the same nodes as the original cpuset in order to use the same memory. If this does not get enough CPUs to resume the job, LSF tries to get any CPUs in an effort to get the job resumed.

SGI supports memory migration and does not require additional configuration to enable this feature. If you submit and then suspend a job using a dynamic cpuset, LSF will create a new dynamic cpuset when the job resumes. The memory pages for the job are migrated to the new cpuset as required.

For example, assume a host with 2 nodes, 2 CPUs per node (total of 4 CPUs):

Node

CPUs

0

0 1

1

2 3

When a job running within a cpuset that contains cpu 1 is suspended:

  1. The job processes are detached from the cpuset and suspended.

  2. The cpuset is destroyed.

When the job is resumed:

  1. A cpuset with the same name is recreated.

  2. The processes are resumed and attached to the cpuset.

The RESUME_OPTION parameter determines which CPUs are used to recreate the cpuset:

  • If RESUME_OPTION=ORIG_CPUS, only CPUs from the same nodes originally used are selected.

  • If RESUME_OPTION is not ORIG_CPUS LSF will first attempt to use cpus from the original nodes to minimize memory latency. If this is not possible, any free CPUs from the host will be considered.

If the job originally had a cpuset containing cpu 1, the possibilities when the job is resumed are:

RESUME_OPTION

Eligible CPUs

ORIG_CPUS

0 1

not ORIG_CPUS

0 1 2 3

Viewing cpuset information for your jobs

The bacct -l, bjobs -l, and bhist -l commands display the following information for jobs:

  • CPUSET_TYPE=static | dynamic | none

  • NHOSTS=number

  • HOST=host_name

  • CPUSET_NAME=cpuset_name

  • NCPUS=num_cpus: The number of actual CPUs in the cpuset; can be greater than the number of slots.

For example:

bjobs -l 221
 
Job <221>, User <user1>, Project <default>, Status <DONE>, Queue <normal>, Command <myjob>
Thu Dec 15 14:19:54 2009: Submitted from host <host>, CWD <$HOME>, 2 Processors Requested; 
Thu Dec 15 14:19:57 2009: Started on 2 Hosts/Processors <2*hostA>,
                          Execution Home </home/user1>, Execution CWD
                          </home/user1>
Thu Dec 15 14:19:57 2009: 
CPUSET_TYPE=dynamic;NHOSTS=1;HOST=hostA;CPUSET_NAME=
                     /reg62@221;NCPUS=2; 
Thu Dec 15 14:20:03 2009: Done successfully. The CPU time used is 0.0 seconds
 
SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp   mem
 loadSched   -     -     -     -       -     -    -     -     -      -     -
 loadStop    -     -     -     -       -     -    -     -     -      -     -
 
 EXTERNAL MESSAGES:
 MSG_ID FROM       POST_TIME      MESSAGE              ATTACHMENT
 0        -           -              -                     -
 1        -           -              -                     -
 2      root       Dec 15 14:19   JID=0x118f; ASH=0x0      N
 
bhist -l 221
Job <221>, User <user1>, Project <default>, Command <myjob>
Thu Dec 15 14:19:54 2009: Submitted from host <hostA>, to Queue <normal>,
                          CWD <$HOME>, 2 Processors Requested;
Thu Dec 15 14:19:57 2009: Dispatched to 2 Hosts/Processors <2*hostA>
Thu Dec 15 14:19:57 2009: 
CPUSET_TYPE=dynamic;NHOSTS=1;HOST=hostA
                     ;CPUSET_NAME=/reg62@221;NCPUS=2; 
Thu Dec 15 14:19:57 2009: Starting (Pid 4495); 
Thu Dec 15 14:19:57 2009: External Message "JID=0x118f; ASH=0x0" was posted 
from "root" to message box 2; 
Thu Dec 15 14:20:01 2009: Running with execution home </home/user1>
Execution CWD </home/user1>, Execution Pid <4495>
Thu Dec 15 14:20:01 2009: Done successfully. The CPU time used is 0.0 seconds
Thu Dec 15 14:20:03 2009: Post job process done successfully;
 
Summary of time in seconds spent in various states by  Thu Dec 15 14:20:03
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  3        0        4        0        0        0        7
 
bacct -l 221
Accounting information about jobs that are:
  - submitted by all users.
  - accounted on all projects.
  - completed normally or exited
  - executed on all hosts.
  - submitted to all queues.
  - accounted on all service classes.
 
Job <221>, User <user1>, Project <default>, Status <DONE>, Queue <normal>, Command <myjob>
Thu Dec 15 14:19:54 2009: Submitted from host <hostA>, CWD <$HOME>
Thu Dec 15 14:19:57 2009: Dispatched to 2 Hosts/Processors <2*hostA>
Thu Dec 15 14:19:57 2009: 
CPUSET_TYPE=dynamic;NHOSTS=1;HOST=hostA;CPUSET_NAME=/reg62@221;NCPUS=2; 
Thu Dec 15 14:20:01 2009: Completed <done>
 
Accounting information about this job:
     CPU_T     WAIT     TURNAROUND   STATUS     HOG_FACTOR    MEM    SWAP
      0.03        3              7     done         0.0042     0K      0K
 
SUMMARY:      ( time unit: second )
 Total number of done jobs:       1      Total number of exited jobs:     0
 Total CPU time consumed:       0.0      Average CPU time consumed:     0.0
 Maximum CPU time of a job:     0.0      Minimum CPU time of a job:     0.0
 Total wait time in queues:     3.0
 Average wait time in queue:    3.0
 Maximum wait time in queue:    3.0      Minimum wait time in queue:    3.0
 Average turnaround time:         7      (seconds/job)
 Maximum turnaround time:         7      Minimum turnaround time:         7
 Average hog factor of a job:  0.00      (cpu time / turnaround time)
 Maximum hog factor of a job:  0.00      Minimum hog factor of a job:  0.00

Use brlainfo to display topology information for a cpuset host. It displays:

  • Cpuset host name

  • Cpuset host type

  • Total number of CPUs

  • Free CPUs

  • Total number of nodes

  • Free CPUs per node

  • Available CPUs with a given radius

  • List of static cpusets

For example:

brlainfo
HOSTNAME   CPUSET_OS  NCPUS  NFREECPUS NNODES  NCPU/NODE NSTATIC_CPUSETS
hostA      Linux  x64 10      2         1       2         0
hostB      Linux  x64  4      4         2       2         0
hostC      Linux  x64  4      3         2       2         0
 
brlainfo -l
HOST: hostC
CPUSET_OS   NCPUS  NFREECPUS NNODES  NCPU/NODE NSTATIC_CPUSETS
Linux x64   4      3         2       2         0
FREE CPU LIST: 0-2
NFREECPUS ON EACH NODE: 2/0,1/1
STATIC CPUSETS: NO STATIC CPUSETS
CPU_RADIUS: 2,3,3,3,3,3,3,3

The following are some examples:

  • To specify a dynamic cpuset:

    bsub -n 8 -extsched "CPUSET[CPUSET_TYPE=dynamic;CPU_LIST=1, 5, 7-12;]" myjob

  • If CPUSET_TYPE is not specified, the default cpuset type is dynamic, jobs are attached to a cpuset dynamically created by LSF. The cpuset is deleted when the job finishes or exits.

    bsub -R "span[hosts=1]" -n 8 -extsched "CPUSET[CPU_LIST=1, 5, 7-12;]" myjob

  • To specify a list of CPUs for an exclusive cpuset:

    bsub -n 8 -extsched "CPUSET[CPU_LIST=1, 5, 7-12;

    CPUSET_OPTIONS=CPUSET_CPU_EXCLUSIVE|CPUSET_MEMORY_LOCAL]" myjob

    The job myjob will succeed if CPUs 1, 5, 7, 8, 9, 10, 11, and 12 are available

  • To specify a static cpuset:

    bsub -n 8 -extsched "CPUSET[CPUSET_TYPE=static; CPUSET_NAME=MYSET]" myjob

    Jobs are attached to a static cpuset specified by users at job submission. This cpuset is not deleted when the job finishes or exits.

  • Run a job without using any cpuset:

    bsub -n 8 -extsched "CPUSET[CPUSET_TYPE=none]" myjob

When using preemption, jobs can request static cpusets:

  • bsub -n 4 -q low rusage[scpus=4]" -extsched "CPUSET[CPUSET_NAME=MYSET]"

  • sleep 1000

  • bsub -n 4 -q low rusage[scpus=4]" -extsched "CPUSET[CPUSET_NAME=MYSET]"

  • sleep 1000

After these two jobs start running, submit a job to a high priority queue:

bsub -n 4 -q high rusage[scpus=4]" -

extsched "CPUSET[CPUSET_NAME=MYSET]"

sleep 1000

The most recent job running on the low priority queue (job 102) is preempted by the job submitted to the high priority queue (job 103):

bjobs
JOBID  USER    STAT  QUEUE   FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
103    user1   RUN   high    hosta      4*hosta    *eep 1000 Jan 22 08:24
101    user1   RUN   low     hosta      4*hosta    *eep 1000 Jan 22 08:23
102    user1   SSUSP low     hosta      4*hosta    *eep 1000 Jan 22 08:23
 
bhosts -s
RESOURCE      TOTAL      RESERVED    LOCATION
dcpus         4.0          0.0        hosta
scpus         0.0          8.0        hosta

When using preemption, jobs can also request dynamic cpusets:

bsub -q high rusage[dcpus=1]" -n 3 -extsched "CPUSET[CPU_LIST=1,2,3]" sleep 1000
 
bhosts -s
RESOURCE       TOTAL    RESERVED    LOCATION
dcpus           3.0       1.0       hostA
scpus           8.0       0.0       hostA