Controlling CPU and memory affinity for NUMA hosts

Platform LSF can schedule jobs that are affinity aware. This allows jobs to take advantage of different levels of processing units (NUMA nodes, sockets, cores, and threads). Affinity scheduling is supported only on Linux and Power 7 and Power 8 hosts. Affinity scheduling is supported in Platform LSF Standard Edition and Platform LSF Advanced Edition. Affinity scheduling is not supported on Platform LSF Express Edition.

An affinity resource requirement string specifies CPU or memory binding requirements for the tasks of jobs requiring topology-aware scheduling. An affinity[] resource requirement section controls CPU and memory resource allocations and specifies the distribution of processor units within a host according to the hardware topology information that LSF collects. The syntax supports basic affinity requirements for sequential jobs, as well as very complex task affinity requirements for parallel jobs.

An affinity[] section is accepted by bsub -R, and by bmod -R for non-running jobs, and can be specified in the RES_REQ parameter in lsb.applications and lsb.queues. Job-level affinity resource requirements take precedence over application-level requirements, which in turn override queue-level requirements.
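As an illustration of a job-level requirement, the following is a minimal sketch that builds an affinity[] string and prints the bsub and bmod commands that would carry it. The job command ./myapp and the job ID 1234 are hypothetical placeholders, and the specific binding options are only one possible choice; the commands are printed rather than executed so the sketch can be tried on a host without LSF.

```shell
#!/bin/sh
# Hypothetical job command and job ID, used only for illustration.
job_cmd="./myapp"
job_id="1234"

# One core per task, CPU binding at core level, memory taken only
# from the local NUMA node.
affinity_req='affinity[core(1):cpubind=core:membind=localonly]'

# Job-level submission with the affinity requirement.
submit_cmd="bsub -R \"$affinity_req\" $job_cmd"

# Modification of the requirement for a job that is not yet running.
modify_cmd="bmod -R \"$affinity_req\" $job_id"

# Print the commands instead of running them.
echo "$submit_cmd"
echo "$modify_cmd"
```

Because this requirement is given at job level, it would take precedence over any affinity[] section in the RES_REQ of the application profile or queue.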

You can use bmod to modify affinity resource requirements. After using bmod to modify memory resource usage of a running job with affinity requirements, bhosts -l -aff may show some inconsistency between host-level memory and available memory in NUMA nodes. The modified memory resource requirement takes effect immediately at host level, but the bhosts -aff display reflects it only in the next scheduling cycle of the job.

Enabling affinity scheduling

Enable CPU and memory affinity scheduling with the AFFINITY keyword in lsb.hosts.
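For example, a Host section along the following lines would enable affinity scheduling on one host and leave it disabled on another. This is a sketch only: the host names hostA and hostB and the MXJ and r1m column values are placeholders, not recommended settings.

```
Begin Host
HOST_NAME    MXJ    r1m    AFFINITY
hostA        !      ()     (Y)
hostB        !      ()     (N)
End Host
```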

Make sure that the affinity scheduling plugin schmod_affinity is defined in lsb.modules:
Begin PluginModule
SCH_PLUGIN         RB_PLUGIN      SCH_DISABLE_PHASES
schmod_default       ()              ()
...
schmod_affinity      ()              ()
End PluginModule

Limitations and known issues

CPU and memory affinity scheduling has the following limitations.
  • Affinity resources cannot be released during preemption, so you should configure mem as a preemptable resource in lsb.params.
  • When a job with affinity resources allocated has been stopped with bstop, the allocated affinity resources (thread, core, socket, NUMA node, NUMA memory) are not released.
  • Affinity scheduling is disabled for hosts with cpuset scheduling enabled, and on Cray Linux hosts.
  • When reservation is enabled, affinity reservation allocations appear as part of the allocated resources in bhosts -aff.

    Jobs that are submitted with a membind=localprefer binding policy may overcommit the memory of the NUMA node they are allocated to.

    bhosts -aff output may occasionally show the total allocated memory on the NUMA nodes of a host as exceeding the maximum memory of the host. This is because the reservations that show in bhosts -aff overcommit the NUMA node. However, LSF never allows the allocation of running jobs on a host to exceed the maximum memory of the host.

  • When reservation is enabled and an affinity job requests enough resources to consume an entire node in the host topology (for example, enough cores to consume an entire socket), LSF does not reserve the socket for the job if any jobs are running on its cores. If smaller jobs are always running and consuming cores, larger jobs that require entire sockets cannot reserve resources. The workaround is to require that all jobs have estimated run times, and to use time-based reservation.
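
For the first limitation above, configuring mem as a preemptable resource might look like the following fragment of lsb.params. This is a sketch under the assumption that no other preemptable resources are already listed; if the parameter is already defined, mem would be added to the existing list instead.

```
Begin Parameters
PREEMPTABLE_RESOURCES = mem
End Parameters
```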