Managing IBM Parallel Environment jobs through LSF

Modifying network scheduling options for Parallel Environment jobs

Use the bmod -network option to modify the network scheduling options for submitted IBM Parallel Environment (PE) jobs. The bmod -networkn option removes any network scheduling options for the PE job.

You cannot modify the network scheduling options for running jobs, even if LSB_MOD_ALL_JOBS=y is defined.
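The two forms of the command can be sketched as follows. This Python helper only assembles the argument list for bmod and does not run it. The function name build_bmod_network_args, the job ID 2106, and the sample requirement string are illustrative assumptions, not part of LSF.

```python
from typing import List, Optional


def build_bmod_network_args(job_id: int, network_req: Optional[str] = None) -> List[str]:
    """Return the argument list for changing or removing a job's network options.

    network_req is a network requirement string (assumed example:
    "protocol=mpi:mode=US:type=sn_all"). None means "remove the options".
    """
    if network_req is None:
        # bmod -networkn removes any network scheduling options for the job.
        return ["bmod", "-networkn", str(job_id)]
    # bmod -network replaces the network scheduling options for the job.
    return ["bmod", "-network", network_req, str(job_id)]


# Hypothetical job ID and requirement string, for illustration only.
print(build_bmod_network_args(2106, "protocol=mpi:mode=US:type=sn_all"))
print(build_bmod_network_args(2106))
```

Because running jobs cannot be modified, a wrapper built on this sketch would need to check the job state before calling bmod.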

Network resource information (lsload -l)

If LSF_PE_NETWORK_NUM is set to a value greater than zero in lsf.conf, LSF collects network information for scheduling IBM Parallel Environment (PE) jobs. Two string resources are created for PE jobs:
pe_network

A host-based string resource that contains the network ID and the number of network windows available on the network.

pnsd

Set to Y if the PE network resource daemon pnsd responds successfully, or N if there is no response. PE jobs can only run on hosts with pnsd installed and running.

lsload -l displays the values of these two resources and shows network information for PE jobs. For example, the following lsload command displays network information for hostA and hostB, both of which have two networks available. Each network has 256 windows, and pnsd is responsive on both hosts. In this case, LSF_PE_NETWORK_NUM=2 should be set in lsf.conf:
lsload -l
HOST_NAME   status  r15s   r1m  r15m   ut    pg    io  ls    it   tmp   swp   mem   pnsd  pe_network
hostA           ok   1.0   0.1   0.2  10%   0.0     4  12     1   33G 4041M 2208M      Y  ID= 1111111,win=256;ID= 2222222,win=256
hostB           ok   1.0   0.1   0.2  10%   0.0     4  12     1   33G 4041M 2208M      Y  ID= 1111111,win=256;ID= 2222222,win=256
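
The pe_network value in the lsload output above is a semicolon-separated list of ID and win pairs. As a minimal sketch, assuming the value always follows the layout shown in this example, the following Python function splits it into a mapping from network ID to available windows. The function name parse_pe_network is hypothetical.

```python
from typing import Dict


def parse_pe_network(value: str) -> Dict[str, int]:
    """Map each network ID in a pe_network string to its window count."""
    networks: Dict[str, int] = {}
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing semicolon
        fields = {}
        for pair in entry.split(","):
            key, _, val = pair.partition("=")
            # The sample output has a space after "ID=", so strip both sides.
            fields[key.strip().lower()] = val.strip()
        networks[fields["id"]] = int(fields["win"])
    return networks


sample = "ID= 1111111,win=256;ID= 2222222,win=256"
print(parse_pe_network(sample))
```
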
Use bjobs -l to display network resource information for submitted PE jobs. For example:
bjobs -l
Job <2106>, User <user1>, Project <default>, Status <RUN>, Queue <normal>, Co
                     mmand <my_pe_job>
Fri Jun  1 20:44:42: Submitted from host <hostA>, CWD <$HOME>, Requested Network
                      <protocol=mpi: mode=US: type=sn_all: instance=1: usage=dedicated>

If mode=IP is specified for the PE job, instance is not displayed.
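
The requested network string in the bjobs -l output is a colon-separated list of key=value pairs. The following Python sketch splits it into a dictionary, assuming the layout shown in the example above. The function name parse_network_requirement is hypothetical, and the mode=IP string in the second call is an invented illustration of a requirement without the instance field, not captured output.

```python
from typing import Dict


def parse_network_requirement(text: str) -> Dict[str, str]:
    """Split a requested-network string into its key=value fields."""
    fields: Dict[str, str] = {}
    # Drop the surrounding angle brackets, then split on the ':' separators.
    for part in text.strip("<> ").split(":"):
        key, sep, val = part.partition("=")
        if sep:
            fields[key.strip()] = val.strip()
    return fields


us_req = parse_network_requirement(
    "<protocol=mpi: mode=US: type=sn_all: instance=1: usage=dedicated>"
)
print(us_req)

# Hypothetical mode=IP requirement: instance is absent, so fall back to a label.
ip_req = parse_network_requirement("<protocol=mpi: mode=IP: type=sn_all: usage=dedicated>")
print(ip_req.get("instance", "not displayed"))
```
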

Use bacct -l to display network resource allocations. For example:
bacct -l 210
Job <210>, User <user1>, Project <default>, Status <DONE>, Queue <normal>,
                     Command <my_pe_job>
Tue Jul 17 06:10:28: Submitted from host <hostA>, CWD </home/pe_jobs>;
Tue Jul 17 06:10:31: Dispatched to <hostA>, Effective RES_REQ <select[type 
                     == local] order[r15s:pg] rusage[mem=1.00] >, PE Network 
                     ID <1111111>  <2222222> used <1> window(s)
                     per network per task;
Tue Jul 17 06:11:31: Completed <done>.
Use bhist -l to display historical information about network resource requirements and information about network allocations for PE jobs. For example:
bhist -l 749
Job <749>, User <user1>, Project <default>, Command <my_pe_job>

Mon Jun  4 04:36:12: Submitted from host <hostB>, to Queue <
                     priority>, CWD <$HOME>, 2 Processors Requested, Network 
                     <protocols=mpi:mode=US: type=sn_all: instance=1:usage= dedicated>;
Mon Jun  4 04:36:15: Dispatched to 2 Hosts/Processors <hostB>,
                     Effective RES_REQ <select[ty
                     pe == local] rusage[nt1=1.00] >, PE Network 
                     ID <1111111>  <2222222> used <1> window(s)
                     per network per task;
Mon Jun  4 04:36:17: Starting (Pid 21006);
Use bhosts -l to display host-based network resource information for PE jobs. For example:
bhosts -l

...
PE NETWORK INFORMATION
NetworkID                       Status                      rsv_windows/total_windows   
1111111                         ok                                 4/64 
2222222                         closed_Dedicated                   4/64 

NetworkID is the physical network ID returned by PE.

Network Status is one of the following:
  • ok - normal status

  • closed_Full - all network windows are reserved

  • closed_Dedicated - a dedicated PE job is running on the network (usage=dedicated specified in the network resource requirement string)

  • unavail - network information is not available
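
The PE NETWORK INFORMATION rows can be read programmatically. The following Python sketch assumes the three-column layout shown in the bhosts -l example above. It parses each row, computes the windows that are not reserved, and looks up the status meaning from the list above. The names PeNetwork, parse_pe_network_rows, and STATUS_MEANING are hypothetical.

```python
from typing import List, NamedTuple

# Status meanings, copied from the list above.
STATUS_MEANING = {
    "ok": "normal status",
    "closed_Full": "all network windows are reserved",
    "closed_Dedicated": "a dedicated PE job is running on the network",
    "unavail": "network information is not available",
}


class PeNetwork(NamedTuple):
    network_id: str
    status: str
    reserved: int
    total: int

    @property
    def unreserved(self) -> int:
        """Windows that are not currently reserved."""
        return self.total - self.reserved


def parse_pe_network_rows(text: str) -> List[PeNetwork]:
    """Parse 'NetworkID Status rsv/total' rows, skipping anything else."""
    rows: List[PeNetwork] = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        reserved, sep, total = parts[2].partition("/")
        if not (sep and reserved.isdigit() and total.isdigit()):
            continue  # header row or unexpected content
        rows.append(PeNetwork(parts[0], parts[1], int(reserved), int(total)))
    return rows


sample = """\
NetworkID                       Status                      rsv_windows/total_windows
1111111                         ok                                 4/64
2222222                         closed_Dedicated                   4/64
"""
for net in parse_pe_network_rows(sample):
    print(net.network_id, net.unreserved, STATUS_MEANING.get(net.status, "unknown status"))
```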