Using Dynamic Cluster

This section contains information for all users of Dynamic Cluster. It describes how to submit and monitor jobs and demonstrates some basic commands to query job status and history.

Submitting Dynamic Cluster jobs

Configure Dynamic Cluster templates

The Dynamic Cluster template contains virtual machine template information necessary for Dynamic Cluster jobs.

Defining Dynamic Cluster jobs as VM jobs guarantees that they run on Dynamic Cluster hosts (that is, hosts that are marked dchost in the cluster file). In other words, submitting a job with the DC_MACHINE_TYPE=vm application profile parameter or the bsub option -dc_mtype vm guarantees that the job runs in a Dynamic Cluster VM.

There are two ways to submit Dynamic Cluster jobs:

  • Define Dynamic Cluster templates and other Dynamic Cluster parameters in the LSF application profile and submit the job with the application name.

  • Define Dynamic Cluster templates and other Dynamic Cluster parameters and submit the job with the bsub command.

    You cannot combine Dynamic Cluster parameters on the bsub command line with Dynamic Cluster parameters defined in the LSF application profile. If you define Dynamic Cluster templates on the command line, Dynamic Cluster parameters in the LSF application profile are ignored.
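
The precedence rule above can be modelled with a minimal Python sketch. The function name effective_dc_settings and the dictionary keys are hypothetical and are not part of LSF; the sketch only illustrates the documented behavior and does not call any LSF command.

```python
def effective_dc_settings(profile, cmdline):
    """Return the Dynamic Cluster settings that would apply to a job.

    Both arguments are plain dicts using the hypothetical keys
    "templates", "mtype", and "vmaction".
    """
    if cmdline.get("templates"):
        # Templates were given on the command line, so the Dynamic Cluster
        # parameters from the application profile are ignored entirely.
        return dict(cmdline)
    return dict(profile)


profile = {"templates": ["DC_VM_TMPL"], "mtype": "vm", "vmaction": "savevm"}
cmdline = {"templates": ["DC3", "DC4"]}
print(effective_dc_settings(profile, cmdline))  # command-line values only
print(effective_dc_settings(profile, {}))       # falls back to the profile
```

Note that in the first call the machine type from the profile is not inherited, which matches the rule that the two sources are never combined.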

Submit jobs using application profiles

The following parameters are supported in lsb.applications:

DC_MACHINE_TEMPLATES=template_name...

Specify all Dynamic Cluster machine templates that this job can use. Dynamic Cluster and the LSF scheduler may run the job on any suitable host.

DC_MACHINE_TYPE=vm

Specify this parameter if you require a VM for the job. By default, the system provisions any machine.

DC_VMJOB_PREEMPTION_ACTION=savevm

Specify this parameter to save the VM when jobs from this application profile are preempted. By default, low-priority jobs on the VM are not considered preemptable and keep running until they complete.
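
As an illustration, the following Python sketch renders an application profile section that sets these parameters. The profile name AP_VM and the template name DC_VM_TMPL are borrowed from the sample bapp output later in this section. The helper name render_application is hypothetical, and the Begin Application / End Application layout is assumed to be the usual lsb.applications section format, so compare the generated text with your own configuration before using it.

```python
def render_application(name, templates, machine_type=None, preemption_action=None):
    """Build the text of one lsb.applications section (illustrative only)."""
    lines = ["Begin Application", f"NAME={name}"]
    # Templates are listed on one line, separated by spaces.
    lines.append("DC_MACHINE_TEMPLATES=" + " ".join(templates))
    if machine_type:
        lines.append(f"DC_MACHINE_TYPE={machine_type}")
    if preemption_action:
        lines.append(f"DC_VMJOB_PREEMPTION_ACTION={preemption_action}")
    lines.append("End Application")
    return "\n".join(lines)


print(render_application("AP_VM", ["DC_VM_TMPL"], "vm", "savevm"))
```

A job would then be submitted against the profile by name, for example with bsub -app AP_VM myjob.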

Submit jobs using bsub

If the LSF application profile does not define the Dynamic Cluster machine template, the following options are supported with bsub:

-dc_tmpl template_name...

Specify the name of one or more Dynamic Cluster templates that the job can use. Using this option makes the job use Dynamic Cluster provisioning.

For example, to submit Dynamic Cluster jobs that can run on machines provisioned using the Dynamic Cluster template named "DC3" or "DC4":

-dc_tmpl "DC3 DC4"

When you define Dynamic Cluster templates on the command line, DC_MACHINE_TEMPLATES in lsb.applications is ignored.

-dc_mtype vm

If you used bsub -dc_tmpl, and you want the Dynamic Cluster job to be a VM job, you must use the bsub option -dc_mtype vm.

If no value is specified for -dc_mtype, Dynamic Cluster jobs run on any machine.

When you define Dynamic Cluster templates on the command line, DC_MACHINE_TYPE in lsb.applications is ignored.

-dc_vmaction action

If you used bsub -dc_tmpl and bsub -dc_mtype, and you want to specify an action on the VM if this job is preempted, you must use the bsub option -dc_vmaction action.

The following is a list of preemption actions that you can specify with this option:

  • -dc_vmaction savevm: Save the VM.

    Saving the VM allows this job to continue later on. This option defines the action that the lower priority (preempted) job should take upon preemption, not the one the higher priority (preempting) job should initiate.

  • -dc_vmaction livemigvm: Live migrate the VM (and the jobs running on it) from one hypervisor host to another.

    The system releases all resources normally used by the job from the hypervisor host, then migrates the job to the destination host without any detectable delay. During this time, the job remains in a RUN state.

  • -dc_vmaction requeuejob: Kill the VM job and resubmit it to the queue.

    The system kills the VM job and submits a new VM job request to the queue.

Note:

If no preemption action is specified, a low-priority VM job is not preempted. It runs to completion even if a higher-priority job needs the VM resources.

When you define the preemption action on the command line, DC_VMJOB_PREEMPTION_ACTION in lsb.applications is ignored.
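
The three options can be assembled programmatically, as in the following Python sketch. The helper name build_bsub_args is hypothetical. Treating a preemption action without -dc_mtype as an error is an assumption based on the description above, not confirmed LSF behavior.

```python
import shlex

# Preemption actions listed in this section.
VM_ACTIONS = {"savevm", "livemigvm", "requeuejob"}


def build_bsub_args(command, templates, mtype=None, vmaction=None):
    """Return an argument list for bsub with Dynamic Cluster options."""
    if not templates:
        raise ValueError("at least one Dynamic Cluster template is required")
    # Several template names are passed as one space-separated argument.
    args = ["bsub", "-dc_tmpl", " ".join(templates)]
    if mtype is not None:
        args += ["-dc_mtype", mtype]
    if vmaction is not None:
        # Assumption: -dc_vmaction is only meaningful together with -dc_mtype.
        if mtype is None:
            raise ValueError("-dc_vmaction requires -dc_mtype")
        if vmaction not in VM_ACTIONS:
            raise ValueError(f"unknown preemption action: {vmaction}")
        args += ["-dc_vmaction", vmaction]
    return args + shlex.split(command)


args = build_bsub_args("myjob", ["DC3", "DC4"], mtype="vm", vmaction="savevm")
print(shlex.join(args))
```

The returned list can be passed directly to subprocess.run on a host where bsub is available; shlex.join is used here only to show the equivalent shell command line.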

Find available templates and application profiles

To see information about available Dynamic Cluster templates, run the bdc command on your LSF master host:

# bdc tmpl
NAME                MACHINE_TYPE        RESGROUP
RH_VM_TMPL          VM                  KVMRedHat_Hosts
RH_KVM             -                   - 
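
If you need this listing in a script, the table can be parsed as in the following Python sketch. The function name parse_bdc_tmpl is hypothetical, and the sketch assumes that columns are separated by whitespace and that no value contains a space, as in the sample output above.

```python
def parse_bdc_tmpl(output):
    """Parse the tabular output of `bdc tmpl` into a list of dicts."""
    rows = [line.split() for line in output.splitlines() if line.strip()]
    header, data = rows[0], rows[1:]
    # Pair each value with the column name from the header row.
    return [dict(zip(header, fields)) for fields in data]


sample = """\
NAME                MACHINE_TYPE        RESGROUP
RH_VM_TMPL          VM                  KVMRedHat_Hosts
RH_KVM             -                   -
"""
templates = parse_bdc_tmpl(sample)
vm_templates = [t["NAME"] for t in templates if t["MACHINE_TYPE"] == "VM"]
print(vm_templates)
```

In practice the sample string would be replaced by the captured output of bdc tmpl, for example from subprocess.run(["bdc", "tmpl"], capture_output=True, text=True).stdout.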

To find application profiles that will submit Dynamic Cluster jobs, and see which templates they use, run the bapp command:

# bapp -l
APPLICATION NAME: AP_PM
 -- Dynamic Cluster PM template
STATISTICS:
   NJOBS     PEND      RUN    SSUSP    USUSP      RSV
        2        0        2        0        0        0
PARAMETERS:
DC_MACHINE_TYPE: PM
DC_MACHINE_TEMPLATES: DC_PM_TMPL 
-------------------------------------------------------------------------------
APPLICATION NAME: AP_VM
 -- Dynamic Cluster 1G VM template
STATISTICS:
   NJOBS     PEND      RUN    SSUSP    USUSP      RSV 
       0        0        0        0        0        0
PARAMETERS:
DC_MACHINE_TYPE: VM
DC_VMJOB_PREEMPTION_ACTION: savevm
DC_MACHINE_TEMPLATES: DC_VM_TMPL
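
The mapping from application profiles to templates can be extracted from this output with a short script. The following Python sketch is one possible approach; the function name templates_by_profile is hypothetical, and the sketch relies only on the APPLICATION NAME and DC_MACHINE_TEMPLATES lines shown in the sample above.

```python
import re


def templates_by_profile(output):
    """Map each application profile name to its Dynamic Cluster templates."""
    result = {}
    current = None
    for line in output.splitlines():
        name = re.match(r"APPLICATION NAME:\s*(\S+)", line)
        if name:
            current = name.group(1)
            continue
        tmpl = re.match(r"DC_MACHINE_TEMPLATES:\s*(.*)", line)
        if tmpl and current is not None:
            result[current] = tmpl.group(1).split()
    return result


sample = """\
APPLICATION NAME: AP_PM
PARAMETERS:
DC_MACHINE_TYPE: PM
DC_MACHINE_TEMPLATES: DC_PM_TMPL
-------------------------------------------------------------------------------
APPLICATION NAME: AP_VM
PARAMETERS:
DC_MACHINE_TYPE: VM
DC_VMJOB_PREEMPTION_ACTION: savevm
DC_MACHINE_TEMPLATES: DC_VM_TMPL
"""
print(templates_by_profile(sample))
```

Profiles that do not define DC_MACHINE_TEMPLATES are simply absent from the result, so the returned keys are the profiles that submit Dynamic Cluster jobs.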

Resizable or chunk jobs

You cannot submit Dynamic Cluster jobs as resizable or chunk jobs. LSF rejects any Dynamic Cluster job submissions with resizable or chunk job options.

Define virtual resources

When you submit a single Dynamic Cluster job to a virtual machine, LSF by default requests a single-CPU virtual machine with at least 512 MB of memory. If your job requires more virtual CPUs or more memory, request these resources with the following bsub command line options:

-n num_slots

Requests a virtual machine instance with num_slots CPUs. When Dynamic Cluster powers on the virtual machine allocated for this job, it sets its vCPUs attribute to match the number of slots requested using this parameter. The default value is 1.

Note:

In the current release, a VM job can run only on a single VM on a single host. Therefore, at least one host in your cluster should have at least num_slots physical processors.

-R "rusage[mem=integer]"

Specifies the memory requirement for the job, to make sure that it runs in a virtual machine with at least integer MB of memory allocated to it. This value determines the actual memory size of the virtual machine for the job, as defined by the DC_VM_MEMSIZE_DEFINED parameter in dc_conf.LSF_cluster_name.xml and the DC_VM_MEMSIZE_STEP parameter in lsb.params.
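
The two resource options can be generated as in the following Python sketch. The helper name vm_resource_args is hypothetical, and the profile name AP_VM in the usage line is taken from the sample bapp output. The sketch only builds arguments; it does not model how DC_VM_MEMSIZE_DEFINED or DC_VM_MEMSIZE_STEP adjust the final VM size.

```python
def vm_resource_args(num_slots=1, mem_mb=None):
    """Return bsub arguments requesting vCPUs and memory for a VM job."""
    if num_slots < 1:
        raise ValueError("num_slots must be at least 1")
    args = ["-n", str(num_slots)]
    if mem_mb is not None:
        if mem_mb <= 0:
            raise ValueError("mem_mb must be positive")
        # The memory requirement is expressed in MB inside rusage[].
        args += ["-R", f"rusage[mem={mem_mb}]"]
    return args


# Hypothetical submission: 2 vCPUs and 2048 MB through the AP_VM profile.
print(["bsub", "-app", "AP_VM", *vm_resource_args(2, 2048), "myjob"])
```

Because the arguments are kept as a list, the rusage string needs no shell quoting when the list is passed to subprocess.run.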

Monitor Dynamic Cluster jobs

Use the LSF bjobs and bhist commands to check the status of Dynamic Cluster jobs.

Check the status of Dynamic Cluster jobs (bjobs)

bjobs -l indicates which virtual machine the Dynamic Cluster job is running on:

# bjobs -l 1936
Job <1936>, User <root>, Project <default>, Application <AP_vm>, Status <RUN>, 
                     Queue <normal>, Command <myjob>
Thu Jun  9 00:28:08: Submitted from host <vmodev04.corp.com>, CWD
                      </scratch/user1/testenv/lsf_dc/work/
                     cluster_dc/dc>, Re-runnable;
Thu Jun  9 00:28:14: Started on <host003>, Execution Home </root>, Execution CWD
                      </scratch/user1/testenv/lsf_dc/work/
                     cluster_dc/dc>, Execution rusage <[mem=1024.00]>;
Thu Jun  9 00:28:14: Running on virtual machine <vm0>;
Thu Jun  9 00:29:01: Resource usage collected.
                     MEM: 3 Mbytes;  SWAP: 137 Mbytes;  NTHREAD: 4
                     PGID: 11710;  PIDs: 11710 11711 11713 

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -
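
To pick out the virtual machine name in a script, you can search the output for the "Running on virtual machine" event. The following Python sketch shows one way to do this; the function name vm_for_job is hypothetical, and the sketch assumes that the event text is not wrapped across two output lines.

```python
import re


def vm_for_job(bjobs_output):
    """Return the VM name from `bjobs -l` output, or None if absent."""
    match = re.search(r"Running on virtual machine <([^>]+)>", bjobs_output)
    return match.group(1) if match else None


sample = (
    "Thu Jun  9 00:28:14: Started on <host003>, Execution Home </root>;\n"
    "Thu Jun  9 00:28:14: Running on virtual machine <vm0>;\n"
)
print(vm_for_job(sample))
```

A result of None indicates that the output contains no such event, for example because the job has not yet been dispatched to a VM.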

Display machine provisioning information (bhist)

Use bhist -l to display machine provisioning information such as the history of provisioning requests initiated for the job, as well as their results:

# bhist -l 1936
Job <1936>, User <root>, Project <default>, Application <AP_vm>, Command <myjob>
Thu Jun  9 00:28:08: Submitted from host <vmodev04.corp.com>, to 
                     Queue <normal>, CWD </scratch/user1/testenv/lsf_dc/work/
                     cluster_dc/dc>, Re-runnable;
Thu Jun  9 00:28:14: Provision <1> requested on 1 Hosts/Processors <host003>;
Thu Jun  9 00:28:14: Provision <1> completed; Waiting 1 Hosts/Processors <vm0> 
                     ready;
Thu Jun  9 00:28:14: Dispatched to <vm0>;
Thu Jun  9 00:28:14: Starting (Pid 11710);
Thu Jun  9 00:28:14: Running with execution home </root>,
                     Execution CWD </scratch/user1/testenv/lsf_dc/work/
                     cluster_dc/dc>, Execution Pid <11710>,Execution rusage <[mem=1024.00]>;
Summary of time in seconds spent in various states by  Thu Jun  9 00:30:53
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  6        0        159      0        0        0        165 
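
The provisioning events in this history can be collected with a regular expression, as in the following Python sketch. The function name provision_events is hypothetical. Only the "requested" and "completed" events that appear in the sample above are matched; other event wordings are not covered.

```python
import re

# Matches the two provisioning events shown in the sample bhist output.
PROVISION_RE = re.compile(r"Provision <(\d+)> (requested|completed)")


def provision_events(bhist_output):
    """Return (provision_id, event) pairs in the order they appear."""
    return [(int(pid), event) for pid, event in PROVISION_RE.findall(bhist_output)]


sample = (
    "Thu Jun  9 00:28:14: Provision <1> requested on 1 Hosts/Processors <host003>;\n"
    "Thu Jun  9 00:28:14: Provision <1> completed; Waiting 1 Hosts/Processors <vm0>\n"
    "                     ready;\n"
    "Thu Jun  9 00:28:14: Dispatched to <vm0>;\n"
)
print(provision_events(sample))
```

A provisioning ID that has a "requested" entry but no matching "completed" entry is a natural candidate for closer inspection with bdc hist.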

Display recent provisioning action information (bdc action)

Use bdc action to display information about recent provisioning actions. This command shows information from memory.

# bdc action
REQ_ID  JOB_ID       STATUS  BEGIN                END                  NACT
10075   10449        done    Thu Apr  4 15:45:00  Thu Apr  4 15:45:26  1
10076   10461        done    Thu Apr  4 15:45:06  Thu Apr  4 15:46:06  2
10077   -            done    Thu Apr  4 15:45:26  Thu Apr  4 15:45:56  1
10078   -            done    Thu Apr  4 15:45:56  Thu Apr  4 15:46:26  1
10079   10461        done    Thu Apr  4 15:48:56  Thu Apr  4 15:49:46  1
10080   10453        done    Thu Apr  4 15:55:46  Thu Apr  4 15:56:16  1
10081   10456        done    Thu Apr  4 15:55:46  Thu Apr  4 15:56:16  1
10082   10454        done    Thu Apr  4 15:55:46  Thu Apr  4 15:56:16  1
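
The summary rows can be turned into records with durations, as in the following Python sketch. The function name parse_bdc_action is hypothetical. The sketch assumes the column layout shown above, English day and month abbreviations, and a year supplied by the caller, because the summary timestamps do not include one.

```python
from datetime import datetime

TIME_FORMAT = "%Y %a %b %d %H:%M:%S"


def parse_bdc_action(output, year):
    """Parse `bdc action` summary rows into dicts with a duration in seconds."""
    actions = []
    for line in output.splitlines():
        fields = line.split()
        # A complete row has 3 leading columns, two 4-token timestamps, and
        # NACT. This check also skips the header and any incomplete rows.
        if len(fields) != 12:
            continue
        begin = datetime.strptime(f"{year} " + " ".join(fields[3:7]), TIME_FORMAT)
        end = datetime.strptime(f"{year} " + " ".join(fields[7:11]), TIME_FORMAT)
        actions.append({
            "req_id": fields[0],
            "job_id": None if fields[1] == "-" else fields[1],
            "status": fields[2],
            "seconds": int((end - begin).total_seconds()),
            "nact": int(fields[11]),
        })
    return actions


sample = """\
REQ_ID  JOB_ID       STATUS  BEGIN                END                  NACT
10075   10449        done    Thu Apr  4 15:45:00  Thu Apr  4 15:45:26  1
10076   10461        done    Thu Apr  4 15:45:06  Thu Apr  4 15:46:06  2
10077   -            done    Thu Apr  4 15:45:26  Thu Apr  4 15:45:56  1
"""
for action in parse_bdc_action(sample, 2013):
    print(action["req_id"], action["job_id"], action["seconds"])
```

Rows whose JOB_ID column is "-" are returned with job_id set to None, which makes it easy to separate provisioning actions that are not tied to a job.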

Use bdc action -p prov_id to display information about a specific provisioning action by specifying its provisioning ID:

# bdc action -p 10101
REQ_ID<10101>
JOB_ID       STATUS  BEGIN                     END                       NACT
10472        done    Thu Apr  4 15:56:06 2013  Thu Apr  4 15:57:26 2013  2

HOSTS           i43

<Action details>
ACTIONID       1.1.1
ACTION         INSTALL_VM
STATUS         done
TARGET
HOSTNAME
DC_TEMPLATE    rhel62
HYPERVISOR     i43

ACTIONID       1.2.1
ACTION         NEW_VM
STATUS         done
TARGET         45edf7a9-2651-4798-8476-5e096823e1a2
HOSTNAME       platdemodc23
DC_TEMPLATE    rhel62
HYPERVISOR     i43

Use bdc action -j job_id to display the provisioning actions associated with a specific job. Use bdc action -l -j job_id to display details about those provisioning actions.

# bdc action -j 10462
REQ_ID  JOB_ID       STATUS  BEGIN                END                  NACT
10091   10462        done    Thu Apr  4 15:56:06  Thu Apr  4 15:57:06  2
10124   10462        done    Thu Apr  4 16:00:15  Thu Apr  4 16:01:44  1

# bdc action -l -j 10462
REQ_ID<10091>
JOB_ID       STATUS  BEGIN                     END                       NACT
10462        done    Thu Apr  4 15:56:06 2013  Thu Apr  4 15:57:06 2013  2

HOSTS           i42

<Action details>
ACTIONID       1.1.1
ACTION         INSTALL_VM
STATUS         done
TARGET
HOSTNAME
DC_TEMPLATE    rhel62
HYPERVISOR     i42

ACTIONID       1.2.1
ACTION         NEW_VM
STATUS         done
TARGET         c8ee6200-f080-47ca-8595-4f44d647cb30
HOSTNAME       platdemodc3
DC_TEMPLATE    rhel62
HYPERVISOR     i42

REQ_ID<10124>
JOB_ID       STATUS  BEGIN                     END                       NACT
10462        done    Thu Apr  4 16:00:15 2013  Thu Apr  4 16:01:44 2013  1

HOSTS           i42

<Action details>
ACTIONID       1.1.1
ACTION         CHECKPOINT_VM
STATUS         done
TARGET         c8ee6200-f080-47ca-8595-4f44d647cb30
HOSTNAME       platdemodc3
DC_TEMPLATE
HYPERVISOR     i42

Display machine provisioning request history (bdc hist)

Use bdc hist to display historical information about machine provisioning requests. This command shows information from the event log files. Its options are similar to those of bdc action, including -p to display information about a specific provisioning action and -j to display information about a specific job. However, if a provisioning action fails, bdc hist also shows the error message from Platform Cluster Manager. For example, the last entry in the following output is the error message from Platform Cluster Manager:

# bdc hist -l -p 4029
Provision request <4029> for Job <7653>
Wed Mar  6 12:59:02: Requested on 1 Hosts <hb05b15.mc.platformlab.ibm.com>; Power on 1 Machine with
                     Template <WIN2K8> Processors <1> Memory <1024 MB>
Wed Mar  6 12:59:03: Requested Power on 1 Machine <1ac50039-7851-4f30-acd3-c5b4701afd48>
Wed Mar  6 12:59:19: Failed Power on 1 Machine <1ac50039-7851-4f30-acd3-c5b4701afd48>
Wed Mar  6 12:59:19: Request failed: com.platform.rfi.manager.exceptions.RFIMachineNotFoundException
                     : Machine ID 1ac50039-7851-4f30-acd3-c5b4701afd48 is not found.

Commands with job counters combine PROV and RUN job states

Certain LSF commands report job details, while others report only counters (for example, 10 RUN jobs, 15 PEND jobs). The commands that report only counters, which include bqueues and bapp, treat PROV jobs as identical to RUN jobs, so the counter for RUN jobs also includes PROV jobs. This is because PROV is a special type of RUN job: it is a job in a RUN state with an active provision action.

For example, if there are 10 RUN jobs and 10 PROV jobs, commands that report job details (such as bjobs and bhist) report 10 RUN jobs and 10 PROV jobs, while commands that report job counters (such as bqueues and bapp) report 20 RUN jobs.
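
The arithmetic in this example can be reproduced with a short Python sketch. The function name counter_view is hypothetical; it models only the folding of PROV into RUN described above and does not call any LSF command.

```python
from collections import Counter


def counter_view(job_states):
    """Fold PROV into RUN, as counter-only commands such as bqueues do."""
    counts = Counter(job_states)
    # PROV jobs are counted as RUN jobs; remove the separate PROV entry.
    counts["RUN"] += counts.pop("PROV", 0)
    return dict(counts)


states = ["RUN"] * 10 + ["PROV"] * 10
print(Counter(states))       # detail view keeps RUN and PROV separate
print(counter_view(states))  # counter view reports a single RUN total
```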