This section contains information for all users of Dynamic Cluster. It describes how to submit and monitor jobs and demonstrates some basic commands to query job status and history.
The Dynamic Cluster template contains virtual machine template information necessary for Dynamic Cluster jobs.
Submitting a job as a Dynamic Cluster VM job guarantees that it runs in a Dynamic Cluster VM on a Dynamic Cluster host (that is, a host that is marked dchost in the cluster file). A job is a VM job if it is submitted with the DC_MACHINE_TYPE=vm application profile parameter or with the bsub option -dc_mtype vm.
There are two ways to submit Dynamic Cluster jobs:
Define Dynamic Cluster templates and other Dynamic Cluster parameters in the LSF application profile and submit the job with the application name.
Define Dynamic Cluster templates and other Dynamic Cluster parameters and submit the job with the bsub command.
You cannot combine Dynamic Cluster parameters on the bsub command line with Dynamic Cluster parameters defined in the LSF application profile. If you define Dynamic Cluster templates on the command line, Dynamic Cluster parameters in the LSF application profile are ignored.
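The precedence rule above can be pictured with a short Python sketch. The function name and dictionary keys are hypothetical and only model the rule as stated in this section; they do not describe how LSF is implemented.

```python
def effective_dc_settings(profile, cmdline):
    """Model the rule: command-line templates replace the profile, never merge with it."""
    if cmdline.get("templates"):
        # Templates were given on the command line, so every Dynamic Cluster
        # parameter from the application profile is ignored.
        return dict(cmdline)
    # No templates on the command line: the application profile applies as-is.
    return dict(profile)


# Hypothetical settings for illustration.
profile = {"templates": ["DC_VM_TMPL"], "machine_type": "vm", "vm_action": "savevm"}
cmdline = {"templates": ["DC3", "DC4"]}

print(effective_dc_settings(profile, cmdline))
```

Note that the machine type and preemption action from the profile do not carry over once templates are named on the command line; they would have to be repeated with the corresponding bsub options.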
DC_MACHINE_TEMPLATES: Specify all Dynamic Cluster machine templates that this job can use. The Dynamic Cluster and LSF scheduler may run the job on any suitable host.
DC_MACHINE_TYPE: Set this parameter to vm if you require a VM for the job. By default, the system provisions any machine.
DC_VMJOB_PREEMPTION_ACTION: Set this parameter to savevm to save the VM when jobs from this application profile are preempted. By default, low priority jobs on the VM are not considered preemptable and keep running until they complete.
If the LSF application profile does not define the Dynamic Cluster machine template, the following options are supported with bsub:
-dc_tmpl: Specify the name of one or more Dynamic Cluster templates that the job can use. Using this option makes the job use Dynamic Cluster provisioning.
For example, to submit Dynamic Cluster jobs that can run on machines provisioned using the Dynamic Cluster template named "DC3" or "DC4":
-dc_tmpl "DC3 DC4"
When you define Dynamic Cluster templates on the command line, DC_MACHINE_TEMPLATES in lsb.applications is ignored.
If you used bsub -dc_tmpl, and you want the Dynamic Cluster job to be a VM job, you must use the bsub option -dc_mtype vm.
If no value is specified for -dc_mtype, Dynamic Cluster jobs run on any machine.
When you define Dynamic Cluster templates on the command line, DC_MACHINE_TYPE in lsb.applications is ignored.
If you used bsub -dc_tmpl and bsub -dc_mtype, and you want to specify an action on the VM if this job is preempted, you must use the bsub option -dc_vmaction action.
You can specify the following preemption actions with this option:
-dc_vmaction savevm: Save the VM.
Saving the VM allows this job to continue later on. This option defines the action that the lower priority (preempted) job should take upon preemption, not the one the higher priority (preempting) job should initiate.
-dc_vmaction livemigvm: Live migrate the VM (and the jobs running on it) from one hypervisor host to another.
The system releases all resources normally used by the job from the hypervisor host, then migrates the job to the destination host without any detectable delay. During this time, the job remains in a RUN state.
-dc_vmaction requeuejob: Kill the VM job and resubmit it to the queue.
The system kills the VM job and submits a new VM job request to the queue.
If this option is not specified, a low priority VM job is not preempted. It runs to completion even if a higher priority job needs the VM resources.
When you define the preemption action on the command line, DC_VMJOB_PREEMPTION_ACTION in lsb.applications is ignored.
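The dependencies between the three bsub options described above can be captured in a small helper that assembles an argument list. This is a minimal sketch: build_bsub_args is a hypothetical name, only the options named in this section are used, and the helper builds the command without running it.

```python
import shlex

# Preemption actions listed in this section.
VM_ACTIONS = {"savevm", "livemigvm", "requeuejob"}


def build_bsub_args(command, templates, mtype=None, vmaction=None):
    """Assemble a bsub argument list for a Dynamic Cluster job (hypothetical helper)."""
    if not templates:
        raise ValueError("-dc_tmpl needs at least one template name")
    # Several template names are passed as one space-separated value.
    args = ["bsub", "-dc_tmpl", " ".join(templates)]
    if mtype is not None:
        args += ["-dc_mtype", mtype]
    if vmaction is not None:
        # -dc_vmaction is described as used together with -dc_tmpl and -dc_mtype.
        if mtype is None:
            raise ValueError("-dc_vmaction requires -dc_mtype")
        if vmaction not in VM_ACTIONS:
            raise ValueError(f"unknown preemption action: {vmaction}")
        args += ["-dc_vmaction", vmaction]
    return args + list(command)


args = build_bsub_args(["myjob"], ["DC3", "DC4"], mtype="vm", vmaction="savevm")
print(shlex.join(args))
```

Passing the list to subprocess.run avoids shell quoting issues with the space-separated template names; shlex.join is used here only to show the equivalent command line.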
To see information about available Dynamic Cluster templates, run the bdc command on your LSF master host:
# bdc tmpl
NAME          MACHINE_TYPE  RESGROUP
RH_VM_TMPL    VM            KVMRedHat_Hosts
RH_KVM        -             -
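If you need the template list in a script, the table can be split into records. The following is a sketch that assumes the three-column layout shown above, with whitespace-separated columns and no spaces inside values; parse_bdc_tmpl is a hypothetical helper name.

```python
def parse_bdc_tmpl(text):
    """Turn `bdc tmpl` output into a list of dicts keyed by the header names."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


# Sample copied from the output shown above.
sample = """NAME MACHINE_TYPE RESGROUP
RH_VM_TMPL VM KVMRedHat_Hosts
RH_KVM - -
"""

templates = parse_bdc_tmpl(sample)
vm_templates = [t["NAME"] for t in templates if t["MACHINE_TYPE"] == "VM"]
print(vm_templates)
```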
To find application profiles that will submit Dynamic Cluster jobs, and see which templates they use, run the bapp command:
# bapp -l
APPLICATION NAME: AP_PM
-- Dynamic Cluster PM template
STATISTICS:
   NJOBS     PEND      RUN    SSUSP    USUSP      RSV
       2        0        2        0        0        0
PARAMETERS:
DC_MACHINE_TYPE: PM
DC_MACHINE_TEMPLATES: DC_PM_TMPL
-------------------------------------------------------------------------------
APPLICATION NAME: AP_VM
-- Dynamic Cluster 1G VM template
STATISTICS:
   NJOBS     PEND      RUN    SSUSP    USUSP      RSV
       0        0        0        0        0        0
PARAMETERS:
DC_MACHINE_TYPE: VM
DC_VMJOB_PREEMPTION_ACTION: savevm
DC_MACHINE_TEMPLATES: DC_VM_TMPL
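To pick out the Dynamic Cluster profiles from a long bapp -l listing, you can scan for the DC_ parameters under each application name. This sketch assumes the "APPLICATION NAME:" and "PARAMETER: value" layout shown above; dc_profiles is a hypothetical helper name.

```python
import re


def dc_profiles(bapp_output):
    """Map each application profile name to the DC_* parameters it defines."""
    profiles = {}
    current = None
    for raw in bapp_output.splitlines():
        line = raw.strip()
        m = re.match(r"APPLICATION NAME:\s*(\S+)", line)
        if m:
            current = profiles.setdefault(m.group(1), {})
            continue
        m = re.match(r"(DC_[A-Z_]+):\s*(.+)", line)
        if m and current is not None:
            current[m.group(1)] = m.group(2).strip()
    # Keep only the profiles that name at least one Dynamic Cluster template.
    return {name: p for name, p in profiles.items() if "DC_MACHINE_TEMPLATES" in p}


# Abbreviated sample based on the output shown above.
sample = """APPLICATION NAME: AP_PM
PARAMETERS:
DC_MACHINE_TYPE: PM
DC_MACHINE_TEMPLATES: DC_PM_TMPL
APPLICATION NAME: AP_VM
PARAMETERS:
DC_MACHINE_TYPE: VM
DC_VMJOB_PREEMPTION_ACTION: savevm
DC_MACHINE_TEMPLATES: DC_VM_TMPL
"""

found = dc_profiles(sample)
for name, params in found.items():
    print(name, params["DC_MACHINE_TEMPLATES"])
```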
You cannot submit Dynamic Cluster jobs as resizable or chunk jobs. LSF rejects any Dynamic Cluster job submissions with resizable or chunk job options.
Requests a virtual machine instance with num_slots CPUs. When Dynamic Cluster powers on the virtual machine allocated for this job, it sets its vCPUs attribute to match the number of slots requested using this parameter. The default value is 1.
In the current release, a VM job can run only on a single VM on a single host. Therefore, at least one host in your cluster should have at least num_slots physical processors.
Specifies the memory requirement for the job, to make sure that it runs in a virtual machine with at least integer MB of memory allocated to it. This value determines the actual memory size of the virtual machine for the job, as defined by the DC_VM_MEMSIZE_DEFINED parameter in dc_conf.LSF_cluster_name.xml, and the DC_VM_MEMSIZE_STEP parameter in lsb.params.
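This section does not spell out how the two parameters turn a request into a VM memory size. The sketch below shows one plausible reading and is an assumption, not confirmed behavior: if a list of defined sizes exists, the smallest defined size that fits the request is used; otherwise the request is rounded up to a multiple of the step. The function name, the error case, and the sample values are all hypothetical.

```python
def vm_memory_mb(requested_mb, step_mb, defined_sizes=None):
    """Pick a VM memory size for a request (assumed rule, for illustration only)."""
    if defined_sizes:
        # Assumption: choose the smallest defined size that satisfies the request.
        fitting = [size for size in sorted(defined_sizes) if size >= requested_mb]
        if not fitting:
            raise ValueError("request exceeds every defined VM memory size")
        return fitting[0]
    # Assumption: otherwise round up to the next multiple of the step.
    steps = -(-requested_mb // step_mb)  # ceiling division on integers
    return steps * step_mb


print(vm_memory_mb(1500, step_mb=512))
print(vm_memory_mb(1500, step_mb=512, defined_sizes=[1024, 2048, 4096]))
```

Check the parameter descriptions for dc_conf.LSF_cluster_name.xml and lsb.params before relying on either branch.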
Use the LSF bjobs and bhist commands to check the status of Dynamic Cluster jobs.
bjobs -l indicates which virtual machine the Dynamic Cluster job is running on:
# bjobs -l 1936
Job <1936>, User <root>, Project <default>, Application <AP_vm>, Status <RUN>,
Queue <normal>, Command <myjob>
Thu Jun 9 00:28:08: Submitted from host <vmodev04.corp.com>, CWD
</scratch/user1/testenv/lsf_dc/work/
cluster_dc/dc>, Re-runnable;
Thu Jun 9 00:28:14: Started on <host003>, Execution Home </root>, Execution CWD
</scratch/user1/testenv/lsf_dc/work/
cluster_dc/dc>, Execution rusage <[mem=1024.00]>;
Thu Jun 9 00:28:14: Running on virtual machine <vm0>;
Thu Jun 9 00:29:01: Resource usage collected.
MEM: 3 Mbytes; SWAP: 137 Mbytes; NTHREAD: 4
PGID: 11710; PIDs: 11710 11711 11713
SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut    pg    io   ls    it   tmp   swp   mem
loadSched   -     -     -     -     -     -    -     -     -     -     -
loadStop    -     -     -     -     -     -    -     -     -     -     -
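A script that needs only the VM name can search the bjobs -l text for the "Running on virtual machine" event. This is a sketch that assumes the event wording shown above and that the phrase is not wrapped across lines; running_vm is a hypothetical helper name.

```python
import re


def running_vm(bjobs_output):
    """Return the VM name reported by `bjobs -l`, or None if no VM event is present."""
    m = re.search(r"Running on virtual machine <([^>]+)>", bjobs_output)
    return m.group(1) if m else None


# Excerpt from the output shown above.
sample = """Job <1936>, User <root>, Project <default>, Application <AP_vm>, Status <RUN>,
Thu Jun 9 00:28:14: Started on <host003>, Execution Home </root>, Execution CWD
Thu Jun 9 00:28:14: Running on virtual machine <vm0>;
"""

print(running_vm(sample))
```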
Use bhist -l to display machine provisioning information such as the history of provisioning requests initiated for the job, as well as their results:
# bhist -l 1936
Job <1936>, User <root>, Project <default>, Application <AP_vm>, Command <myjob>
Thu Jun 9 00:28:08: Submitted from host <vmodev04.corp.com>, to
Queue <normal>, CWD </scratch/user1/testenv/lsf_dc/work/
cluster_dc/dc>, Re-runnable;
Thu Jun 9 00:28:14: Provision <1> requested on 1 Hosts/Processors <host003>;
Thu Jun 9 00:28:14: Provision <1> completed; Waiting 1 Hosts/Processors <vm0>
ready;
Thu Jun 9 00:28:14: Dispatched to <vm0>;
Thu Jun 9 00:28:14: Starting (Pid 11710);
Thu Jun 9 00:28:14: Running with execution home </root>,
Execution CWD </scratch/user1/testenv/lsf_dc/work/
cluster_dc/dc>, Execution Pid <11710>,Execution rusage <[mem=1024.00]>;
Summary of time in seconds spent in various states by Thu Jun 9 00:30:53
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  6        0        159      0        0        0        165
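The provisioning events can be pulled out of the bhist -l text with a regular expression. This sketch recognizes only the two event wordings that appear in the sample above ("requested" and "completed"); other wordings, such as those for failed requests, are not covered, and provision_events is a hypothetical helper name.

```python
import re

# Timestamp, provision ID, and event type, as they appear in the sample output.
EVENT = re.compile(
    r"^(\w{3} \w{3} +\d+ \d\d:\d\d:\d\d): Provision <(\d+)> (requested|completed)",
    re.MULTILINE,
)


def provision_events(bhist_output):
    """Return (provision_id, event, timestamp) tuples in the order they appear."""
    return [(m.group(2), m.group(3), m.group(1)) for m in EVENT.finditer(bhist_output)]


# Excerpt from the output shown above.
sample = """Thu Jun 9 00:28:14: Provision <1> requested on 1 Hosts/Processors <host003>;
Thu Jun 9 00:28:14: Provision <1> completed; Waiting 1 Hosts/Processors <vm0>
ready;
Thu Jun 9 00:28:14: Dispatched to <vm0>;
"""

for event in provision_events(sample):
    print(event)
```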
Use bdc action to display information about recent provisioning actions. This command shows information from memory.
# bdc action
REQ_ID  JOB_ID  STATUS  BEGIN               END                 NACT
10075   10449   done    Thu Apr 4 15:45:00  Thu Apr 4 15:45:26  1
10076   10461   done    Thu Apr 4 15:45:06  Thu Apr 4 15:46:06  2
10077   -       done    Thu Apr 4 15:45:26  Thu Apr 4 15:45:56  1
10078   -       done    Thu Apr 4 15:45:56  Thu Apr 4 15:46:26  1
10079   10461   done    Thu Apr 4 15:48:56  Thu Apr 4 15:49:46  1
10080   10453   done    Thu Apr 4 15:55:46  Thu Apr 4 15:56:16  1
10081   10456   done    Thu Apr 4 15:55:46  Thu Apr 4 15:56:16  1
10082   10454   done    Thu Apr 4 15:55:46  Thu Apr 4 15:56:16  1
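To see how long each provisioning request took, the summary table can be parsed and the BEGIN and END columns subtracted. This is a sketch under two assumptions: each row has the twelve whitespace-separated fields shown above, and the year is supplied by the caller because the summary table does not print it. parse_bdc_action is a hypothetical helper name.

```python
from datetime import datetime

TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_bdc_action(text, year):
    """Parse the `bdc action` summary table into a list of dicts."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        f = line.split()
        if len(f) != 12:
            continue  # not a summary row in the layout assumed here
        begin = datetime.strptime(" ".join(f[3:7] + [str(year)]), TIME_FORMAT)
        end = datetime.strptime(" ".join(f[7:11] + [str(year)]), TIME_FORMAT)
        rows.append({
            "req_id": f[0],
            "job_id": None if f[1] == "-" else f[1],  # "-" means no job is attached
            "status": f[2],
            "seconds": int((end - begin).total_seconds()),
            "nact": int(f[11]),
        })
    return rows


# First rows of the output shown above.
sample = """REQ_ID JOB_ID STATUS BEGIN END NACT
10075 10449 done Thu Apr 4 15:45:00 Thu Apr 4 15:45:26 1
10076 10461 done Thu Apr 4 15:45:06 Thu Apr 4 15:46:06 2
10077 - done Thu Apr 4 15:45:26 Thu Apr 4 15:45:56 1
"""

rows = parse_bdc_action(sample, year=2013)
for row in rows:
    print(row["req_id"], row["job_id"], row["seconds"])
```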
Use bdc action -p prov_id to display information about the provisioning action with the specified provisioning ID:
# bdc action -p 10101
REQ_ID<10101>
JOB_ID STATUS BEGIN END NACT
10472 done Thu Apr 4 15:56:06 2013 Thu Apr 4 15:57:26 2013 2
HOSTS i43
<Action details>
ACTIONID 1.1.1
ACTION INSTALL_VM
STATUS done
TARGET
HOSTNAME
DC_TEMPLATE rhel62
HYPERVISOR i43
ACTIONID 1.2.1
ACTION NEW_VM
STATUS done
TARGET 45edf7a9-2651-4798-8476-5e096823e1a2
HOSTNAME platdemodc23
DC_TEMPLATE rhel62
HYPERVISOR i43
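The action details form one "KEY value" pair per line, with each action starting at an ACTIONID line. The following sketch groups them into one dictionary per action. It assumes the layout shown above, where a key with no value (such as TARGET for INSTALL_VM) appears alone on its line; parse_action_details is a hypothetical helper name.

```python
def parse_action_details(text):
    """Group the lines after '<Action details>' into one dict per ACTIONID."""
    actions = []
    in_details = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "<Action details>":
            in_details = True
            continue
        if line.startswith("REQ_ID"):
            in_details = False  # a new request block starts; wait for its details
            continue
        if not in_details:
            continue
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if key == "ACTIONID":
            actions.append({})
        if actions:
            actions[-1][key] = value
    return actions


# Sample copied from the output shown above.
sample = """REQ_ID<10101>
JOB_ID STATUS BEGIN END NACT
10472 done Thu Apr 4 15:56:06 2013 Thu Apr 4 15:57:26 2013 2
HOSTS i43
<Action details>
ACTIONID 1.1.1
ACTION INSTALL_VM
STATUS done
TARGET
HOSTNAME
DC_TEMPLATE rhel62
HYPERVISOR i43
ACTIONID 1.2.1
ACTION NEW_VM
STATUS done
TARGET 45edf7a9-2651-4798-8476-5e096823e1a2
HOSTNAME platdemodc23
DC_TEMPLATE rhel62
HYPERVISOR i43
"""

actions = parse_action_details(sample)
for action in actions:
    print(action["ACTIONID"], action["ACTION"], action["STATUS"])
```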
Use bdc action -j job_id to display the provisioning actions associated with a specific job. Use bdc action -l -j job_id to display details on those provisioning actions.
# bdc action -j 10462
REQ_ID  JOB_ID  STATUS  BEGIN               END                 NACT
10091   10462   done    Thu Apr 4 15:56:06  Thu Apr 4 15:57:06  2
10124   10462   done    Thu Apr 4 16:00:15  Thu Apr 4 16:01:44  1
# bdc action -l -j 10462
REQ_ID<10091>
JOB_ID STATUS BEGIN END NACT
10462 done Thu Apr 4 15:56:06 2013 Thu Apr 4 15:57:06 2013 2
HOSTS i42
<Action details>
ACTIONID 1.1.1
ACTION INSTALL_VM
STATUS done
TARGET
HOSTNAME
DC_TEMPLATE rhel62
HYPERVISOR i42
ACTIONID 1.2.1
ACTION NEW_VM
STATUS done
TARGET c8ee6200-f080-47ca-8595-4f44d647cb30
HOSTNAME platdemodc3
DC_TEMPLATE rhel62
HYPERVISOR i42
REQ_ID<10124>
JOB_ID STATUS BEGIN END NACT
10462 done Thu Apr 4 16:00:15 2013 Thu Apr 4 16:01:44 2013 1
HOSTS i42
<Action details>
ACTIONID 1.1.1
ACTION CHECKPOINT_VM
STATUS done
TARGET c8ee6200-f080-47ca-8595-4f44d647cb30
HOSTNAME platdemodc3
DC_TEMPLATE
HYPERVISOR i42
Use bdc hist to display historic information about machine provisioning requests. This command shows information from the event log files. The options for this command are similar to bdc action, including the use of -p to display information on a specific provisioning action and -j to display information on a specific job. However, if a provisioning action fails, bdc hist also shows the error message from Platform Cluster Manager. For example, the last line in the following output is the error message from Platform Cluster Manager:
# bdc hist -l -p 4029
Provision request <4029> for Job <7653>
Wed Mar 6 12:59:02: Requested on 1 Hosts <hb05b15.mc.platformlab.ibm.com>; Power on 1 Machine with
Template <WIN2K8> Processors <1> Memory <1024 MB>
Wed Mar 6 12:59:03: Requested Power on 1 Machine <1ac50039-7851-4f30-acd3-c5b4701afd48>
Wed Mar 6 12:59:19: Failed Power on 1 Machine <1ac50039-7851-4f30-acd3-c5b4701afd48>
Wed Mar 6 12:59:19: Request failed: com.platform.rfi.manager.exceptions.RFIMachineNotFoundException
: Machine ID 1ac50039-7851-4f30-acd3-c5b4701afd48 is not found.
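When checking many failed requests, it can help to extract just the error text. This sketch assumes the "Request failed:" wording shown above and that the error message is the last event in the output, possibly wrapped over several lines; provisioning_error is a hypothetical helper name.

```python
import re


def provisioning_error(bdc_hist_output):
    """Return the text after 'Request failed:' as a single line, or None if absent."""
    m = re.search(r"Request failed:\s*(.*)", bdc_hist_output, re.DOTALL)
    if not m:
        return None
    # Collapse the line wrap and any indentation into single spaces.
    return " ".join(m.group(1).split())


# Last lines of the output shown above.
sample = """Wed Mar 6 12:59:19: Failed Power on 1 Machine <1ac50039-7851-4f30-acd3-c5b4701afd48>
Wed Mar 6 12:59:19: Request failed: com.platform.rfi.manager.exceptions.RFIMachineNotFoundException
: Machine ID 1ac50039-7851-4f30-acd3-c5b4701afd48 is not found.
"""

print(provisioning_error(sample))
```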
Certain LSF commands report job details, while others report only counters (for example, 10 RUN jobs, 15 PEND jobs). The commands that report only counters, which include bqueues and bapp, treat PROV jobs as identical to RUN jobs, so the counter for RUN jobs also includes PROV jobs. This is because PROV is a special type of RUN job: it is a job in a RUN state with an active provisioning action.
For example, if there are 10 RUN jobs and 10 PROV jobs, commands that report job details (such as bjobs and bhist) report 10 RUN jobs and 10 PROV jobs, while commands that report job counters (such as bqueues and bapp) report 20 RUN jobs.
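The difference between the two views comes down to simple counting, as the following sketch shows. It models the rule described above and does not call any LSF command; counter_view is a hypothetical helper name.

```python
from collections import Counter


def counter_view(job_states):
    """Return (detailed, summary) counts; the summary folds PROV into RUN."""
    detailed = Counter(job_states)
    summary = Counter()
    for state, count in detailed.items():
        summary["RUN" if state == "PROV" else state] += count
    return detailed, summary


states = ["RUN"] * 10 + ["PROV"] * 10
detailed, summary = counter_view(states)
print(detailed["RUN"], detailed["PROV"])  # detail view: 10 10
print(summary["RUN"])                     # counter view: 20
```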