This chapter describes the IBM Platform Dynamic Cluster (Dynamic Cluster) system architecture and basic concepts. It also explains the benefits of using Dynamic Cluster and introduces the concepts you need to effectively administer an IBM Platform LSF (LSF) cluster with Dynamic Cluster enabled.
This guide assumes that you have a good knowledge of standard LSF features, as well as a working familiarity with common virtual infrastructure concepts such as hypervisors and virtual machines.
The following figure shows the high-level architecture of Dynamic Cluster.
PVMO (Physical and Virtual Machine Orchestrator) is the component of Platform Cluster Manager that interacts with the underlying virtual provisioning systems. The Platform Cluster Manager master host cannot be used to run virtual machines.
Because Dynamic Cluster relies on Platform Cluster Manager for provisioning, an installation of Dynamic Cluster requires an installation of both LSF and Platform Cluster Manager. You can configure which hosts in your cluster can participate in dynamic provisioning. This allows you to isolate Dynamic Cluster functionality to as small or as large a subset of a standard LSF cluster as you wish.
Dynamic Cluster supports virtual machine provisioning. The LSF scheduler with the Dynamic Cluster module enabled is provisioning-aware, so there are no race conditions or "two-brain" scheduling issues.
Without Dynamic Cluster, LSF finds suitable resources and schedules jobs, but the resource attributes are fixed, and some jobs may be pending while resources that do not match the job’s requirements are idle. With Dynamic Cluster, idle resources that do not match job requirements are dynamically repurposed, so that LSF can schedule the pending jobs.
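The difference between fixed and repurposable resources can be illustrated with a small simulation. This is a minimal sketch, not the LSF scheduling algorithm: the `Host`, `Job`, and `schedule` names, the single `os` attribute, and the "take the first idle host" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Host:
    # Hypothetical model of an execution host with one resource attribute.
    name: str
    os: str
    busy: bool = False


@dataclass
class Job:
    # Hypothetical model of a job that requires a specific operating system.
    name: str
    required_os: str
    host: Optional[str] = None


def schedule(jobs, hosts, allow_reprovision):
    """Place each job on an idle host and return the jobs left pending."""
    pending = []
    for job in jobs:
        idle = [h for h in hosts if not h.busy]
        # Prefer an idle host that already matches the job's requirement.
        match = next((h for h in idle if h.os == job.required_os), None)
        if match is None and allow_reprovision and idle:
            # Repurpose an idle host that does not match (illustrative only).
            match = idle[0]
            match.os = job.required_os
        if match is None:
            pending.append(job)
            continue
        match.busy = True
        job.host = match.name
    return pending


def make_hosts():
    return [Host("hostA", "linux"), Host("hostB", "windows")]


def make_jobs():
    return [Job("job1", "linux"), Job("job2", "linux")]


fixed = schedule(make_jobs(), make_hosts(), allow_reprovision=False)
dynamic = schedule(make_jobs(), make_hosts(), allow_reprovision=True)
print([j.name for j in fixed])    # ['job2']
print([j.name for j in dynamic])  # []
```

With fixed attributes, `job2` stays pending while `hostB` sits idle. When repurposing is allowed, `hostB` is switched to the required operating system and both jobs are placed.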
Dynamic Cluster can provision the machine type that is most appropriate for the workload:
The memory and CPU allocations of a VM can be modified when the VM is powered on.
Dynamic Cluster hosts in the cluster are flexible resources. If workload requirements are constantly changing, and different types of workload require different execution environments, Dynamic Cluster can dynamically provision infrastructure according to workload needs (OS, memory, CPUs).
With Dynamic Cluster, you can keep hardware and license utilization high without affecting the service level for high-priority workload. Instead of reserving important resources for critical workload, you can use those resources to run low-priority workload, and then preempt those jobs when a high-priority job arrives.
Migration is driven by workload priority.
Without Dynamic Cluster, if an LSF job runs on a physical machine, the job is not mobile. If a low-priority job is preempted, it must be terminated and rescheduled. If preemption occurs frequently, the job may be started and restarted several times, using up valuable resources without ever completing.
With Dynamic Cluster, the low-priority job can run on a VM, and if the job is preempted, the VM and job can be saved. When the VM is restored, the job continues and eventually finishes without wasting any resources.
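The cost of repeated preemption can be shown with simple arithmetic. The following is a minimal sketch under stated assumptions: `total_compute_used` is a hypothetical helper, work is measured in abstract units, and the cost of saving and restoring a VM is ignored.

```python
def total_compute_used(work_needed, run_slices, resumable):
    """Return (compute units consumed, finished) for a low-priority job.

    run_slices lists how long the job may run before each preemption.
    resumable=True models a saved and restored VM (progress is kept);
    resumable=False models a terminated job that restarts from scratch.
    """
    progress = 0
    consumed = 0
    for slice_len in run_slices:
        if not resumable:
            progress = 0  # a terminated job loses all earlier progress
        step = min(slice_len, work_needed - progress)
        progress += step
        consumed += step
        if progress >= work_needed:
            return consumed, True
    return consumed, False


# A job that needs 10 units of work but is preempted after every 4 units.
print(total_compute_used(10, [4, 4, 4], resumable=False))  # (12, False)
print(total_compute_used(10, [4, 4, 4], resumable=True))   # (10, True)
```

In this example the terminated job consumes 12 units and never completes, while the saved and restored job completes after consuming exactly the 10 units it needs.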
Running workload is packed onto hypervisors so that the smallest possible number of hypervisors is used. This maximizes availability for new jobs, and minimizes the need for migration.
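The source does not say which packing method Dynamic Cluster uses. As an illustration of the general idea only, the sketch below applies first-fit decreasing, a common bin-packing heuristic that usually, but not always, finds the minimum number of bins. The `pack_vms` function, the memory-only capacity model, and the VM names are assumptions.

```python
def pack_vms(vm_memory_gb, hypervisor_capacity_gb):
    """Assign VMs to hypervisors using first-fit decreasing (illustrative).

    vm_memory_gb maps a VM name to its memory requirement in GB.
    Returns (placement, hypervisors_used), where placement maps each
    VM name to the index of the hypervisor it was placed on.
    """
    free = []       # remaining memory on each hypervisor already in use
    placement = {}  # VM name -> hypervisor index
    # Consider the largest VMs first so that small VMs fill the gaps.
    ordered = sorted(vm_memory_gb.items(), key=lambda kv: kv[1], reverse=True)
    for name, need in ordered:
        if need > hypervisor_capacity_gb:
            raise ValueError(f"{name} does not fit on any hypervisor")
        for idx, remaining in enumerate(free):
            if need <= remaining:
                free[idx] -= need
                placement[name] = idx
                break
        else:
            # No hypervisor in use has room, so start using another one.
            free.append(hypervisor_capacity_gb - need)
            placement[name] = len(free) - 1
    return placement, len(free)


vms = {"vm1": 8, "vm2": 4, "vm3": 4, "vm4": 8, "vm5": 8}
placement, used = pack_vms(vms, hypervisor_capacity_gb=16)
print(used)  # 2
```

Here five VMs that total 32 GB fit on two 16 GB hypervisors, leaving any remaining hypervisors completely free for new jobs.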
Users of HPC applications cannot always predict the memory or CPU usage of a job. Without Dynamic Cluster, a job might unexpectedly use more resources than it asked for and interfere with other workload running on the execution host.
When Dynamic Cluster jobs run on VMs, one physical host can run many jobs, and each job is isolated in its own environment. For example, a job that runs out of memory and fails will not interfere with other jobs running on the same host.
The post-provisioning script executes only the first time the virtual machine is powered on. It does not execute on subsequent boots.
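The run-once behavior can be modeled as follows. This is a minimal sketch of the behavior described above, not how Dynamic Cluster implements it: the `VirtualMachine` class, the `_provisioned` flag, and the callback interface are hypothetical.

```python
class VirtualMachine:
    """Hypothetical VM model that runs a post-provisioning hook only once."""

    def __init__(self, name, post_provision):
        self.name = name
        self._post_provision = post_provision  # callable taking the VM name
        self._provisioned = False              # assumed run-once marker
        self.boot_count = 0

    def power_on(self):
        self.boot_count += 1
        if not self._provisioned:
            # First power-on: run the post-provisioning step and record it.
            self._post_provision(self.name)
            self._provisioned = True


calls = []
vm = VirtualMachine("vm1", calls.append)
for _ in range(3):
    vm.power_on()
print(vm.boot_count, calls)  # 3 ['vm1']
```

One consequence of this behavior is that any setup that must happen on every boot does not belong in the post-provisioning script.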