Using LSF with FLUENT

LSF is integrated with FLUENT products from ANSYS Inc., allowing FLUENT jobs to take advantage of the checkpointing and migration features provided by LSF. A long-running FLUENT job can therefore resume from its last checkpoint instead of repeating the entire run. FLUENT 5 offers versions based on system vendors' parallel environments (usually MPI, using the VMPI version of FLUENT 5). Fluent also provides a parallel version of FLUENT 5 based on its own socket-based message passing library (the NET version). This chapter assumes you are already familiar with using FLUENT software and checkpointing jobs in LSF.

Configuring LSF for FLUENT

Before you configure LSF for FLUENT, make sure that the following requirements are met:

LSF HPC features must be enabled.
FLUENT 5 or higher, available from ANSYS Inc., must be installed.
(Optional) To use the "vmpi" version of FLUENT 5, a hardware vendor-supplied MPI environment for network computing must be installed.

During installation, lsfinstall adds the Boolean resource fluent to the Resource section of lsf.shared.

LSF also installs the echkpnt.fluent and erestart.fluent files in LSF_SERVERDIR.

If only some of your hosts can accept FLUENT jobs, configure the Host section of lsf.cluster.cluster_name to identify those hosts.

Edit the LSF_ENVDIR/conf/lsf.cluster.cluster_name file and add the fluent resource to the hosts that can run FLUENT jobs:

Begin Host
HOSTNAME    model   type    server  r1m   mem   swp   RESOURCES
...
hostA       !       !       1       3.5   ()    ()    ()
hostB       !       !       1       3.5   ()    ()    (fluent)
hostC       !       !       1       3.5   ()    ()    ()
...
End Host
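
After editing the Host section, it can be useful to confirm which hosts actually list the fluent resource. The following is a minimal sketch that scans a file laid out like the example above, where RESOURCES is the last, parenthesized column. The function name hosts_with_resource is hypothetical and is not part of LSF.

```shell
#!/bin/sh
# hosts_with_resource FILE RESOURCE
# Hypothetical helper (not an LSF command): prints each host name in the
# Host section of FILE whose parenthesized RESOURCES column lists RESOURCE.
hosts_with_resource() {
    awk -v res="$2" '
        $1 == "Begin" && $2 == "Host" { in_host = 1; next }
        $1 == "End"   && $2 == "Host" { in_host = 0; next }
        # Skip lines outside the section, the header row, comments,
        # elided rows, and rows without a parenthesized column.
        !in_host || NF < 2 || $1 == "HOSTNAME" || $1 ~ /^#/ || $0 !~ /\(.*\)/ { next }
        {
            # Keep only the text inside the last pair of parentheses.
            list = $0
            sub(/^.*\(/, "", list)
            sub(/\).*$/, "", list)
            n = split(list, names, " ")
            for (i = 1; i <= n; i++)
                if (names[i] == res) { print $1; break }
        }
    ' "$1"
}

# Example call (the path is the file named in the text above):
#   hosts_with_resource "$LSF_ENVDIR/conf/lsf.cluster.cluster_name" fluent
```

For the Host section shown above, the function would print hostB only.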

Checkpointing in FLUENT

FLUENT 5 is integrated with LSF to use the LSF checkpointing capability. At the end of each iteration, FLUENT looks for the existence of a checkpoint file (check) or a checkpoint exit file (exit). If it detects the checkpoint file, it writes a case and data file, removes the checkpoint file, and continues iterating. If it detects a checkpoint exit file, it writes a case and data file, then exits.

Use the bchkpnt command to create the checkpoint and checkpoint exit files, which force FLUENT to checkpoint, or to checkpoint and then exit. FLUENT also creates a journal file with instructions to read the checkpointed case and data files, and continue iterating. FLUENT uses this file when it is restarted with the brestart command.

LSF installs echkpnt.fluent and erestart.fluent, which are special versions of echkpnt and erestart to allow checkpointing with FLUENT. Use bsub -a fluent to make sure your job uses these files.

When you submit a checkpointing job, you specify a checkpoint directory. Before the job starts running, LSF sets the environment variable LSB_CHKPNT_DIR. The value of LSB_CHKPNT_DIR is a subdirectory of the checkpoint directory specified in the command line. This subdirectory is identified by the job ID and only contains files related to the submitted job.

When you checkpoint a FLUENT job, LSF creates a checkpoint trigger file (check) in the job subdirectory, which causes FLUENT to checkpoint and continue running. The -k option of bchkpnt creates a different trigger file (exit) instead, which causes FLUENT to checkpoint and exit the job.

FLUENT uses the LSB_CHKPNT_DIR environment variable to determine the location of checkpoint trigger files. It checks the job subdirectory periodically while running the job. FLUENT does not perform any checkpointing unless it finds the LSF trigger file in the job subdirectory. FLUENT removes the trigger file after checkpointing the job.
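
The trigger-file handshake described above can be illustrated with a small simulation. This is only a sketch of the protocol as this chapter describes it, not FLUENT's actual implementation. The function names poll_triggers, save_state, and run_iterations, and the state.saved file, are hypothetical stand-ins.

```shell
#!/bin/sh
# save_state DIR: stand-in for writing the case and data files.
save_state() {
    date > "$1/state.saved"
}

# poll_triggers DIR: called once at the end of each simulated iteration.
#   exit file present  -> save state, remove the trigger, return 1 (stop)
#   check file present -> save state, remove the trigger, return 0 (continue)
#   no trigger file    -> do nothing, return 0 (continue)
poll_triggers() {
    trig_dir=$1
    if [ -e "$trig_dir/exit" ]; then
        save_state "$trig_dir"
        rm -f "$trig_dir/exit"
        return 1
    fi
    if [ -e "$trig_dir/check" ]; then
        save_state "$trig_dir"
        rm -f "$trig_dir/check"
    fi
    return 0
}

# run_iterations DIR MAX: simulated solver loop that polls DIR after
# every iteration, the way a job would poll $LSB_CHKPNT_DIR.
run_iterations() {
    run_dir=$1
    max=$2
    i=0
    while [ "$i" -lt "$max" ]; do
        i=$((i + 1))
        # ... one solver iteration would run here ...
        poll_triggers "$run_dir" || { echo "stopped after iteration $i"; return 0; }
    done
    echo "completed $max iterations"
}
```

Creating a file named check in the polled directory makes the loop save its state and keep going, while a file named exit makes it save its state and stop. Removing the trigger file is what signals that the checkpoint has been taken.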

If a job is restarted, LSF attempts to restart the job with the -restart option appended to the original FLUENT command. FLUENT uses the checkpointed data and case files to restart the process from that checkpoint, rather than repeating the entire process. Each time a job is restarted, it is assigned a new job ID, and a new job subdirectory is created in the checkpoint directory. Files in the checkpoint directory are never deleted by LSF, but you may choose to remove old files once the FLUENT job is finished and the job history is no longer required.
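
Because every restart adds another job subdirectory, a checkpoint directory can accumulate many of them. The following sketch lists the older job subdirectories so that they can be reviewed before removal. It assumes that job subdirectories have all-numeric names and that a higher job ID means a more recent restart, which may not hold if job IDs wrap around. The function name old_job_dirs is hypothetical, and the function only prints paths; it deletes nothing.

```shell
#!/bin/sh
# old_job_dirs CHKPNT_DIR
# Hypothetical helper: prints every all-numeric subdirectory of CHKPNT_DIR
# except the one with the highest number (assumed to be the latest job).
old_job_dirs() {
    for d in "$1"/*/; do
        [ -d "$d" ] || continue
        name=$(basename "$d")
        case $name in
            ''|*[!0-9]*) continue ;;   # skip names that are not job IDs
        esac
        echo "$name"
    done | sort -n | sed '$d' | while read -r id; do
        # sed has already dropped the highest ID; print the rest as paths.
        echo "$1/$id"
    done
}

# Example: review the list first, then remove the directories by hand.
#   old_job_dirs /home/username
```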

Submitting FLUENT jobs

Use bsub to submit the job, including parameters required for checkpointing. The syntax for the bsub command to submit a FLUENT job is:

bsub [-R fluent] -a fluent [-k checkpoint_dir | -k "checkpoint_dir [checkpoint_period]"] [bsub options] FLUENT command [FLUENT options] -lsf

Where:

-R fluent: Optional. Specify the fluent shared resource if the FLUENT application is only installed on certain hosts in the cluster.
-a fluent: Use the esub for FLUENT jobs, which automatically sets the checkpoint method to fluent to use the checkpoint and restart programs for FLUENT jobs, echkpnt.fluent and erestart.fluent.
-k checkpoint_dir: Regular option to bsub that specifies the name of the checkpoint directory.
checkpoint_period: Regular option to bsub that specifies the interval, in minutes, at which LSF automatically checkpoints the job.
FLUENT command: Regular command used with FLUENT software.
-lsf: Special option to the FLUENT command. Specifies that FLUENT is running under LSF, and causes FLUENT to check for trigger files in the checkpoint directory if the environment variable LSB_CHKPNT_DIR is set.
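
To show how these pieces fit together, the sketch below assembles a submission command line from a checkpoint directory, an optional checkpoint period, and the FLUENT command. The function name fluent_bsub_cmd is hypothetical. It only prints the command rather than running it, and it leaves out -R fluent and other bsub options for brevity.

```shell
#!/bin/sh
# fluent_bsub_cmd CHKPNT_DIR CHKPNT_PERIOD FLUENT_COMMAND [FLUENT_OPTIONS...]
# Hypothetical helper: prints (does not run) a bsub command line.
# Pass "" for CHKPNT_DIR to omit -k, or "" for CHKPNT_PERIOD to omit the period.
fluent_bsub_cmd() {
    chk_dir=$1
    chk_period=$2
    shift 2
    cmd="bsub -a fluent"
    if [ -n "$chk_dir" ] && [ -n "$chk_period" ]; then
        # Directory and period share one quoted -k argument.
        cmd="$cmd -k \"$chk_dir $chk_period\""
    elif [ -n "$chk_dir" ]; then
        cmd="$cmd -k $chk_dir"
    fi
    # -lsf always goes last, after the FLUENT command and its options.
    printf '%s\n' "$cmd $* -lsf"
}

# Prints: bsub -a fluent -k "/home/username 60" fluent 3d -g -i journal_file -lsf
fluent_bsub_cmd /home/username 60 fluent 3d -g -i journal_file
```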

For example, to submit a sequential FLUENT batch job:

% bsub -a fluent fluent 3d -g -i journal_file -lsf

To submit a parallel FLUENT NET version batch job on 4 CPUs:

% bsub -a fluent -n 4 fluent 3d -t0 -pnet -g -i journal_file -lsf

Checkpointing, restarting and migrating FLUENT jobs

The syntax for checkpointing is:

bchkpnt [bchkpnt_options] [-k] [job_ID]

Where:

-k: Specifies checkpoint and exit. The job will be killed immediately after being checkpointed. When the job is restarted, it continues from the last checkpoint.
job_ID: The job ID of the FLUENT job. Specifies which job to checkpoint. Each time the job is migrated, the job is restarted and assigned a new job ID.

The syntax for restarting is:

brestart [brestart options] checkpoint_directory [job_ID]

Where job_ID is the job ID of the FLUENT job and specifies which job to restart. At this point, the restarted job is assigned a new job ID, and the new job ID is used for checkpointing. The job ID changes each time the job is restarted.

The syntax for migrating is:

bmig [bmig_options] [job_ID]

Where job_ID is the job ID of the FLUENT job and specifies which job to migrate. A migrated job is checkpointed and then restarted, so it is assigned a new job ID, and the new job ID is used for subsequent checkpointing.
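
Because the job ID changes on every restart, scripts that checkpoint a job repeatedly need to pick up the new ID each time. The following is a minimal sketch, assuming that the submission command reports the new job with the usual "Job <ID> is submitted to ..." message on standard output. The function name job_id_from_output is hypothetical.

```shell
#!/bin/sh
# job_id_from_output: reads command output on stdin and prints the job ID
# from the first "Job <ID> ..." line. Prints nothing if no such line exists.
job_id_from_output() {
    sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p' | head -n 1
}

# Example (requires LSF, so it is left commented out here):
#   new_id=$(brestart /home/username/job_ID | job_id_from_output)
#   bchkpnt "$new_id"
```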

Examples

For a sequential FLUENT batch job with checkpoint and restart:

% bsub -a fluent -k "/home/username 60" fluent 3d -g -i journal_file -lsf

Submits a job that uses the checkpoint/restart method echkpnt.fluent and erestart.fluent, /home/username as the checkpoint directory, and a 60-minute interval between automatic checkpoints. FLUENT checks if there is a checkpoint trigger file /home/username/exit or /home/username/check.

% bchkpnt job_ID

echkpnt creates the checkpoint trigger file /home/username/check and waits until the file is removed and the checkpoint is successful. FLUENT writes a case and data file, and a restart journal file at the end of its current iteration. The files are saved in /home/username/job_ID and FLUENT continues to iterate. Use bjobs to verify that the job is still running after checkpoint.

% bchkpnt -k job_ID

echkpnt creates the checkpoint trigger file /home/username/exit and waits until the file is removed and the checkpoint is successful. FLUENT writes a case and data file, and a restart journal file at the end of its current iteration. The files are saved in /home/username/job_ID and FLUENT exits. Use bjobs to verify that the job is not running after checkpoint.

% brestart /home/username/job_ID

Starts a FLUENT job using the latest case and data files in /home/username/job_ID. The restart journal file /home/username/job_ID/#restart.inp is used to instruct FLUENT to read the latest case and data files and continue iterating.

For a parallel FLUENT VMPI version batch job with checkpoint and restart on 4 CPUs:

% bsub -a fluent -k "/home/username 60" -n 4 fluent 3d -t4 -pvmpi -g -i journal_file -lsf

% bchkpnt -k job_ID

Forces FLUENT to write a case and data file, and a restart journal file at the end of its current iteration. The files are saved in /home/username/job_ID and FLUENT exits.

% brestart /home/username/job_ID

Starts a FLUENT job using the latest case and data files in /home/username/job_ID. The restart journal file /home/username/job_ID/#restart.inp is used to instruct FLUENT to read the latest case and data files and continue iterating.

The parallel job is restarted using the same number of processors (4) requested in the original bsub submission.

% bmig -m hostA 0

All jobs on hostA are checkpointed and moved to another host.