Using LSF with LSTC LS-DYNA

LSF is integrated with products from Livermore Software Technology Corporation (LSTC). LS-DYNA jobs can use the checkpoint and restart features of LSF and take advantage of both SMP and distributed MPP parallel computation. To submit LS-DYNA jobs through LSF, you only need to make sure that your jobs are checkpointable.

Configuring LSF for LS-DYNA jobs

Before you configure LSF for LS-DYNA jobs, make sure that the following requirements are met:

  • LSF HPC features must be enabled.

  • LS-DYNA version 960 and higher, available from LSTC, must be installed.

  • Optional: Hardware vendor-supplied MPI environment for network computing.

  • Optional: LSF MPI integration.

During installation, lsfinstall adds the Boolean resource ls_dyna to the Resource section of lsf.shared.

LSF also installs the echkpnt.ls_dyna and erestart.ls_dyna files in LSF_SERVERDIR.

If only some of your hosts can accept LS-DYNA jobs, configure the Host section of lsf.cluster.cluster_name to identify those hosts.

Edit the LSF_ENVDIR/conf/lsf.cluster.cluster_name file and add the ls_dyna resource to the hosts that can run LS-DYNA jobs:

Begin Host
HOSTNAME    model   type  server  r1m   mem   swp  RESOURCES
...
hostA       !       !       1     3.5   ()    ()   ()
hostB       !       !       1     3.5   ()    ()   (ls_dyna)
hostC       !       !       1     3.5   ()    ()   ()
...
End Host
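
After editing the Host section, it can help to confirm which hosts carry the ls_dyna resource. The following is a minimal sketch and not part of LSF: the function name list_ls_dyna_hosts is hypothetical, and it assumes the Host section has the layout shown above, with the resource list in parentheses.

```shell
#!/bin/sh
# Hypothetical helper: print the hosts whose RESOURCES column lists ls_dyna.
# Assumes the Host section layout shown above.
# $1: path to the lsf.cluster.cluster_name file
list_ls_dyna_hosts() {
    awk '
        $1 == "Begin" && $2 == "Host" { in_host = 1; next }
        $1 == "End"   && $2 == "Host" { in_host = 0 }
        # Match ls_dyna as a whole word inside a parenthesized list.
        in_host && $1 != "HOSTNAME" && $0 ~ /[( \t]ls_dyna[) \t]/ { print $1 }
    ' "$1"
}
```

For the example Host section above, calling the function on the cluster file would print hostB only.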

LS-DYNA integration with LSF checkpointing

LS-DYNA is integrated with LSF to use the LSF checkpointing capability. It uses application-level checkpointing, working with the functionality implemented by LS-DYNA. At the end of each time step, LS-DYNA checks for a checkpoint trigger file named D3KIL. LS-DYNA jobs always exit with 0, even when checkpointing, so LSF reports that the job has finished when it has checkpointed.

Use the bchkpnt command to create the checkpoint trigger file, D3KIL, which LS-DYNA reads. The file forces LS-DYNA either to checkpoint and continue running, or to checkpoint and exit. The D3KIL file and the checkpoint information that LSF writes to the checkpoint directory specified for the job are all that LSF needs to restart the job.

Checkpointing and resource tracking are supported for SMP jobs:

  • LSF installs echkpnt.ls_dyna and erestart.ls_dyna, which are special versions of echkpnt and erestart that allow checkpointing with LS-DYNA. Use bsub -a ls_dyna to make sure your job uses these files. The method name ls_dyna invokes the esub for LS-DYNA jobs, which sets the checkpointing method LSB_ECHKPNT_METHOD="ls_dyna" so that echkpnt.ls_dyna and erestart.ls_dyna are used.

  • When you submit a checkpointing job, you specify a checkpoint directory. Before the job starts running, LSF sets the environment variable LSB_CHKPNT_DIR to a subdirectory of the checkpoint directory specified in the command line, or the CHKPNT parameter in lsb.queues. This subdirectory is identified by the job ID and only contains files related to the submitted job.

    For checkpointing to work when running an LS-DYNA job from LSF, you must cd to the directory that LSF sets in $LSB_CHKPNT_DIR after submitting LS-DYNA jobs. You must change to this directory whether you submit a single job or multiple jobs. LS-DYNA puts all its output files in this directory.

  • When you checkpoint a job, LSF creates a checkpoint trigger file named D3KIL in the working directory of the job. The D3KIL file contains an entry depending on the desired checkpoint outcome:

    • sw1. causes the job to checkpoint and exit. LS-DYNA writes to a restart data file d3dump and exits.

    • sw3. causes the job to checkpoint and continue running. LS-DYNA writes to a restart data file d3dump and continues running until the next checkpoint.

    The other possible LS-DYNA switch parameters are not relevant to LSF checkpointing. LS-DYNA does not remove the D3KIL trigger file after checkpointing the job.

  • If a job is restarted, LSF attempts to restart the job with the -r restart_file option, which replaces any existing -i or -r options in the original LS-DYNA command. LS-DYNA uses the checkpointed data to resume the process from that checkpoint, rather than starting the entire job from the beginning.

    Each time a job is restarted, it is assigned a new job ID, and a new job subdirectory is created in the checkpoint directory. Files in the checkpoint directory are never deleted by LSF, but you may choose to remove old files once the LS-DYNA job is finished and the job history is no longer required.
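
The trigger-file convention described above can be illustrated with a small sketch. This is not the code of echkpnt.ls_dyna: the function name write_d3kil and its arguments are hypothetical, and the sketch assumes that the entry is written to the file exactly as it appears in the list above (sw1. or sw3.).

```shell
#!/bin/sh
# Hypothetical illustration of the D3KIL convention; not the real echkpnt.ls_dyna.
# $1: working directory of the job
# $2: "exit" to checkpoint and exit; anything else checkpoints and continues
write_d3kil() {
    workdir=$1
    mode=${2:-continue}
    if [ "$mode" = "exit" ]; then
        entry="sw1."   # checkpoint and exit (as with bchkpnt -k)
    else
        entry="sw3."   # checkpoint and continue running
    fi
    printf '%s\n' "$entry" > "$workdir/D3KIL"
}
```

In normal use you do not create this file yourself; bchkpnt triggers it through the ls_dyna checkpoint method.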

Submitting LS-DYNA jobs

To submit LS-DYNA jobs, redirect a job script to the standard input of bsub, including the parameters required for checkpointing. With job scripts, you can manage two limitations of LS-DYNA job submissions:

  • When LS-DYNA jobs are restarted from a checkpoint, the job will use the checkpoint environment instead of the job submission environment. You can restore your job submission environment if you submit your job with a job script that includes your environment settings.

  • LS-DYNA jobs must run in the directory that LSF sets in the LSB_CHKPNT_DIR environment variable. This lets you submit multiple LS-DYNA jobs from the same directory but is also required if you are submitting one job. If you submit a job from a different directory, you must change to the $LSB_CHKPNT_DIR directory. You can do this if you submit your jobs with a job script.

    Whether you run a single job or multiple jobs, all LS-DYNA jobs must run in the $LSB_CHKPNT_DIR directory.

To submit LS-DYNA jobs with job submission scripts, embed the LS-DYNA job in the job script. Use the following format to run the script:

% bsub < jobscript

Inside your job scripts, the syntax for the bsub command to submit an LS-DYNA job is either of the following:

  • [-R ls_dyna] -k "checkpoint_dir method=ls_dyna" | -k "checkpoint_dir [checkpoint_period] method=ls_dyna" [bsub_options] LS_DYNA_command [LS_DYNA_options]

    Or:

    [-R ls_dyna] -a ls_dyna -k "checkpoint_dir" | -k "checkpoint_dir [checkpoint_period]" [bsub_options] LS_DYNA_command [LS_DYNA_options]

  • -R ls_dyna: Optional. Specify the ls_dyna shared resource if the LS-DYNA application is only installed on certain hosts in the cluster.

  • method=ls_dyna: Mandatory. Use the esub for LS-DYNA jobs, which automatically sets the checkpoint method to ls_dyna to use the checkpoint and restart programs echkpnt.ls_dyna and erestart.ls_dyna. Alternatively, use bsub -a to specify the ls_dyna esub.

The checkpointing feature for LS-DYNA jobs uses the following parameters:

  • -k checkpoint_dir: Mandatory. Regular option to bsub that specifies the name of the checkpoint directory. Specify the ls_dyna method here if you do not use the bsub -a option.

  • checkpoint_period: Optional. Regular option to bsub that specifies the interval, in minutes, at which LSF automatically checkpoints jobs.

  • LS_DYNA_command: Regular LS-DYNA software command and options.
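
As a sketch of how the two syntax forms relate, the helper below assembles the argument for -k. The function name build_k_arg, the directory /share/chkpnt, and the 60-minute period are hypothetical; only the option syntax comes from the description above.

```shell
#!/bin/sh
# Hypothetical helper: build the argument for bsub -k.
# $1: checkpoint directory
# $2: checkpoint period in minutes, or empty for none
# $3: "method" to append method=ls_dyna (use when bsub -a ls_dyna is not given)
build_k_arg() {
    arg=$1
    [ -n "${2:-}" ] && arg="$arg $2"
    [ "${3:-}" = "method" ] && arg="$arg method=ls_dyna"
    printf '%s\n' "$arg"
}

# Form 1: the method is named inside -k.
echo "-R ls_dyna -k \"$(build_k_arg /share/chkpnt 60 method)\" LS_DYNA_command"
# Form 2: -a ls_dyna selects the esub, so -k carries only directory and period.
echo "-R ls_dyna -a ls_dyna -k \"$(build_k_arg /share/chkpnt 60)\" LS_DYNA_command"
```

Both forms select the same checkpoint method; they differ only in where ls_dyna is named.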

Preparing your job scripts

To prepare your job scripts:

  • Specify any environment variables required for your LS-DYNA jobs. For example:

    LS_DYNA_ENV=VAL;export LS_DYNA_ENV

    If you do not set your environment variables in the job script, then you must add some lines to the script to restore environment variables. For example:

    if [ -f "$LSB_CHKPNT_DIR/.envdump" ]; then
        . "$LSB_CHKPNT_DIR/.envdump"
    fi

  • Ensure that your jobs run in the checkpoint directory set by LSF, by adding the following line after your bsub commands:

    cd $LSB_CHKPNT_DIR

  • Write the LS-DYNA command you want to run. For example:

    /usr/share/ls_dyna_path/ls960 endtime=2 i=/usr/share/ls_dyna_path/airbag.deploy.k ncpu=1
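
Putting these steps together, a job script might look like the following sketch. The checkpoint directory /share/chkpnt, the 60-minute period, and the file name jobscript are assumptions for illustration; the environment variable and the LS-DYNA command are the examples used above. The snippet writes the script to a file so that it can be inspected before it is submitted.

```shell
#!/bin/sh
# Write an example LS-DYNA job script; directory and period are illustrative only.
cat > jobscript <<'EOF'
#!/bin/sh
#BSUB -k "/share/chkpnt 60 method=ls_dyna"
#BSUB -R ls_dyna

# Environment needed by the job, set in the script so it is available after a restart.
LS_DYNA_ENV=VAL; export LS_DYNA_ENV

# Run in the per-job checkpoint subdirectory that LSF sets.
cd "$LSB_CHKPNT_DIR" || exit 1

/usr/share/ls_dyna_path/ls960 endtime=2 i=/usr/share/ls_dyna_path/airbag.deploy.k ncpu=1
EOF
```

The resulting file would then be submitted with bsub < jobscript.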

Checkpointing, restarting, and migrating LS-DYNA jobs

  • The syntax for checkpointing is:

    bchkpnt [bchkpnt_options] [-k] [job_ID]

    Where:

    • -k specifies checkpoint and exit. The job will be killed immediately after being checkpointed. When the job is restarted, it continues from the last checkpoint.

    • job_ID is the job ID of the LS-DYNA job and specifies which job to checkpoint. Each time the job is migrated, it is restarted and assigned a new job ID.

  • The syntax for restarting is:

    brestart [brestart_options] checkpoint_directory [job_ID]

    Where:

    • checkpoint_directory specifies the checkpoint directory, where the job subdirectory is located. Each job is run in a unique directory. To change to the checkpoint directory so that LSF can restart a job, place the following line in your job script before the LS-DYNA command is called: cd $LSB_CHKPNT_DIR

    • job_ID is the job ID of the LS-DYNA job and specifies which job to restart. Each time the job is restarted, it is assigned a new job ID, and the new job ID is used for subsequent checkpointing.

  • The syntax for migrating is:

    bmig [bmig_options] [job_ID]

    Where:

    • job_ID is the job ID of the LS-DYNA job and specifies which job to migrate. Each time the job is migrated, it is restarted and assigned a new job ID, and the new job ID is used for subsequent checkpointing.
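
Because the job ID changes on every restart or migration, scripts that drive these commands usually need to capture the new ID. The sketch below assumes that brestart reports the new job in the same "Job <ID> is submitted to queue <name>." form that bsub prints; the function name job_id_from_message, the directory /share/chkpnt, and the sample IDs are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: extract the job ID from a submission message read on stdin.
# Assumes the message has the form "Job <1234> is submitted to queue <normal>."
job_id_from_message() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Illustrative sequence (requires a running LSF cluster, so shown as comments):
#   bchkpnt -k 1234        # checkpoint and exit job 1234
#   new_id=$(brestart /share/chkpnt 1234 | job_id_from_message)
#   bchkpnt "$new_id"      # later checkpoints use the new job ID

# Demonstration on a sample message; prints 1235.
echo 'Job <1235> is submitted to queue <normal>.' | job_id_from_message
```

If the message does not match the assumed form, the function prints nothing, so callers should check for an empty result before using it.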