Question 1

Single node jobscript

Accepted Answer

A simple script to run torch on a single node is the following:

#!/bin/sh ### General options ### -- specify queue -- #BSUB -q gpuv100 ### -- set the job Name -- #BSUB -J torch_single_node ### -- ask for number of cores (default: 1) -- #BSUB -n 4 ### -- specify that the cores MUST BE on a single host -- #BSUB -R \"span[hosts=1]\" ### -- Select the resources: 1 gpu in exclusive process mode -- #BSUB -gpu \"num=1:mode=exclusive_process\" ### -- set walltime limit: hh:mm -- maximum 24 hours for GPU-queues right now #BSUB -W 1:00 # request 5GB of system-memory #BSUB -R \"rusage[mem=5GB]\" ### -- set the email address -- # please uncomment the following line and put in your e-mail address, # if you want to receive e-mail notifications on a non-default address ##BSUB -u your_email_address ### -- send notification at start -- #BSUB -B ### -- send notification at completion-- #BSUB -N ### -- Specify the output and error file. %J is the job-id -- ### -- -o and -e mean append, -oo and -eo mean overwrite -- #BSUB -o torch_single_%J.out #BSUB -e torch_single_%J.err # -- end of LSF options -- #here load the modules, and activate the environment if needed #module load python3/... #source path-to-myenv/bin/activate # here call torchrun torchrun --standalone --nproc_per_node=1 your_python_script.py ...

The script asks for 4 cores (#BSUB -n 4) on a single node (#BSUB -R \"span[hosts=1]\") on the queue gpuv100 (#BSUB -q gpuv100), and 1 single GPU (#BSUB -gpu \"num=1:mode=exclusive_process\").

In the torch command line, the option --nproc_per_node=1 tells pytorch how many GPUs are available, so it must correspond to the number of GPU requested.

For all the other #BSUB options, please refer to the batch job page.

Question 2

Multiple nodes jobscript

Accepted Answer

A simple script to run torch on a two nodes is the following:

#!/bin/sh ### General options ### –- specify queue -- #BSUB -q gpuv100 ### -- set the job Name -- #BSUB -J torch_two_nodes ### -- ask for number of cores (default: 1) -- #BSUB -n 8 ### -- specify how many cores on each node -- #BSUB -R \"span[ptile=4]\" ### -- Select the resources: 2 gpu in exclusive process mode on each node -- #BSUB -gpu \"num=2:mode=exclusive_process\" ### -- set walltime limit: hh:mm -- maximum 24 hours for GPU-queues right now #BSUB -W 1:00 # request 5GB of system-memory #BSUB -R \"rusage[mem=5GB]\" ### -- set the email address -- # please uncomment the following line and put in your e-mail address, # if you want to receive e-mail notifications on a non-default address ##BSUB -u your_email_address ### -- send notification at start -- #BSUB -B ### -- send notification at completion-- #BSUB -N ### -- Specify the output and error file. %J is the job-id -- ### -- -o and -e mean append, -oo and -eo mean overwrite -- #BSUB -o torch_multiple_%J.out #BSUB -e torch_multiple_%J.err # -- end of LSF options -- # Get the list of nodes-addresses List=$(cat $LSB_DJOB_HOSTFILE | uniq )
# Set the port for the rendezvous protocol PORT=29400
# Here load the modules, and activate the environment if needed
# Change to your own modules, and uncomment
#module load python3/... #source path-to-myenv/bin/activate
# Here is the call to torchrun blaunch -z \"$List\" torchrun --nnodes=2 --nproc_per_node=2 --rdzv_id=$LSB_JOBID --rdzv_backend=c10d --rdzv_endpoint=$HOSTNAME:$PORT your_python_script.py ...

The script asks for 8 cores (#BSUB -n 8), 4 on each node (#BSUB -R \"span[ptile=4]\") on the queue gpuv100 (#BSUB -q gpuv100), and 2 GPUs (#BSUB -gpu \"num=2:mode=exclusive_process\") on each node, in total 4 GPUs.

Compared to the single node script, before calling torchrun one has to set the port variable to a port that is used by the processes to communicate across the different nodes. It must be an unused port. 29400 is the default value.

In the torch command line, the option

--nnodes=2 tells pytorch that 2 nodes (machines) will be used, and

--nproc_per_node=2 tells pytorch how many GPUs are available, so it must correspond to the number of GPU requested.

-rdzv_id=$LSB_JOBID --rdzv_backend=c10d --rdzv_endpoint=$HOSTNAME:$PORT

are settings for the communication protocol. Leave everything like this. The only thing that you have to care about is the PORT environment variable, that is set the PORT variable above.

pytorch 1.x-2.x

pytorch jobscripts

Single node jobscript

Multiple nodes jobscript

Best practice and troubleshooting