PyTorch is a machine learning framework, based on the Torch library, that has gained a lot of traction in the machine learning community in recent years.
PyTorch is not installed on the cluster, but it can easily be installed in the user's personal directories: HOME or any scratch filesystem.
We do not provide any special installation instructions because PyTorch has no special external dependencies, so the general instructions on the official website work smoothly.
As for any Python package, we recommend installing it in a virtual environment, following our cluster-specific instructions for Python.
After that, just follow the instructions on the official PyTorch website, selecting the platform and tools you prefer to use. For example, for pip select the fields:
PyTorch Build: Stable (XXX)
Your OS: Linux
Package: Pip
Language: Python
Compute Platform: CUDA XX
and use the command that is produced in the Run this Command field.
NOTE: if you get an error message like "PyTorch no longer supports this GPU because it is too old.", it means that the version of PyTorch you installed has not been compiled for the GPU model you are trying to run on. Try reinstalling it, selecting an older CUDA version.
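After installing, a quick check from inside the activated virtual environment can confirm what was installed. The following is only a minimal sketch: the function name torch_report is made up for this example, and on a machine without a GPU (for example a login node) the CUDA lines will report that no device is available even if the installation is fine.

```python
import importlib.util


def torch_report():
    """Return a few lines describing the torch installation, if any."""
    # find_spec avoids an ImportError when torch is not installed.
    if importlib.util.find_spec("torch") is None:
        return ["torch: not installed in this environment"]

    import torch

    return [
        f"torch version: {torch.__version__}",
        # CUDA version the wheel was built for (None for CPU-only builds).
        f"built for CUDA: {torch.version.cuda}",
        f"CUDA available: {torch.cuda.is_available()}",
        f"visible GPUs: {torch.cuda.device_count()}",
    ]


if __name__ == "__main__":
    print("\n".join(torch_report()))
```

Running this inside a GPU job, rather than on a login node, gives the most meaningful answer for the last two lines.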
PyTorch jobscripts
PyTorch can use multiple GPUs, on one or multiple nodes. The number of GPUs that it can see is controlled, as for most programs that rely on CUDA, by the CUDA_VISIBLE_DEVICES environment variable. When running in batch mode, this variable is set automatically, so you do not have to set it manually, even if instructions taken from guides on the web tell you to.
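To see which devices the batch system has made visible to a job, one can inspect the variable from Python before starting any training. This is a small illustrative sketch using only the standard library; the helper name visible_gpu_ids is hypothetical and not part of PyTorch.

```python
import os


def visible_gpu_ids(env=None):
    """Return the device identifiers listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is not set at all, in which case
    CUDA programs would see every GPU on the host.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    # Entries are comma separated; ignore empty fragments.
    return [item.strip() for item in raw.split(",") if item.strip()]


if __name__ == "__main__":
    ids = visible_gpu_ids()
    if ids is None:
        print("CUDA_VISIBLE_DEVICES is not set")
    else:
        print(f"{len(ids)} GPU(s) visible to this job: {ids}")
```

The script only reads the variable; it deliberately does not modify it, since in batch mode the value is set for you.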
Single node jobscript
A simple script to run torch on a single node is the following:
#!/bin/sh
### General options
### -- specify queue --
#BSUB -q gpuv100
### -- set the job Name --
#BSUB -J torch_single_node
### -- ask for number of cores (default: 1) --
#BSUB -n 4
### -- specify that the cores MUST BE on a single host --
#BSUB -R "span[hosts=1]"
### -- Select the resources: 1 gpu in exclusive process mode --
#BSUB -gpu "num=1:mode=exclusive_process"
### -- set walltime limit: hh:mm -- maximum 24 hours for GPU-queues right now
#BSUB -W 1:00
# request 5GB of system-memory
#BSUB -R "rusage[mem=5GB]"
### -- set the email address --
# please uncomment the following line and put in your e-mail address,
# if you want to receive e-mail notifications on a non-default address
##BSUB -u your_email_address
### -- send notification at start --
#BSUB -B
### -- send notification at completion--
#BSUB -N
### -- Specify the output and error file. %J is the job-id --
### -- -o and -e mean append, -oo and -eo mean overwrite --
#BSUB -o torch_single_%J.out
#BSUB -e torch_single_%J.err
# -- end of LSF options --

# here load the modules, and activate the environment if needed
#module load python3/...
#source path-to-myenv/bin/activate

# here call torchrun
torchrun --standalone --nproc_per_node=1 your_python_script.py ...
The script asks for 4 cores (#BSUB -n 4) on a single node (#BSUB -R "span[hosts=1]") on the queue gpuv100 (#BSUB -q gpuv100), and a single GPU (#BSUB -gpu "num=1:mode=exclusive_process").
In the torchrun command line, the option --nproc_per_node=1 tells torchrun how many worker processes to start on the node, normally one per GPU, so it must match the number of GPUs requested.
For all the other #BSUB options, please refer to the batch job page.
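Inside your_python_script.py, each process started by torchrun can find out which worker it is from the environment variables RANK, LOCAL_RANK and WORLD_SIZE, which torchrun sets. The sketch below reads them with the standard library only; the helper name worker_info and the fallback values used when the script is started without torchrun are assumptions made for this example.

```python
import os


def worker_info(env=None):
    """Read the worker identity that torchrun exports to each process."""
    env = os.environ if env is None else env
    return {
        # Global index of this process across all nodes.
        "rank": int(env.get("RANK", "0")),
        # Index of this process on its own node; usually also the GPU index.
        "local_rank": int(env.get("LOCAL_RANK", "0")),
        # Total number of processes in the job.
        "world_size": int(env.get("WORLD_SIZE", "1")),
    }


if __name__ == "__main__":
    info = worker_info()
    # In a real training script one would typically bind the process to
    # its GPU here, for example with torch.cuda.set_device(info["local_rank"]).
    print(f"worker {info['rank']} of {info['world_size']}, "
          f"local rank {info['local_rank']}")
```

With --nproc_per_node=1 there is a single process, so local_rank is 0 and world_size is 1.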
Multiple nodes jobscript
A simple script to run torch on two nodes is the following:
#!/bin/sh
### General options
### -- specify queue --
#BSUB -q gpuv100
### -- set the job Name --
#BSUB -J torch_two_nodes
### -- ask for number of cores (default: 1) --
#BSUB -n 8
### -- specify how many cores on each node --
#BSUB -R "span[ptile=4]"
### -- Select the resources: 2 gpu in exclusive process mode on each node --
#BSUB -gpu "num=2:mode=exclusive_process"
### -- set walltime limit: hh:mm -- maximum 24 hours for GPU-queues right now
#BSUB -W 1:00
# request 5GB of system-memory
#BSUB -R "rusage[mem=5GB]"
### -- set the email address --
# please uncomment the following line and put in your e-mail address,
# if you want to receive e-mail notifications on a non-default address
##BSUB -u your_email_address
### -- send notification at start --
#BSUB -B
### -- send notification at completion--
#BSUB -N
### -- Specify the output and error file. %J is the job-id --
### -- -o and -e mean append, -oo and -eo mean overwrite --
#BSUB -o torch_multiple_%J.out
#BSUB -e torch_multiple_%J.err
# -- end of LSF options --

# Get the list of nodes-addresses
List=$(cat $LSB_DJOB_HOSTFILE | uniq )

# Set the port for the rendezvous protocol
PORT=29400

# Here load the modules, and activate the environment if needed
# Change to your own modules, and uncomment
#module load python3/...
#source path-to-myenv/bin/activate

# Here is the call to torchrun
blaunch -z "$List" torchrun --nnodes=2 --nproc_per_node=2 --rdzv_id=$LSB_JOBID --rdzv_backend=c10d --rdzv_endpoint=$HOSTNAME:$PORT your_python_script.py ...
The script asks for 8 cores (#BSUB -n 8), 4 on each node (#BSUB -R "span[ptile=4]") on the queue gpuv100 (#BSUB -q gpuv100), and 2 GPUs (#BSUB -gpu "num=2:mode=exclusive_process") on each node, for a total of 4 GPUs.
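To double-check that the job really received the layout it asked for, one can count the entries in the file pointed to by LSB_DJOB_HOSTFILE. The sketch below assumes that this file lists one host name per line, repeated once per allocated core, which is consistent with the jobscript piping it through uniq; the function names are hypothetical.

```python
import os
from collections import Counter


def slots_per_host(hostfile_text):
    """Count how many times each host appears in the hostfile contents."""
    hosts = [line.strip() for line in hostfile_text.splitlines() if line.strip()]
    return dict(Counter(hosts))


def layout_matches(hostfile_text, nnodes, cores_per_node):
    """True if exactly nnodes hosts appear, each with cores_per_node entries."""
    counts = slots_per_host(hostfile_text)
    return len(counts) == nnodes and all(
        count == cores_per_node for count in counts.values()
    )


if __name__ == "__main__":
    path = os.environ.get("LSB_DJOB_HOSTFILE")
    if path and os.path.exists(path):
        with open(path) as handle:
            text = handle.read()
        print(slots_per_host(text))
        print("2 nodes x 4 cores:", layout_matches(text, 2, 4))
    else:
        print("LSB_DJOB_HOSTFILE is not available outside a batch job")
```

For the jobscript above, the expected result is two host names with four entries each.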
Compared to the single node script, before calling torchrun one has to set the PORT variable to the port that the processes use to communicate across the different nodes. It must be an unused port. 29400 is the default value.
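Whether a port is unused can be probed by trying to bind a socket to it. The following is a rough sketch with the standard library; the function names are made up for this example, the check only covers the machine it runs on (it would have to run on the node used as rendezvous endpoint), and another program could still take the port between the check and the start of torchrun.

```python
import socket


def is_port_free(port, host=""):
    """Return True if a TCP socket can be bound to (host, port) here."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use".
            return False
    return True


def first_free_port(start=29400, attempts=100):
    """Return the first port in [start, start + attempts) that can be bound."""
    for port in range(start, start + attempts):
        if is_port_free(port):
            return port
    raise RuntimeError(f"no free port found in {start}..{start + attempts - 1}")


if __name__ == "__main__":
    print("29400 free:", is_port_free(29400))
    print("first free port from 29400:", first_free_port())
```

The value printed by first_free_port could then be used for the PORT variable in the jobscript.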
In the torchrun command line, the options
--nnodes=2 tells torchrun that 2 nodes (machines) will be used, and
--nproc_per_node=2 tells torchrun how many worker processes to start on each node, normally one per GPU, so it must match the number of GPUs requested on each node.
--rdzv_id=$LSB_JOBID --rdzv_backend=c10d --rdzv_endpoint=$HOSTNAME:$PORT
are settings for the communication (rendezvous) protocol. Leave them as they are. The only thing that you have to take care of is the PORT variable, which is set earlier in the jobscript.
Best practice and troubleshooting
- Please make sure that your script creates regular checkpoints, so that you do not have to start from scratch if the job is interrupted or one of the GPUs fails during the execution.
- If you request multiple GPUs, make sure that you also tell PyTorch to use all of them.
- Please only use the multiple nodes script if you request all the GPUs on each node.
- If you get an error like CUDA-capable device(s) is/are busy or unavailable, check that the number of GPUs requested is the same as the torchrun --nproc_per_node option. If this is the case, then request the GPU with the syntax #BSUB -gpu "num=X:mode=exclusive_process:mps=yes"
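As an illustration of the checkpointing advice above, the sketch below shows one possible pattern: write the checkpoint to a temporary file and rename it, so that a job killed in the middle of writing never leaves a truncated checkpoint, and resume from the saved epoch at startup. The function names, the file name and the "epoch" key are assumptions made for this example. The save and load functions are passed in as arguments; in a PyTorch script they would typically be torch.save and torch.load.

```python
import os


def atomic_save(state, path, save_fn):
    """Save state to path via a temporary file and an atomic rename."""
    tmp_path = f"{path}.tmp"
    save_fn(state, tmp_path)
    # os.replace overwrites the previous checkpoint in a single step.
    os.replace(tmp_path, path)


def load_or_start(path, load_fn):
    """Return (first_epoch_to_run, saved_state or None)."""
    if not os.path.exists(path):
        return 0, None
    state = load_fn(path)
    return state["epoch"] + 1, state


def is_main_process(env=None):
    """True for the process with RANK 0 (or when not run under torchrun)."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "0")) == 0


# Hypothetical use inside a training script:
#   start_epoch, state = load_or_start("checkpoint.pt", torch.load)
#   for epoch in range(start_epoch, num_epochs):
#       ... train one epoch ...
#       if is_main_process():
#           atomic_save({"epoch": epoch, "model": model.state_dict()},
#                       "checkpoint.pt", torch.save)
```

Letting only the process with rank 0 write avoids several workers overwriting the same file at the same time.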
