PyTorch 1.x-2.x


PyTorch is a machine learning framework, based on the Torch library, that has gained a lot of traction in the machine learning community in recent years.

PyTorch is not installed on the cluster, but it can easily be installed in your personal directories, either HOME or any scratch filesystem.
We do not provide special installation instructions because PyTorch has no unusual external dependencies, so the general instructions found on the official website work smoothly.
As with any Python package, we recommend installing it in a virtual environment, following our cluster-specific instructions for Python.
After that, simply follow the instructions on the official PyTorch website, selecting the platform and tools you prefer to use. For example, for pip select the fields:

PyTorch:                   Build Stable (XXX)
Your OS:                   Linux
Package:                   Pip
Language:                  Python
Compute Platform:          CUDA XX

and use the command shown in the Run this Command field.
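
As a minimal sketch, assuming you have already created and activated a virtual environment as described in our Python instructions, and that CUDA 12.1 is the compute platform you selected (the environment path and the CUDA version are examples only; always copy the exact command from the Run this Command field):

    # activate the virtual environment created for this project (example path)
    source $HOME/venvs/pytorch/bin/activate

    # example command produced by the selector for pip + CUDA 12.1
    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121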

PyTorch jobscripts

PyTorch can use multiple GPUs, on a single node or across multiple nodes. Like most programs that rely on CUDA, the number of GPUs it is able to see is controlled by the CUDA_VISIBLE_DEVICES environment variable. When running in batch mode this variable is set automatically, so you do not have to set it manually, even if some guides found on the web suggest doing so.
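
As a quick sanity check, you can print the number of GPUs PyTorch actually sees from within your jobscript (a minimal sketch, to be run after activating the virtual environment where PyTorch is installed):

    # should match the number of GPUs requested from the scheduler
    python -c "import torch; print(torch.cuda.device_count())"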

Single node jobscript
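
A minimal sketch of a single-node jobscript, assuming an LSF queue named gpu, a node with 4 GPUs, and a training script called train.py (queue name, GPU count, walltime, virtual environment path and script name are placeholders to adapt to your setup):

    #!/bin/bash
    #BSUB -J pytorch_single
    #BSUB -q gpu
    #BSUB -n 4
    #BSUB -gpu "num=4:mode=exclusive_process"
    #BSUB -W 12:00
    #BSUB -o pytorch_%J.out
    #BSUB -e pytorch_%J.err

    # activate the virtual environment where PyTorch is installed
    source $HOME/venvs/pytorch/bin/activate

    # one worker process per GPU on this node; no rendezvous server is needed
    torchrun --standalone --nproc_per_node=4 train.py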

Multiple nodes jobscript
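
A sketch of a jobscript spanning two nodes with 4 GPUs each, again with placeholder values; it starts one torchrun instance per node and lets them rendezvous through one host of the allocation (the LSF variable LSB_HOSTS and the blaunch launcher are assumed to be available on the cluster):

    #!/bin/bash
    #BSUB -J pytorch_multi
    #BSUB -q gpu
    #BSUB -n 8
    #BSUB -R "span[ptile=4]"
    #BSUB -gpu "num=4:mode=exclusive_process"
    #BSUB -W 12:00
    #BSUB -o pytorch_%J.out
    #BSUB -e pytorch_%J.err

    source $HOME/venvs/pytorch/bin/activate

    # unique hosts in the allocation; the first one hosts the rendezvous server
    HOSTS=$(echo $LSB_HOSTS | tr ' ' '\n' | sort -u)
    MASTER=$(echo "$HOSTS" | head -n 1)

    # start one torchrun per node; each spawns one worker process per GPU
    for HOST in $HOSTS; do
        blaunch -z "$HOST" torchrun \
            --nnodes=2 --nproc_per_node=4 \
            --rdzv_backend=c10d --rdzv_endpoint=${MASTER}:29500 \
            train.py &
    done
    wait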

Best practice and troubleshooting

  • Please make sure that your script creates regular checkpoints, so that you do not have to start from scratch if the job is interrupted or one of the GPUs fails during execution.
  • If you request multiple GPUs, make sure that you also tell PyTorch to use all of them (see the example after this list).
  • Please only use the multiple nodes jobscript if you request all the GPUs on each node.
  • If you get an error like CUDA-capable device(s) is/are busy or unavailable, check that the number of GPUs you requested matches the torchrun --nproc_per_node option. If they already match and the error persists, request the GPUs with the syntax
    #BSUB -gpu "num=X:mode=exclusive_process:mps=yes"
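
As a concrete illustration of the last two points (a sketch with placeholder values, not tied to a specific jobscript), the GPU request and the torchrun launch line should always agree on the number of GPUs:

    # request 4 GPUs in exclusive mode with MPS enabled ...
    #BSUB -gpu "num=4:mode=exclusive_process:mps=yes"
    # ... and start exactly 4 worker processes, one per GPU
    torchrun --standalone --nproc_per_node=4 train.py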