The High Performance Computing landscape is becoming more and more heterogeneous, as users from newly emerging fields discover that their computational needs can no longer be satisfied by their own personal computer. Our experience suggests that users from ML/DL/AI curricula share some specific needs when starting to familiarise themselves with the HPC environment. This process can be a bit confusing, but in summary it can be sketched as follows:
- Learning how to access the cluster, and move data to and from the cluster;
- Setting up the computational environment on the cluster;
- Developing/testing the models in interactive sessions;
- Running the training/inference as batch jobs.
First steps with the cluster
With a cluster account, each user gets access to a personal storage area with a limited amount of space (the HOME directory, currently 30 GB), which is regularly backed up, and to a significant amount of computational resources, both CPU and GPU nodes.
The HOME directory should be used for important data, and for data/programs that need to be kept for a long time. For temporary data, we can provide temporary space in the form of scratch directories, with no backup but significantly larger, to be used as a working area. To get one, just write to support@hpc.dtu.dk asking for a scratch directory, explaining what kind of processing you need to do, and whether you plan to use GPUs for it.
All those filesystems are shared across all the nodes.
Accessing the cluster
All DTU students and employees can access the cluster with their DTU credentials. The practical details on how to access the cluster can be found here.
Access from outside the DTU physical network requires the DTU VPN, or a previously configured SSH key-pair authentication. The instructions can be found at this page.
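As an illustration (the login node hostname is a placeholder; use the one given in the instructions linked above):

```bash
# Log in with the DTU credentials
ssh <dtu-username>@<login-node>

# Optionally, set up ssh-key pair authentication for access from outside DTU:
# generate a key pair locally and copy the public key to the cluster
ssh-keygen -t ed25519
ssh-copy-id <dtu-username>@<login-node>
```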
Moving data to and from the cluster
Data are a key ingredient in the typical ML/AI workflow, and getting data to and from the cluster is important.
Each node of the cluster has direct access to the internet, so data from public repositories can be downloaded directly to the cluster. However, moving data directly between a local machine and the cluster requires a program that supports the secure copy protocol (scp). On Unix-derived systems the command-line `scp` works fine, and for all platforms there are tools with a Graphical User Interface (GUI), like FileZilla (multi-platform), WinSCP (Windows), or Cyberduck (macOS, Windows). For more advanced usage, have a look at `rsync`.
NOTE: when transferring files, use transfer.gbar.dtu.dk as the server. This server is better connected to the outside network and to the local filesystems, and it is dedicated to data transfer.
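For example, with the command-line `scp` a single file can be copied like this (the username and file names are placeholders):

```bash
# From the local machine to the HOME directory on the cluster
scp ./dataset.zip <dtu-username>@transfer.gbar.dtu.dk:~/

# From the cluster back to the current directory on the local machine
scp <dtu-username>@transfer.gbar.dtu.dk:~/results.tar.gz .
```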
Setting up the environment
A lot of software is installed on the cluster. Most of the software is organized in modules, to allow the coexistence of several different versions of the same program. However, most of the software that is used in the ML/AI field is not pre-installed. This is mostly due to two facts:
- the software used in this field is in continuous development, the lifetime of a specific version is very short, and a module would become obsolete very quickly;
- the use-cases are very heterogeneous, and the combinations of packages/tools needed by the users differ so much that a single module would require the users to install some special pieces of software anyway.
So the common practice is that users have the responsibility of installing the software they need.
Installing software
The first step is certainly python. Several versions of python are installed on the cluster as modules. Check the python version requirements of the programs you want to use: older packages, especially, are probably not compatible with the latest python version.
Users are allowed to install any software (complying with the license usage conditions) in their personal space on the cluster. There is no need for special privileges to do that. Our suggestion is to create a virtual environment, so that the installation of a specific package version will not interfere with other pre-installed packages. We provide some specific instructions for python environments and package installation in our python-dedicated page.
NOTES:
- One cannot mix packages from different python versions: a package installed with python version 3.9.x cannot be used with python 3.10.y.
- Despite what one can find in guides on the web, it is not a good idea to mix conda and pip installations.
- The filesystem is shared, so any software only needs to be installed once, usually in an interactive session. It can then be used from any node, in interactive or batch mode.
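As a sketch of the typical setup (the python module version, environment name, and package are just examples; adapt them to your needs):

```bash
# Load a python module (check "module avail python3" for the installed versions)
module load python3/3.10.12

# Create a virtual environment in the HOME directory and activate it
python3 -m venv ~/myenv
source ~/myenv/bin/activate

# Install packages inside the environment
pip install --upgrade pip
pip install numpy
```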
Two of the most commonly used packages require some special attention: tensorflow and pytorch.
tensorflow
tensorflow can easily be installed on the cluster, but it has some dependencies (CUDA, cudnn, and tensorrt) that change from (sub-)version to (sub-)version. The instructions on the tensorflow website are generic, so we maintain a tensorflow page that we try to keep up-to-date.
Please have a look at that page; feedback is always welcome (support@hpc.dtu.dk).
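Once installed, a quick sanity check that tensorflow actually sees a GPU (run inside the activated environment, in an interactive GPU session; assumes tensorflow 2.x) could be:

```bash
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```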
pytorch
pytorch can also be installed easily on the cluster. Versions 1.x do not depend on external libraries, but come with their own CUDA runtime, so the installation in a virtual environment is straightforward. More details, especially on how to run it in batch mode, can be found at our pytorch page.
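As a minimal sketch (the exact install command depends on the pytorch/CUDA combination you need; see the pytorch page for the recommended one):

```bash
# Inside the activated virtual environment
pip install torch

# Quick check that pytorch sees the GPU (run in an interactive GPU session)
python -c "import torch; print(torch.cuda.is_available())"
```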
Interactive usage
Once the needed programs have been installed, the natural step is to try out the environment. To do so, one has to access some special interactive nodes with GPUs. There are special commands to get access to different kinds of GPUs, listed at that page.
Those nodes have multiple GPUs installed, which are accessible in shared mode by all the users simultaneously. This means that tasks from different users can and will interfere with each other. Therefore, it is good practice to use those nodes only for short runs, runs which do not require a lot of GPU memory, and runs which do not use all the GPUs. In all other cases, the tasks should be submitted as batch jobs to the cluster, requesting the resources in exclusive mode.
Access to the GPUs is mostly controlled by the CUDA environment variable `CUDA_VISIBLE_DEVICES`, which is respected by most programs. When an interactive session is started, the `CUDA_VISIBLE_DEVICES` variable is set to only one of the available GPUs, to prevent users from accidentally overloading the node. If a user needs more than one GPU interactively, the variable has to be set manually. Please read the message in the terminal prompt to see how to do that.
Finally, before running your program, remember to load the modules and to activate the virtual environment needed, if any. When an interactive session is started, the environment starts from a clean state.
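For example, the start of an interactive GPU session could look like this (module version, environment path, and GPU indices are placeholders):

```bash
# Check which GPU(s) the session has been assigned
echo $CUDA_VISIBLE_DEVICES

# Only if you really need more than one GPU: set the variable manually,
# following the instructions printed in the terminal prompt
export CUDA_VISIBLE_DEVICES=0,1

# Re-load the modules and re-activate the virtual environment
module load python3/3.10.12
source ~/myenv/bin/activate
```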
Batch jobs
Long runs, runs which require multiple GPUs, or a lot of GPU and CPU memory should always be submitted as batch jobs, requesting the resources, and especially the GPUs, in exclusive mode.
An example of jobscript that requests GPUs can be found at this page.
Most relevant jobscript options and tips
The sections which are most relevant are:
```bash
### -- specify queue --
#BSUB -q gpuv100
### -- ask for number of cores (default: 1) --
#BSUB -n 1
### -- Select the resources: 1 gpu in exclusive process mode --
#BSUB -gpu "num=1:mode=exclusive_process"
### -- set walltime limit: hh:mm -- maximum 24 hours for GPU-queues right now
#BSUB -W 1:00
# request 5GB of system-memory
#BSUB -R "rusage[mem=5GB]"
```
In order:
- The `-q` option is needed to select the queue: use the `classstat` and `nodestat` commands to get a list and details on the specific hardware.
- The `-n` option specifies how many CPU cores you want to reserve in total. You need at least 1 CPU core for each GPU you are requesting, but most likely you will get better performance requesting more cores.
- The `-gpu` line is the one that requests the GPUs. Leave the line like this (i.e. always request the GPUs in `exclusive_process` mode), and only change the `"num=X"` part. Remember that the GPUs are counted per node.
- The `-W` option sets the walltime after which the program will be killed. Each queue has a maximum walltime that cannot be exceeded.
- The `-R "rusage[mem=5GB]"` line specifies how much memory per CPU core you are requesting. This is the machine RAM, not the GPU memory. If you request a GPU in `exclusive_process` mode, you have access to the whole GPU memory.
- The `CUDA_VISIBLE_DEVICES` variable is set automatically when the job starts.
Finally, remember that when the job is submitted, the environment is cleaned. So if the program you want to run needs modules, an activated environment, or special environment variables, to give some examples, you need to put the commands for all those operations in your jobscript, before the command that starts the program itself.
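As a sketch (queue name, resource numbers, module version, environment path, and script name are placeholders; adapt them to your setup), a complete jobscript combining the options above with the environment setup could look like this:

```bash
#!/bin/sh
### -- specify queue --
#BSUB -q gpuv100
### -- set the job name --
#BSUB -J train_model
### -- ask for number of cores (default: 1) --
#BSUB -n 4
### -- Select the resources: 1 gpu in exclusive process mode --
#BSUB -gpu "num=1:mode=exclusive_process"
### -- set walltime limit: hh:mm --
#BSUB -W 4:00
### -- request 5GB of system-memory per core --
#BSUB -R "rusage[mem=5GB]"
### -- send the output and error files to the current directory --
#BSUB -o train_%J.out
#BSUB -e train_%J.err

# The job starts from a clean environment: re-load the modules
# and re-activate the virtual environment used at install time
module load python3/3.10.12
source ~/myenv/bin/activate

# Start the actual program
python train.py
```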
Detailed instructions on the jobscripts syntax can be found at this page, and the general instructions on how to submit jobs and other useful commands are collected at this one.
Best practice
Based on our experience, here are some hints:
Data
- Do not put too many files in the same directory: file access becomes very slow when a directory contains too many entries. As a rule of thumb, keep it to no more than a few thousand files per directory.
Program/environment setup
- Remember to install the program only once: do not install programs in the jobscript.
- Check if you have leftovers from old courses/projects in your configuration files, which could pollute the environment and prevent your setup from working.
GPU usage
- When you request a GPU, make sure that your script actually makes use of it: test a short example in an interactive session (see the sketch after this list).
- Do not request multiple GPUs if you are not sure that your script can use them: check the library documentation, and test a short example in an interactive session.
- Find a good balance between the number of GPUs and the number of CPU cores. Here are some hints.
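A simple way to verify the GPU usage in an interactive session is to start the short test example and watch the GPU utilization with `nvidia-smi` while it runs (the script name is a placeholder):

```bash
# Run the test example in the background, then refresh nvidia-smi every 2 seconds
python train.py &
watch -n 2 nvidia-smi
```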
General
- Do not assume that the script that works on a single node also works requesting multiple nodes. It is almost never the case.
- Always specify in your script how many threads it should use (see the sketch at the end of this page): python libraries are not good at guessing the right number.
- If the framework you are using supports it, make sure to enable the regular checkpointing/restart feature.
- When in doubt, write to us at support@hpc.dtu.dk.
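As a sketch of the thread-count hint above (assuming a python stack backed by OpenMP/MKL; the value 4 is just an example and should match the number of CPU cores requested in the jobscript):

```bash
# Put these lines in the jobscript, before starting the python program
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
python train.py
```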