AI/DL/ML corner


The High Performance Computing landscape is becoming more and more heterogeneous, as users from newly emerging fields discover that their computational needs can no longer be satisfied by their personal computers. Our experience suggests that users with an ML/DL/AI background share some specific needs when they start to familiarise themselves with the HPC environment. The process can be confusing at first, but in summary it can be sketched as follows:

  • Learning how to access the cluster, and move data to and from the cluster;
  • Setting up the computational environment on the cluster;
  • Developing/testing the models in interactive sessions;
  • Running the training/inference as batch jobs.

First steps with the cluster

With a cluster account, each user gets access to a personal storage area with a limited amount of space (the HOME directory, currently 30 GB), which is regularly backed up, and to a significant amount of computational resources, both CPU and GPU nodes.
The HOME directory should be used for important data, and for data/programs that are meant to be kept for a long time. For temporary data, we can provide scratch directories: they are not backed up, but significantly larger, and intended as a working area. To get one, just write to support@hpc.dtu.dk asking for a scratch directory, explaining what kind of processing you need to do and whether you plan to use GPUs for it.
All these filesystems are shared across all the nodes.
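As a quick sketch of how the two areas are typically used (the scratch path below is a placeholder; the actual path is assigned by the support team when the directory is created):

```shell
# Keep code and important results in the backed-up HOME directory
mkdir -p "$HOME/myproject"

# Put temporary/intermediate data in the scratch directory instead;
# /path/to/scratch is a placeholder for the path assigned to you
SCRATCH=/path/to/scratch/$USER
mkdir -p "$SCRATCH/myproject"
cd "$SCRATCH/myproject"
```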

Accessing the cluster
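Access to the cluster happens over SSH. A minimal sketch, where both the username and the login node hostname are placeholders to be replaced with your actual credentials:

```shell
# Log in to a cluster login node over SSH
# (s123456 and login.example.dtu.dk are placeholders)
ssh s123456@login.example.dtu.dk
```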

Moving data to and from the cluster
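Data can be moved with standard tools such as scp or rsync. A sketch, with placeholder username, hostname, and paths:

```shell
# Copy a local dataset to the cluster HOME directory
scp -r ./dataset s123456@login.example.dtu.dk:~/dataset

# rsync only transfers files that changed, which helps with
# large datasets and repeated transfers
rsync -av ./dataset/ s123456@login.example.dtu.dk:~/dataset/

# Copy results back from the cluster to the local machine
rsync -av s123456@login.example.dtu.dk:~/results/ ./results/
```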

Setting up the environment

A lot of software is installed on the cluster. Most of it is organized in modules, to allow several different versions of the same program to coexist. However, most of the software used in the ML/AI field is not pre-installed, mainly for two reasons:

  • the software used in this field is under continuous development, the lifetime of a specific version is very short, and a module would become obsolete very quickly;
  • the use cases are very heterogeneous, and the combinations of packages/tools needed by the users differ so much that no single module could fit everybody: users would have to install some special pieces of software anyway.

So the common practice is that users are responsible for installing the software they need.

Installing software
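A common workflow is to load a Python module and create a virtual environment, inside which the needed packages are installed with pip. A minimal sketch (the module version and the package names are examples only; check `module avail` for the versions actually installed):

```shell
# Load a Python module (exact names/versions: see `module avail`)
module load python3/3.11.4

# Create a virtual environment in the HOME directory and activate it
python3 -m venv ~/venvs/myproject
source ~/venvs/myproject/bin/activate

# Install the packages needed, e.g. for a PyTorch-based project
pip install torch torchvision
```

Keeping one virtual environment per project makes it easy to use different package versions side by side without conflicts.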

Interactive usage

Once the needed programs have been installed, the natural next step is to try out the environment. To do so, one has to access some special interactive nodes with GPUs. There are dedicated commands to get access to the different kinds of GPUs, listed at that page.
Those nodes have multiple GPUs installed, which are accessible in shared mode by all users simultaneously. This means that tasks from different users can and will interfere with each other. Therefore, it is good practice to use those nodes only for short runs that do not require a lot of GPU memory and do not use all the GPUs. In all other cases, the tasks should be submitted as batch jobs to the cluster, requesting the resources in exclusive mode.
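Since the GPUs are shared, it is worth checking how busy they already are before starting work on an interactive node, for example with nvidia-smi:

```shell
# Full overview of the GPUs on the node: running processes,
# utilisation, and memory usage
nvidia-smi

# Compact form: one line per GPU with utilisation and memory
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total \
           --format=csv
```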

Access to the GPUs is mostly controlled by the CUDA environment variable CUDA_VISIBLE_DEVICES, which is respected by most programs. When an interactive session is started, CUDA_VISIBLE_DEVICES is set to a single one of the available GPUs, to prevent users from accidentally overloading the node. If a user needs to use more than one GPU interactively, the variable has to be set manually. Please read the message in the terminal prompt to see how to do that.
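To illustrate how the variable works: CUDA_VISIBLE_DEVICES holds a comma-separated list of GPU indices, and CUDA applications only see (and can only use) the GPUs listed there.

```shell
# Restrict programs started from this shell to GPU 0 only
export CUDA_VISIBLE_DEVICES=0

# Make GPUs 0 and 1 visible instead
# (on a shared interactive node, only if they are actually free)
export CUDA_VISIBLE_DEVICES=0,1

# The setting is inherited by child processes; a program started now
# would see two GPUs, renumbered as 0 and 1
echo "$CUDA_VISIBLE_DEVICES"
```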

Finally, before running your program, remember to load the needed modules and to activate the virtual environment, if any. When an interactive session is started, the environment starts from a clean state.
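A typical start of an interactive session therefore looks something like this (module version and environment path are placeholders):

```shell
# The session starts from a clean state: reload the modules first
module load python3/3.11.4    # placeholder version, see `module avail`

# Then re-activate the virtual environment, if one is used
source ~/venvs/myproject/bin/activate
```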

Batch jobs

Long runs, runs that require multiple GPUs, or runs that need a lot of GPU or CPU memory should always be submitted as batch jobs, requesting the resources, and especially the GPUs, in exclusive mode.

An example of a jobscript that requests GPUs can be found at this page.
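As a rough sketch of the general shape of such a jobscript, assuming an LSF-based scheduler (the queue name, resource options, module version, and script name are all placeholders; the example referenced above is the authoritative version for this cluster):

```shell
#!/bin/sh
### Queue and job name (placeholders)
#BSUB -q gpuqueue
#BSUB -J train_model
### Request 1 GPU in exclusive process mode
#BSUB -gpu "num=1:mode=exclusive_process"
### CPU cores, memory per core, and walltime (placeholders)
#BSUB -n 4
#BSUB -R "rusage[mem=8GB]"
#BSUB -W 12:00
### Output and error files (%J expands to the job ID)
#BSUB -o train_%J.out
#BSUB -e train_%J.err

# Batch jobs also start from a clean state:
# load the modules and activate the virtual environment
module load python3/3.11.4
source ~/venvs/myproject/bin/activate

python train.py
```

If the scheduler is LSF, such a script is submitted with `bsub < jobscript.sh`.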

Most relevant jobscript options and tips

Best practice