The High Performance Computing landscape is becoming more and more heterogeneous, as users from newly emerging fields discover that their computational needs can no longer be satisfied by their own personal computer. Our experience suggests that users from ML/DL/AI curricula share some specific needs when starting to familiarise themselves with the HPC environment. This process can be a bit confusing, but in summary it can be sketched as follows:
- Learning how to access the cluster, and move data to and from the cluster;
- Setting up the computational environment on the cluster;
- Developing/testing the models in interactive sessions;
- Running the training/inference as batch jobs.
First steps with the cluster
With a cluster account, each user gets access to a personal storage area with a limited amount of space (the HOME directory, currently 30 GB), which is regularly backed up, and to a significant amount of computational resources, both CPU and GPU nodes.
The HOME directory should be used for important data, and for data/programs that need to be kept for a long time. For temporary data, we can provide temporary space in the form of scratch directories, with no backup but significantly larger, to be used as a working area. To get one, just write to support@hpc.dtu.dk asking for a scratch directory, explaining what kind of processing you need to do, and whether you plan to use GPUs for it.
All those filesystems are shared across all the nodes.
Accessing the cluster
All DTU students and employees can access the cluster with their DTU credentials. The practical details on how to access the cluster can be found here.
Access from outside the DTU physical network requires the DTU VPN, or a previously configured SSH key-pair authentication. The instructions can be found at this page.
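As an illustration (the login node hostname is a placeholder; use the one given in the instructions linked above):

```bash
# Log in with the DTU credentials
ssh <dtu-username>@<login-node>

# Optionally, set up ssh-key pair authentication for access from outside DTU:
# generate a key pair locally and copy the public key to the cluster
ssh-keygen -t ed25519
ssh-copy-id <dtu-username>@<login-node>
```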
Moving data to and from the cluster
Data are a key ingredient in the typical ML/AI workflow, and getting data to and from the cluster is important.
Each node of the cluster has direct access to the internet, so data from public repositories can be downloaded directly to the cluster. However, moving data directly between a local machine and the cluster requires a program that supports the secure copy protocol (scp). On Unix-derived systems the command-line `scp` works fine, and for all platforms there are tools with a Graphical User Interface (GUI), like FileZilla (multi-platform), WinSCP (Windows), or Cyberduck (macOS, Windows). For more advanced usage, have a look at `rsync`.
NOTE: when transferring files, use transfer.gbar.dtu.dk as the server. This server is better connected to the outside network and to the local filesystems, and it is dedicated to data transfer.
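For example, with the command-line `scp` a single file can be copied like this (the username and file names are placeholders):

```bash
# From the local machine to the HOME directory on the cluster
scp ./dataset.zip <dtu-username>@transfer.gbar.dtu.dk:~/

# From the cluster back to the current directory on the local machine
scp <dtu-username>@transfer.gbar.dtu.dk:~/results.tar.gz .
```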
Setting up the environment
A lot of software is installed on the cluster. Most of the software is organized in modules, to allow the coexistence of several different versions of the same program. However, most of the software that is used in the ML/AI field is not pre-installed. This is mostly due to two facts:
- the software used in this field is in continuous development, the lifetime of a specific version is very short, and a module would become obsolete very quickly;
- the use-cases are very heterogeneous, and the combinations of packages/tools needed by the users differ so much that a single module would require the users to install some special pieces of software anyway.
So the common practice is that users have the responsibility of installing the software they need.
Installing software
The first step is certainly python. Several versions of python are installed on the cluster as modules. Check the python version requirements of the programs you want to use: older packages, especially, are probably not compatible with the latest python version.
Users are allowed to install any software (complying with the license usage conditions) in their personal space on the cluster. There is no need for special privileges to do that. Our suggestion is to create a virtual environment, so that the installation of a specific package version will not interfere with other pre-installed packages. We provide some specific instructions for python environments and package installation in our python-dedicated page.
NOTES:
- One cannot mix packages from different python versions: a package installed with python version 3.9.x cannot be used with python 3.10.y.
- Despite what one can find in guides on the web, it is not a good idea to mix conda and pip installations.
- The filesystem is shared, so any software only needs to be installed once, usually in an interactive session. It can then be used from any node, in interactive or batch mode.
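As a sketch of the typical setup (the python module version, environment name, and package are just examples; adapt them to your needs):

```bash
# Load a python module (check "module avail python3" for the installed versions)
module load python3/3.10.12

# Create a virtual environment in the HOME directory and activate it
python3 -m venv ~/myenv
source ~/myenv/bin/activate

# Install packages inside the environment
pip install --upgrade pip
pip install numpy
```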
Two of the most commonly used packages require some special attention: tensorflow and pytorch.
tensorflow
tensorflow can easily be installed on the cluster, but it has some dependencies (CUDA, cudnn, and tensorrt) that change from (sub-)version to (sub-)version. The instructions on the tensorflow website are generic, so we maintain a tensorflow page that we try to keep up-to-date.
Please have a look at that page; feedback is always welcome (support@hpc.dtu.dk).
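Once installed, a quick sanity check that tensorflow actually sees a GPU (run inside the activated environment, in an interactive GPU session; assumes tensorflow 2.x) could be:

```bash
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```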
pytorch
pytorch can also be installed easily on the cluster. Versions 1.x do not depend on external libraries, but come with their own CUDA runtime, so the installation in a virtual environment is straightforward. More details, especially on how to run it in batch mode, can be found at our pytorch page.
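As a minimal sketch (the exact install command depends on the pytorch/CUDA combination you need; see the pytorch page for the recommended one):

```bash
# Inside the activated virtual environment
pip install torch

# Quick check that pytorch sees the GPU (run in an interactive GPU session)
python -c "import torch; print(torch.cuda.is_available())"
```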
Interactive usage
Once the needed programs have been installed, the natural step is to try out the environment. To do so, one has to access some special interactive nodes with GPUs. There are special commands to get access to different kinds of GPUs, listed at that page.
Those nodes have multiple GPUs installed, which are accessible in shared mode by all the users simultaneously. This means that tasks from different users can and will interfere with each other. Therefore, it is good practice to use those nodes only for short runs, runs which do not require a lot of GPU memory, and runs which do not use all the GPUs. In all other cases, the tasks should be submitted as batch jobs to the cluster, requesting the resources in exclusive mode.
Access to the GPUs is mostly controlled by the CUDA environment variable `CUDA_VISIBLE_DEVICES`, which is respected by most programs. When an interactive session is started, the `CUDA_VISIBLE_DEVICES` variable is set to only one of the available GPUs, to prevent users from accidentally overloading the node. If a user needs more than one GPU interactively, the variable has to be set manually. Please read the message in the terminal prompt to see how to do that.
Finally, before running your program, remember to load the modules and to activate the virtual environment needed, if any. When an interactive session is started, the environment starts from a clean state.
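For example, the start of an interactive GPU session could look like this (module version, environment path, and GPU indices are placeholders):

```bash
# Check which GPU(s) the session has been assigned
echo $CUDA_VISIBLE_DEVICES

# Only if you really need more than one GPU: set the variable manually,
# following the instructions printed in the terminal prompt
export CUDA_VISIBLE_DEVICES=0,1

# Re-load the modules and re-activate the virtual environment
module load python3/3.10.12
source ~/myenv/bin/activate
```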
Batch jobs
Long runs, runs which require multiple GPUs, or a lot of GPU and CPU memory should always be submitted as batch jobs, requesting the resources, and especially the GPUs, in exclusive mode.
An example of jobscript that requests GPUs can be found at this page.
Most relevant jobscript options and tips
The sections which are most relevant are:
```bash
### -- specify queue --
#BSUB -q gpuv100
### -- ask for number of cores (default: 1) --
#BSUB -n 1
### -- Select the resources: 1 gpu in exclusive process mode --
#BSUB -gpu "num=1:mode=exclusive_process"
### -- set walltime limit: hh:mm -- maximum 24 hours for GPU-queues right now
#BSUB -W 1:00
# request 5GB of system-memory
#BSUB -R "rusage[mem=5GB]"
```
In order:
- The `-q` option is needed to select the queue: use the `classstat` and `nodestat` commands to get a list and details on the specific hardware.
- The `-n` option specifies how many CPU cores you want to reserve in total. You need at least 1 CPU core for each GPU you are requesting, but most likely you will get better performance requesting more cores.
- The `-gpu` line is the one that requests the GPUs. Leave the line like this (i.e. always request the GPUs in `exclusive_process` mode), and only change the `"num=X"` part. Remember that the GPUs are counted per node.
- The `-W` option sets the walltime after which the program will be killed. Each queue has a maximum walltime that cannot be exceeded.
- The `-R "rusage[mem=5GB]"` line specifies how much memory per CPU core you are requesting. This is the machine RAM, not the GPU memory. If you request a GPU in `exclusive_process` mode, you have access to the whole GPU memory.
- The `CUDA_VISIBLE_DEVICES` variable is set automatically when the job starts.
Finally, remember that when the job is submitted, the environment is cleaned. So if the program you want to run needs modules, an activated environment, or special environment variables, to give some examples, you need to put the commands for all those operations in your jobscript, before the command that starts the program itself.
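As a sketch (queue name, resource numbers, module version, environment path, and script name are placeholders; adapt them to your setup), a complete jobscript combining the options above with the environment setup could look like this:

```bash
#!/bin/sh
### -- specify queue --
#BSUB -q gpuv100
### -- set the job name --
#BSUB -J train_model
### -- ask for number of cores (default: 1) --
#BSUB -n 4
### -- Select the resources: 1 gpu in exclusive process mode --
#BSUB -gpu "num=1:mode=exclusive_process"
### -- set walltime limit: hh:mm --
#BSUB -W 4:00
### -- request 5GB of system-memory per core --
#BSUB -R "rusage[mem=5GB]"
### -- send the output and error files to the current directory --
#BSUB -o train_%J.out
#BSUB -e train_%J.err

# The job starts from a clean environment: re-load the modules
# and re-activate the virtual environment used at install time
module load python3/3.10.12
source ~/myenv/bin/activate

# Start the actual program
python train.py
```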
Detailed instructions on the jobscripts syntax can be found at this page, and the general instructions on how to submit jobs and other useful commands are collected at this one.
Best practice
Based on our experience, here are some hints:
Data
- Do not put too many files in the same directory: file access becomes very slow when a directory contains too many entries. As a rule of thumb, keep it to no more than a few thousand files per directory.
Program/environment setup
- Remember to install the program only once: do not install programs in the jobscript.
- Check if you have leftovers from old courses/projects in your configuration files, which could pollute the environment and prevent your setup from working.
GPU usage
- When you request a GPU, make sure that your script actually makes use of it: test a short example in an interactive session (see the sketch after this list).
- Do not request multiple GPUs if you are not sure that your script can use them: check the library documentation, and test a short example in an interactive session.
- Find a good balance between the number of GPUs and the number of CPU cores. Here are some hints.
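A simple way to verify the GPU usage in an interactive session is to start the short test example and watch the GPU utilization with `nvidia-smi` while it runs (the script name is a placeholder):

```bash
# Run the test example in the background, then refresh nvidia-smi every 2 seconds
python train.py &
watch -n 2 nvidia-smi
```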
General
- Do not assume that the script that works on a single node also works requesting multiple nodes. It is almost never the case.
- Always specify in your script how many threads it should use (see the sketch at the end of this page): python libraries are not good at guessing the right number.
- If the framework you are using supports it, make sure to enable the regular checkpointing/restart feature.
- When in doubt, write to us at support@hpc.dtu.dk.
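As a sketch of the thread-count hint above (assuming a python stack backed by OpenMP/MKL; the value 4 is just an example and should match the number of CPU cores requested in the jobscript):

```bash
# Put these lines in the jobscript, before starting the python program
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
python train.py
```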