On this page you will find, in brief, the information related to the Pioneer Center computational resources: which resources are available, how to access them, how to set up the software environment, and how to submit and monitor jobs.
Available resources
The current hardware configuration consists of 7 Lenovo ThinkSystem SR665 V3 servers, each with 2 AMD EPYC 9354 32-Core Processors and 768 GB of RAM. Each node is equipped with 2 NVIDIA H100 PCIe GPU cards, with 80 GB of memory each.
The operating system is AlmaLinux (ver. 9.2), and the servers are under the control of a scheduling environment (LSF) to manage the concurrent workload of several simultaneous users.
Six of the servers are available for batch jobs, in a queue named p1.
One of the servers is reserved for interactive usage, and should be used for setting up models, preparing the computational environment, and anything else that needs interactive access.
The servers are connected to a 60 TiB storage area, accessible under /dtu/p1, which is reserved for P1 users. Each user gets a HOME directory (backed up) with a 30 GB quota limit, and a directory on the p1 storage, not backed up, with a 500 GB quota.
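To get a quick idea of how much space you are using, standard tools work from the login node; the cluster may also provide dedicated quota-reporting scripts, so check the general documentation. A minimal sketch:
# Total size of your HOME directory (counts against the 30 GB quota)
du -sh ~
# Overall usage of the shared P1 storage
df -h /dtu/p1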
Accessing the P1 resources
Access to the resources is restricted to the P1 authorized users. The procedure to get that authorization can be found in the P1 documentation.
From a practical perspective, access is allowed from the DTU network only, i.e. from the DTU physical network or using the DTU VPN. A dedicated login node has been set up:
login9.hpc.dtu.dk
This node can be accessed with any modern ssh client, and the fingerprints for the node are available here.
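For example, from a terminal on the DTU network or VPN (user123 below is a placeholder for your own DTU username):
# Log in to the P1 login node
ssh user123@login9.hpc.dtu.dk
# Add -X if you want X11 forwarding for graphical applications
ssh -X user123@login9.hpc.dtu.dk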
This node can be used to prepare and submit jobs and to do minor tasks, but it has no GPUs. To access the interactive GPU nodes, just type in a terminal the command
p1gpush
This will open a session on one of the interactive nodes. If you need graphical output from this session, start it with p1gpush -X.
Installing software/moving data
The general software stack is made available via a module system. In this starting phase, the amount of software available is quite limited; it will grow based on the users’ feedback and requests.
Users are allowed to install software in their HOME directory and set up their own environment, as long as this complies with the licensing restrictions of the software and the general service agreement for the usage of the cluster.
For more in-depth instructions, we refer to the general documentation that is most relevant for AI/DL/ML workflows. There you can also find information about transferring data to and from the system.
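As a minimal sketch of a data transfer through the login node with standard tools (user123 and the remote paths are placeholders; the exact layout of your directory under /dtu/p1 may differ):
# Copy a single file to your HOME directory on the cluster
scp mydata.tar.gz user123@login9.hpc.dtu.dk:~/
# Synchronize a local folder to your directory on the P1 storage
rsync -avz --progress results/ user123@login9.hpc.dtu.dk:/dtu/p1/user123/results/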
As a general rule, we do not provide modules for packages that can be easily installed in user space and are frequently updated, like TensorFlow or PyTorch, but as a notable exception we have two PyTorch versions installed as modules:
pytorch 2.0.1: module load nvidiamodulusbase/pytorch-2.0.1-python-3.11.4-znver4-alma92-aa
pytorch 2.1.0: module load nvidiamodulusbase/pytorch-2.1.0-python-3.11.6-znver4
These are two experimental modules compiler-optimized for the AMD Genoa CPU architecture.
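If you need extra Python packages on top of one of these modules, a common pattern is a virtual environment in your HOME directory that inherits the module’s packages. A sketch, assuming the module puts its Python on your PATH (the environment path and the extra package are only examples):
# Load one of the provided PyTorch modules
module load nvidiamodulusbase/pytorch-2.1.0-python-3.11.6-znver4
# Create a virtual environment that can still see the module's packages
python3 -m venv --system-site-packages ~/venvs/p1-env
source ~/venvs/p1-env/bin/activate
# Install additional packages in user space (example package)
pip install lightning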
Using the cluster
The available resources are split into two parts:
- An interactive queue: p1i
- A batch job queue: p1
A typical workflow usually involves both: get access to an interactive node with the command p1gpush (or p1gpush -X for a terminal with graphics support) to set up the environment, test/debug the code, and do lightweight pre-/post-processing.
Any production task has to be run as a batch job; see the example job script at the end of this section.
The status of the queue can be monitored with the command:
nodestat p1
nodestat, together with the other most useful commands to monitor the cluster and job status, is explained here.
The current limits for the p1 queue are:
- Maximum walltime: 72 hours;
- Maximum number of GPUs in a job: 2 (on the same node).
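As a sketch, an LSF submission script for the p1 queue could look like the following; the job name, resource numbers, GPU request syntax, and train.py are illustrative, so check the general documentation for the options recommended on this cluster:
#!/bin/sh
### Queue and job name
#BSUB -q p1
#BSUB -J my_training_job
### Number of cores, all on a single host
#BSUB -n 8
#BSUB -R "span[hosts=1]"
### Request one GPU in exclusive process mode (at most 2, on the same node)
#BSUB -gpu "num=1:mode=exclusive_process"
### Walltime (maximum 72 hours on p1)
#BSUB -W 24:00
### Memory per core
#BSUB -R "rusage[mem=8GB]"
### Output and error files (%J is replaced by the job ID)
#BSUB -o my_training_%J.out
#BSUB -e my_training_%J.err

# Load the software environment and run the production task (train.py is a placeholder)
module load nvidiamodulusbase/pytorch-2.1.0-python-3.11.6-znver4
python3 train.py
The script is submitted with bsub < jobscript.sh, and its status can be followed with bjobs (or with nodestat p1 for the queue as a whole).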
Support
The general documentation for the Pioneer Center cluster can be found here.
To get support with any question related to the cluster usage, write to support@hpc.dtu.dk, specifying “P1” in the subject.
For administrative questions write to compute-governance-p1@aicentre.dk.