GPU Usage

Step-by-Step guide

If you would like an overview of this topic, we have a YouTube playlist that sets up and runs an example deep learning workflow. You can follow along with our step-by-step guide. Note: The content of this guide is outdated.

Partitions and Hardware

There are multiple partitions with GPU nodes available. You will only have access to some of them, depending on what type of user account you have. An overview of all nodes with GPUs can be displayed by running the following command from a frontend node:

sinfo -o "%25N  %5c  %10m  %32f  %10G %18P " | grep gpu

For more details, see GPU Partitions.

Which Frontend to Use

Of the different login nodes available, two (glogin9 and glogin10; it is recommended to use the DNS alias glogin-gpu.hpc.gwdg.de) are dedicated to running GPU workloads. These are the closest match to the hardware of our GPU nodes, so it is strongly recommended to use them when writing and submitting GPU jobs, and especially when compiling software. While it is technically possible to do so from the other login nodes as well, it is not recommended and may cause problems. For example, many GPU-specific software modules are not even available on other login (or compute) nodes.

Getting Access

Nodes can be accessed using the respective partitions. On the shared partitions (see GPU Partitions), you have to specify how many GPUs you need with the -G x option, where x is the number of GPUs you want to access. If you do not use MPI, please use the -c #cores parameter to select the needed CPUs. Note that on non-shared partitions, your jobs will use nodes exclusively instead of sharing them with other jobs. Even if you request fewer than e.g. 4 GPUs, you will still be billed for all GPUs in your reserved nodes! Note that on the partitions where the GPUs are split into slices via MIG, -G x requests x slices. To explicitly request an 80 GB GPU, please add the option --constraint=80gb to your jobscript.

Example

The following command gives you access to 32 cores and two A100 GPUs:

srun -p grete:shared --pty -n 1 -c 32 -G A100:2 bash
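
If you specifically need the 80 GB variant of the A100, the --constraint option mentioned above can be combined with the same request (a sketch, otherwise identical to the previous command):

srun -p grete:shared --constraint=80gb --pty -n 1 -c 32 -G A100:2 bash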

If you want to run multiple concurrent programs, each using one GPU, here is an example:

#!/bin/bash
#SBATCH -p grete                   # Exclusive partition: the whole node with its 4 GPUs is allocated
#SBATCH -t 12:00:00                # Estimated runtime
#SBATCH -N 1                       # One node

# Start four concurrent job steps, each with one task, 16 cores and one GPU,
# then wait until all of them have finished.
srun --exact -n1 -c 16 -G1 --mem-per-cpu 19075M  ./single-gpu-program &
srun --exact -n1 -c 16 -G1 --mem-per-cpu 19075M  ./single-gpu-program &
srun --exact -n1 -c 16 -G1 --mem-per-cpu 19075M  ./single-gpu-program &
srun --exact -n1 -c 16 -G1 --mem-per-cpu 19075M  ./single-gpu-program &
wait

More explanation for the above example can be found here.

Requesting GPUs

[Figure: diagram of a Grete GPU node]

Each Grete node consists of 4 A100 GPUs with either 40 GiB or 80 GiB of GPU memory (VRAM), or 8 A100 GPUs with 80 GiB of VRAM (only in grete:shared). In these partitions, you can only request whole GPUs.

  • On non-shared partitions (e.g. grete, grete-h100, …) you automatically block the whole node, getting all 4 GPUs on each node. This is best suited for very large jobs. There are no nodes with 8 GPUs in these partitions.
  • The other GPU partitions are shared; you can choose how many GPUs you need. Requesting fewer than four (or eight) GPUs means more than one job can run on a node simultaneously (particularly useful for job arrays).

It is possible to request more or less system memory than the default. If you wanted 20 GiB of RAM, you could use the additional Slurm argument --mem=20G. We recommend not explicitly requesting memory or CPU cores at all; in most cases, Slurm will assign an appropriate amount, proportional to the number of GPUs you reserved.

Warning

Make sure to be fair and to not use more than your proportional share of system memory or CPU cores!

For example, if you only need 1 GPU, you should only use up to a fourth of the usable memory (see below for details). See the tables on this page for the hardware specifications of the nodes in each partition. If you request too much, there are not enough resources left to schedule other jobs, even though there are unused GPUs in the node (unless someone goes out of their way to explicitly request less than the default share of CPU cores or memory).

It is important to note that you cannot just divide the value in the RAM per node column by 4! A few GiB of memory are reserved by the hardware for bookkeeping, error correction and the like, so of 512 GiB, the operating system (OS) usually only sees about 503 GiB. The OS itself needs some memory to function correctly, so Slurm usually reserves around 20 GiB for that. This means that the maximum amount of memory actually usable for jobs is only ~480 GiB on a 512 GiB node.
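
To check how much memory Slurm actually considers usable on a given node, you can query the node description directly (ggpu146 is only a placeholder node name here; RealMemory is reported in MiB):

scontrol show node ggpu146 | grep -o "RealMemory=[0-9]*"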

The example script below requests two A100 GPUs:

#!/bin/bash

#SBATCH --job-name=train-nn-gpu
#SBATCH -t 05:00:00                  # Estimated time, adapt to your needs
#SBATCH --mail-type=all              # Send mail when job begins and ends

#SBATCH -p grete:shared              # The partition
#SBATCH -G A100:2                    # Request 2 GPUs

module load miniforge3
module load gcc
module load cuda
source activate dl-gpu               # Replace with the name of your miniforge/conda environment

# Print out some info.
echo "Submitting job with sbatch from directory: ${SLURM_SUBMIT_DIR}"
echo "Home directory: ${HOME}"
echo "Working directory: $PWD"
echo "Current node: ${SLURM_NODELIST}"

# For debugging purposes.
python --version
python -m torch.utils.collect_env
nvcc -V

# Run the script:
python -u train.py
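
Save the script under a name of your choice (train-nn-gpu.sh is just a placeholder) and submit it from a frontend node with:

sbatch train-nn-gpu.sh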

Interactive Usage and GPU Slices on grete:interactive

Whole A100 GPUs are powerful, but you might not need all the power. For instance, debugging a script until it runs might require a GPU, but not the full power of an A100. To this end, the grete:interactive partition is provided. The idea is that you can come into the office, log into the cluster, and use this for development and testing. When you are confident your script works, you should submit larger workloads as a “hands-off” batch job, making use of one or more full GPUs (multiples of 4 for grete).

The interactive partition contains A100 GPUs that are split into slices with the NVIDIA Multi-Instance-GPU (MIG) technology. Each GPU slice consists of a computation part (a subset of the streaming multiprocessors) and some memory.

[Figure: diagram of a Grete node with MIG slices]

An A100, for example, can be split into 7 compute units of 14 streaming multiprocessors (SMs) each; when MIG is used, 10 of the 108 SMs are reserved for management and are not available for computation. We currently split the 80 GiB A100s into six 1g.10gb slices and one 1g.20gb slice, which means that each slice has one compute unit (“1g”) and 10 GiB or 20 GiB of GPU memory (VRAM). For an interactive node with 4 such A100 GPUs, this means there are 28 slices per node in total.

The configuration of MIG slices might be subject to change, depending on the load of the cluster and requirements reported to us by our users. Use scontrol show node <nodename> to see the type and number of slices a node currently offers. MIG slices are configured by an administrator and cannot be changed by job scripts; you have to pick one of the sizes on offer.
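
For example, the configured slices show up in the Gres line of the node description (the node name is only a placeholder):

scontrol show node ggpu146 | grep -i gres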

Instead of requesting a whole GPU with e.g. -G A100:n, as you would in the other GPU partitions, you request MIG slices in the same format Nvidia uses, i.e. -G 1g.10gb:n. The :n for the number of slices at the end is optional; if you do not specify it, you will get one such slice. At the moment, that is also the maximum allowed on grete:interactive. The following interactive Slurm example requests a 1g.10gb slice:

srun --pty -p grete:interactive -G 1g.10gb /bin/bash
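
Inside the resulting shell, you can verify which slice was assigned to your job, for example with:

nvidia-smi -L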

Monitoring

Once you have submitted a job, you can check its status and where it runs with

squeue --me

In Your Script

Many packages that use GPUs provide a way to check the resources that are available. For example, if you do deep learning using PyTorch, you can always check that the correct number of resources are available with torch.cuda.is_available().

Note that in PyTorch, the command torch.cuda.get_device_name() does not work with MIG-split GPUs and will give you an error.
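
A quick check from the shell (assuming an environment with PyTorch installed, such as the dl-gpu environment used in the batch script above) could look like this:

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"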

Using nvitop

To monitor your resource usage on the node itself, you can use nvitop.

  1. Check where your job runs with squeue --me
  2. Log into the node. For instance, if your job runs on ggpu146, use ssh ggpu146
  3. On the node, run
    module load py-nvitop
    nvitop

[Screenshot: example nvitop output]

In this example output, you can see your

  • GPU compute usage/utilization (top) as UTL
  • GPU memory usage (top) as MEM
  • CPU usage (bottom), listed under your abbreviated user name.

Software and Libraries

Cuda Libraries

To load CUDA, do

module load gcc/VERSION cuda/VERSION

If you don’t specify the VERSION, the defaults (12.6.2 for cuda) will be used.

Note

Due to the hierarchical nature of the module system, you can only load the cuda module after the corresponding version of the compiler, in this case gcc, has been loaded. Use module spider cuda to see all available versions, followed by module spider cuda/VERSION to see which gcc version has to be loaded before a given cuda version.
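
For example, to find and load a matching compiler/CUDA pair (GCCVERSION is a placeholder; use the gcc version that module spider reports for your chosen cuda version):

module spider cuda                 # list all available cuda versions
module spider cuda/12.6.2          # shows which gcc module must be loaded first
module load gcc/GCCVERSION cuda/12.6.2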

Nvidia HPC SDK (Nvidia compiler)

The full Nvidia HPC SDK 23.3 is installed and usable via the modules nvhpc/23.3, nvhpc-byo-compiler/23.3, nvhpc-hpcx/23.3 and nvhpc-nompi/23.3.

This SDK includes the Nvidia compilers, the HPC-X OpenMPI, and several base libraries for CUDA-based GPU acceleration.
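
As a sketch, compiling a small OpenACC or CUDA program with the compilers from the SDK could look like this (prog.c and prog.cu are placeholder source files):

module load nvhpc/23.3
nvc -acc -Minfo=accel -o prog prog.c     # Nvidia C compiler with OpenACC offloading
nvcc -o prog_cuda prog.cu                # CUDA compiler shipped with the SDK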

Using MPI and other communication libraries

We have several CUDA-enabled OpenMPI versions available: the HPC-X OpenMPI in the module nvhpc-hpcx/23.3, the regular OpenMPI 3.1.5 from the Nvidia HPC SDK in nvhpc/23.3, and the OpenMPI included in the Nvidia/Mellanox OFED stack, provided by the module openmpi-mofed/4.1.5a1.

Additionally, the libraries NCCL 12.0, NVSHMEM 12.0 and OpenMPI 4.0.5 are available as part of the Nvidia HPC SDK in /sw/compiler/nvidia/hpc_sdk/Linux_x86_64/23.3/comm_libs/.
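
As a rough sketch, a multi-node, multi-GPU MPI job using the HPC-X OpenMPI could be submitted like this (the program name is a placeholder, and the exact launch mechanism may differ for your application):

#!/bin/bash
#SBATCH -p grete
#SBATCH -N 2                        # two full nodes with 4 GPUs each
#SBATCH --ntasks-per-node=4         # one MPI rank per GPU
#SBATCH -t 02:00:00

module load nvhpc-hpcx/23.3
mpirun ./my_cuda_mpi_program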

Singularity/Apptainer Containers

Apptainer (see Apptainer (formerly Singularity)) supports “injecting” NVIDIA drivers plus the essential libraries and executables into running containers. When you run SIF containers, pass the --nv option. An example running the container FOO.sif would be:

#!/bin/bash
#SBATCH -p grete:shared              # the partition
#SBATCH -G A100:1                    # For requesting 1 GPU.
#SBATCH -c 4                         # Requesting 4 CPU cores.

module load apptainer/VERSION
module load gcc/GCCVERSION
module load cuda/CUDAVERSION

apptainer exec --nv --bind /scratch FOO.sif nvidia-smi    # replace nvidia-smi with the command to run inside the container

where VERSION is the desired version of Apptainer, GCCVERSION the necessary prerequisite gcc version for your chosen version of cuda and CUDAVERSION is the specific version of CUDA desired.

For more information, see the Apptainer GPU documentation.

Using Conda/Mamba/Miniforge

Conda (replaced on our clusters by miniforge3) and Mamba are package managers that can make installing and working with various GPU-related packages a lot easier. See Python for information on how to set them up.
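
A minimal sketch for creating an environment like the dl-gpu one used in the batch script above (package choice and versions are only an example; pick the PyTorch build matching the CUDA version you load):

module load miniforge3
conda create -n dl-gpu python=3.11 -y
source activate dl-gpu
pip install torch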

Training Resources (Step-by-Step)

We have regular courses on GPU usage and deep learning workflows. Our materials are online for self-studying.