GPU Usage

Step-by-Step Guide

If you would like an overview of this topic, we have a YouTube playlist that sets up and runs an example deep learning workflow, which you can follow along step by step:
https://www.youtube.com/playlist?list=PLvcoSsXFNRblM4AG5PZwY1AfYEW3EbD9O

Partitions and Hardware

There is one group of GPU nodes available on the Emmy system in Göttingen, with two more groups coming soon. These can be seen by running the following from a frontend node:

sinfo -o "%25N  %5c  %10m  %32f  %10G %18P " | grep gpu

A summary of the partitions is in the table below:

| Partition | Purpose | Nodes | Total Nodes | Scratch | CUDA Versions | GPU | CPU Cores | RAM |
|---|---|---|---|---|---|---|---|---|
| grete | for all, huge jobs | ggpu[104-136] | 33 | GPU-scratch | 11.7, 11.8, 12.0 | 4 x A100 | 64 | 512 GiB |
| grete:shared | for all, small jobs | ggpu[01-03,104-136,201-202] | 38 | GPU-scratch | 11.7, 11.8, 12.0 | 4 x V100 or 4 or 8 x A100 | 40, 64, or 128 | 512 GiB, 768 GiB, or 1 TiB |
| grete:interactive | for interactive testing (hyperparameters, …) | ggpu[01-03,101-103] | 6 | GPU-scratch | 11.7, 11.8, 12.0 | V100 or slices of A100 | 40 or 64 | 768 GiB or 512 GiB |
| grete:preemptible | for jobs that can be interrupted and resumed (“preempted”) | ggpu[01-03,101-103] | 6 | GPU-scratch | 11.7, 11.8, 12.0 | V100 or slices of A100 | 40 or 64 | 768 GiB or 512 GiB |

Depending on your needs, you will want different GPU partitions. As a rule of thumb: training a neural network and other large computations go to grete or grete:shared, whereas interactive testing of model hyperparameters belongs on grete:interactive.

The hardware and nodes are described in more detail:

  1. grete and grete:shared partitions with 33 and 38 nodes respectively (grete:shared includes the two 8-GPU nodes)
    • CUDA 11.7, 11.8, and 12.0
    • 2 x Mellanox Infiniband HDR host fabric adapters
    • ggpu[104-136]
      • 2 x AMD Zen3 EPYC 7513 CPUs (64 cores total per node)
      • 512 GiB memory
      • 2 x Mellanox Infiniband HDR host fabric adapters
      • 2 x 850 GiB local storage on solid state disks
      • 4 x NVIDIA Tesla A100 40GB
    • ggpu[201-202] (only part of grete:shared, not grete)
      • 2 x AMD Zen2 EPYC 7662 CPUs (128 cores total per node)
      • 1 TiB memory
      • 2 x Mellanox Infiniband HDR host fabric adapters
      • 2 x 850 GiB local storage on solid state disks
      • 8 x NVIDIA Tesla A100 80GB
    • ggpu01 (only part of grete:shared, not grete)
      • 2 x Intel Skylake Gold 6148 CPUs (40 cores per node)
      • 768 GiB memory
      • 1 x Mellanox Infiniband EDR host fabric adapter
      • 2 x 800 GiB local storage on a solid state disk
      • 4 x NVIDIA Tesla V100 32GB
    • ggpu[02-03] (only part of grete:shared, not grete)
      • 2 x Intel Cascade Lake Gold 6248 CPUs (40 cores per node)
      • 768 GiB memory
      • 1 x Mellanox Infiniband EDR host fabric adapter
      • 2 x 800 GiB local storage on a solid state disk
      • 4 x NVIDIA Tesla V100S 32GB
  2. grete:interactive and grete:preemptible partitions with 6 nodes
    • CUDA 11.7, 11.8, and 12.0
    • ggpu[101-103]
      • 2 x AMD Zen3 EPYC 7513 CPUs (64 cores total per node)
      • 512 GiB memory
      • 2 x Mellanox Infiniband HDR host fabric adapters
      • 2 x 850 GiB local storage on solid state disks
      • 4 x NVIDIA Tesla A100 40GB split into 3 slices each by MIG (12 GPU slices in total)
        • 2 x 2g.10gb slices on each GPU
        • 1 x 3g.20gb slice on each GPU
    • ggpu01
      • 2 x Intel Skylake Gold 6148 CPUs (40 cores per node)
      • 768 GiB memory
      • 1 x Mellanox Infiniband EDR host fabric adapter
      • 2 x 800 GiB local storage on a solid state disk
      • 4 x NVIDIA Tesla V100 32GB
    • ggpu[02-03]
      • 2 x Intel Cascade Lake Gold 6248 CPUs (40 cores per node)
      • 768 GiB memory
      • 1 x Mellanox Infiniband EDR host fabric adapter
      • 2 x 800 GiB local storage on a solid state disk
      • 4 x NVIDIA Tesla V100S 32GB

Which Frontend to Use in Göttingen

[Figure: frontend and storage overview]

Login and storage are divided between the GPU and CPU parts of the system. This is done to keep file access fast.
Login
There are different login nodes available (bottom of figure). One login node (glogin9.hlrn.de or the DNS alias glogin-gpu.hlrn.de) is intended for orchestrating your GPU computations. Use this one if possible.
Using Storage
When you specify the path in your computations, be aware that there are different /scratch systems available.
Depending on the node from which you access the storage, you will need to specify a different path.

  • There is a GPU-scratch system that is only accessible from the GPU nodes and the corresponding frontend glogin9 (left side of figure). To access it from glogin9 or the Grete nodes, use /scratch.
  • The scratch system for the CPU part is available from all other login nodes and the CPU nodes as /scratch, and from the GPU part of the system as /scratch-emmy (right side of the figure).
  • Try to use the corresponding scratch system, as it gives you the fastest connection for your computations (see the non-dotted lines in the figure).
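
For example, assuming the mount points described above, checking the two scratch systems from the GPU side could look like this sketch:

# On the GPU frontend glogin9 or a Grete node:
ls /scratch          # the GPU scratch system (fast path from the GPU side)
ls /scratch-emmy     # the CPU (Emmy) scratch system, as seen from the GPU side
# On the other (CPU) login nodes, /scratch is the CPU scratch system.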

Getting Access

The nodes can be accessed via the respective partitions. Additionally, you can specify how many GPUs you need with the -G x option, where x is the number of GPUs you want to access (by default, you get access to all of them). If you do not use MPI, please use the -c #cores parameter to select the number of CPU cores you need. Note that on non-shared partitions, your jobs use nodes exclusively rather than sharing them with other jobs. On the partitions where the GPUs are split into slices via MIG, -G x requests x slices. To explicitly request an 80GB GPU, add the option --constraint=80gb to your job script.

Example

The following command gives you access to 32 cores and two A100 GPUs:

srun -p grete:shared --pty -n 1 -c 32 -G A100:2 bash
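
If you additionally want to pin the job to the 80GB A100s via the --constraint=80gb option mentioned above, a sketch of such an interactive allocation (the core and GPU counts are placeholders) would be:

srun -p grete:shared --pty -n 1 -c 16 -G A100:1 --constraint=80gb bash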

If you want to run multiple concurrent programs, each using one GPU, here is an example:

#!/bin/bash
#SBATCH -p grete
#SBATCH -t 12:00:00
#SBATCH -N 1
  
srun --exact -n1 -c 16 -G1 --mem-per-cpu 19075M  ./single-gpu-program &
srun --exact -n1 -c 16 -G1 --mem-per-cpu 19075M  ./single-gpu-program &
srun --exact -n1 -c 16 -G1 --mem-per-cpu 19075M  ./single-gpu-program &
srun --exact -n1 -c 16 -G1 --mem-per-cpu 19075M  ./single-gpu-program &
wait

More explanation for the above example can be found here.

grete and grete:shared

[Figure: Grete node diagram]

Each Grete node has 4 x V100 GPUs with 32 GiB of GPU memory, 4 x A100 GPUs with 40 GiB, or 8 x A100 GPUs with 80 GiB (the nodes ggpu[01-03,201-202] are only included in grete:shared, not grete). It is important to note that with each job submission, you need to request CPU cores as well as GPUs. In these partitions, you can only request whole GPUs.

  • On grete, you automatically block a whole node, getting all 4 GPUs of each node (the 8-GPU nodes are not in this partition). This is best for very large jobs.
  • On grete:shared, you choose how many GPUs you need, and which kind if it matters to you (-G V100:N for N x V100, -G A100:N for N x A100, and -G N for N of any kind). Choosing fewer than all four or eight GPUs means that more than one job can run on a node simultaneously (particularly useful for job arrays). This partition includes the 8-GPU nodes.

It is possible to request more CPU memory than the default of 16 GB. If you wanted 20 GiB of memory, you would use the additional Slurm batch argument --mem=20G.
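
In a job script, this corresponds to the following directive:

#SBATCH --mem=20G                    # request 20 GiB of CPU memory instead of the default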

The example script below requests 2 GPUs with -G A100:2.

#!/bin/bash
#SBATCH --job-name=train-nn-gpu
#SBATCH -t 05:00:00                  # estimated time, adapt to your needs
#SBATCH --mail-user=yourmail@gwdg.de # change this to your mail address
#SBATCH --mail-type=all              # send mail when job begins and ends
 
 
#SBATCH -p grete:shared              # the partition
#SBATCH -G A100:2                    # For requesting 2 GPUs.
 
module load anaconda3
module load cuda
source activate dl-gpu # Or whatever you called your anaconda environment.
 
# Printing out some info.
echo "Submitting job with sbatch from directory: ${SLURM_SUBMIT_DIR}"
echo "Home directory: ${HOME}"
echo "Working directory: $PWD"
echo "Current node: ${SLURM_NODELIST}"
 
# For debugging purposes.
python --version
python -m torch.utils.collect_env
nvcc -V
 
# Run the script:
python -u train.py
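
Assuming you saved the script above as train-nn-gpu.sbatch (the file name is just an example), you would submit it with:

sbatch train-nn-gpu.sbatch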

Interactive Usage and GPU Slices on grete:interactive

Whole A100 GPUs are powerful, but you might not need all that power. For instance, debugging a script until it runs might require a GPU, but not the full power of an A100. To this end, the grete:interactive partition is provided. The idea is that you can come into the office, log into the cluster, and use it for coding. The partition contains V100 GPUs, as well as A100 GPUs split into slices with the NVIDIA Multi-Instance GPU (MIG) technology. Each GPU slice consists of its own computation part (a subset of the streaming multiprocessors) and its own memory.

[Figure: Grete MIG node diagram]

We currently split each A100 GPU into two 2g.10gb slices and one 3g.20gb slice, which means that each slice has 10 GiB or 20 GiB of GPU memory (the second number in the slice specification) and 2 or 3 compute units (the first number in the slice specification). This translates to 3 slices per GPU. Given that these nodes have 4 GPUs each, the A100 nodes have 12 slices each (eight 2g.10gb slices and four 3g.20gb slices in total). The 2g.10gb and 3g.20gb slices have 28 and 42 streaming multiprocessors (SMs), respectively. These splits CANNOT be changed by job scripts.

Instead of requesting a whole GPU with -G A100:1, you request the slices in the NVIDIA format with -G 2g.10gb:1. The following interactive Slurm example requests two 2g.10gb slices:

srun --pty -p grete:interactive  -G 2g.10gb:2 /bin/bash

Or, you can request a whole V100 with the following Slurm example:

srun --pty -p grete:interactive  -G V100:1 /bin/bash

Monitoring

Once you have submitted a job, you can check its status and where it runs with

squeue --me

In Your Script

Many packages that use GPUs provide a way to check the resources that are available. For example, if you do deep learning using PyTorch, you can always check that a GPU is available with torch.cuda.is_available().
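
For example, a quick check from the shell of a running job (assuming PyTorch is installed in your environment):

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"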

Note that in PyTorch, the command torch.cuda.get_device_name() does not work with MIG-split GPUs and will give you an error.

Using nvitop

To monitor your resource usage on the node itself, you can use nvitop.

  1. Check where your job runs with squeue --me
  2. Log into the node. For instance, if your job runs on ggpu146, use ssh ggpu146
  3. On the node, run
    module load nvitop
    nvitop

[Figure: nvitop example output]

In this example output, you can see

  • the GPU compute usage/utilization (top) as UTL,
  • the GPU memory usage (top) as MEM,
  • and your CPU usage (bottom) under your abbreviated user name.

Software and Libraries

Cuda Libraries

To load CUDA, do

module load cuda/VERSION

If you do not specify a VERSION, the default (12.0) will be used.
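
For example, to load one of the versions listed above and verify it:

module load cuda/11.8
nvcc -V    # should report the loaded CUDA release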

Additionally, the anaconda3 Python modules have a TensorFlow environment tf-gpu that is precompiled for GPU usage. You can use it with:

[nimboden@glogin9 ~]$ srun -p grete:shared --pty -G 1 bash
[nimboden@ggpu101 ~]$ module load anaconda3/2020.11
Module for Anaconda3 2020.11 loaded.
[nimboden@ggpu101 ~]$ source $CONDASH
[nimboden@ggpu101 ~]$ conda activate tf-gpu
(tf-gpu) [nimboden@ggpu101 ~]$ python tf_cnn_benchmarks.py [...]

Nvidia HPC SDK (Nvidia compiler)

The full Nvidia HPC SDK 23.3 is installed and usable via the modules nvhpc/23.3, nvhpc-byo-compiler/23.3, nvhpc-hpcx/23.3, and nvhpc-nompi/23.3.

This SDK includes the Nvidia compilers, the HPC-X OpenMPI, and several base libraries for CUDA-based GPU acceleration.

Using MPI and other communication libraries

We have several CUDA-enabled OpenMPI versions available: the HPC-X OpenMPI in the module nvhpc-hpcx/23.3, the regular OpenMPI 3.1.5 from the Nvidia HPC SDK in nvhpc/23.3, and the OpenMPI included in the Nvidia/Mellanox OFED stack, provided by the module openmpi-mofed/4.1.5a1.

Additionally, the libraries NCCL 12.0, NVSHMEM 12.0, and OpenMPI 4.0.5 are available as part of the Nvidia HPC SDK in /sw/compiler/nvidia/hpc_sdk/Linux_x86_64/23.3/comm_libs/.
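
A minimal sketch of building and running an MPI program with the HPC-X OpenMPI module (the source file and program name are placeholders):

module load nvhpc-hpcx/23.3
mpicc -o mpi_gpu_prog mpi_gpu_prog.c    # OpenMPI compiler wrapper from HPC-X
srun ./mpi_gpu_prog                     # run inside a GPU job allocation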

Singularity/Apptainer Containers

The largely compatible Singularity and Apptainer packages (see the Singularity module documentation) support injecting the NVIDIA drivers and the essential libraries and executables into running SIF containers (the format produced by both tools). When you run SIF containers, pass the --nv option. An example running the container FOO.sif would be

#!/bin/bash
#SBATCH -p grete:shared              # the partition
#SBATCH -G A100:1                    # For requesting 1 GPU.
#SBATCH -c 4                         # Requesting 4 CPU cores.
 
module load SINGULARITYORAPPTAINER/SINGAPPVERSION
module load cuda/CUDAVERSION
 
singularity run --nv --bind /scratch FOO.sif

where SINGULARITYORAPPTAINER is singularity for using Singularity or apptainer for using Apptainer, SINGAPPVERSION is the desired version of Singularity/Apptainer, and CUDAVERSION is the specific version of CUDA desired.
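
For instance, with assumed substitutions (check module avail for what is actually installed), the two module lines could read:

module load apptainer      # assumed; or "module load singularity" with the version you want
module load cuda/11.8      # one of the CUDA versions listed earlier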

For more information, see the GPU documentation of these tools.

Using Conda/Mamba

Conda and Mamba are package managers that can make installing and working with various GPU-related packages a lot easier. See Anaconda (conda) and Mamba for information on how to set them up.
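
As a hedged sketch (the environment name, channels, and package versions are only examples), creating a GPU-enabled PyTorch environment could look like:

module load anaconda3
conda create -n dl-gpu -c pytorch -c nvidia pytorch pytorch-cuda=11.8   # example package selection
source activate dl-gpu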

Connecting VS Code to Singularity/Apptainer Containers

See VS Code for information on how to do this.

Training Resources (Step-by-Step)

We have regular courses on GPU usage and deep learning workflows. Our materials are available online for self-study. You will need an HLRN account to follow the practical sessions.