GPU Partitions
Nodes in these partitions provide GPUs for parallelizing calculations. See GPU Usage for more details on how to use GPU partitions, particularly those where GPUs are split into MIG slices.
Partitions
The table below lists the partitions by the user groups that can use them, without hardware details. Note that some users belong to multiple groups (e.g. all CIDBN users are also SCC users).
Users | Partition | OS | Shared | Max. walltime | Max. nodes per job | Core-hours per GPU* |
---|---|---|---|---|---|---|
NHR | grete | Rocky 8 | | 48 hr | 16 | 150 |
NHR | grete:shared | Rocky 8 | yes | 48 hr | 1 | 150 |
NHR | grete:preemptible | Rocky 8 | yes | 48 hr | 1 | 47 per slice |
NHR | grete-h100 | Rocky 8 | | 48 hr | 16 | 262.5 |
NHR | grete-h100:shared | Rocky 8 | yes | 48 hr | 16 | 262.5 |
NHR, KISSKI, REACT | grete:interactive | Rocky 8 | yes | 48 hr | 1 | 47 per slice |
KISSKI | kisski | Rocky 8 | | 48 hr | 16 | 150 |
KISSKI | kisski-h100 | Rocky 8 | | 48 hr | 16 | 262.5 |
REACT, SCC | react | Rocky 8 | yes | 48 hr | 16 | 150 |
SCC | scc-gpu | Rocky 8 | yes | 48 hr | max | 24 |
SCC | vis | Rocky 8 | yes | 48 hr | max | 150 |
ALL | jupyter:gpu (jupyter) | Rocky 8 | yes | 24 hr | 1 | 47 |
The partitions you are allowed to use depend on what kind of account you have. See the table at the bottom of this disambiguation page for more information.
JupyterHub sessions run on the partitions marked with jupyter in the table above. These partitions are oversubscribed (multiple jobs share resources).
The hardware for the different nodes in each partition is listed in the table below. Note that some partitions are heterogeneous, having nodes with different hardware. Additionally, many nodes are in more than one partition.
Partition | Nodes | GPU + slices | VRAM each | CPU | RAM per node* | Cores |
---|---|---|---|---|---|---|
grete | 35 | 4 × Nvidia A100 | 40 GB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
grete | 14 | 4 × Nvidia A100 | 80 GB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
grete | 2 | 4 × Nvidia A100 | 80 GB | 2 × Zen3 EPYC 7513 | 1 TiB | 64 |
grete:shared | 35 | 4 × Nvidia A100 | 40 GB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
grete:shared | 18 | 4 × Nvidia A100 | 80 GB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
grete:shared | 2 | 4 × Nvidia A100 | 80 GB | 2 × Zen3 EPYC 7513 | 1 TiB | 64 |
grete:shared | 2 | 8 × Nvidia A100 | 80 GB | 2 × Zen2 EPYC 7662 | 1 TiB | 128 |
grete:interactive | 3 | 4 × Nvidia A100 (2g.10gb and 3g.20gb) | 10/20 GB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
grete:preemptible | 3 | 4 × Nvidia A100 (2g.10gb and 3g.20gb) | 10/20 GB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
grete-h100 | 5 | 4 × Nvidia H100 | 94 GB | 2 × Xeon Platinum 8468 | 1 TiB | 96 |
grete-h100:shared | 5 | 4 × Nvidia H100 | 94 GB | 2 × Xeon Platinum 8468 | 1 TiB | 96 |
kisski | 34 | 4 × Nvidia A100 | 80 GB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
kisski-h100 | 15 | 4 × Nvidia H100 | 94 GB | 2 × Xeon Platinum 8468 | 1 TiB | 96 |
react | 22 | 4 × Nvidia A100 | 80 GB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
scc-gpu | 23 | 4 × Nvidia A100 | 80 GB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
scc-gpu | 6 | 4 × Nvidia A100 | 80 GB | 2 × Zen3 EPYC 7513 | 1 TiB | 64 |
scc-gpu | 2 | 4 × Nvidia A100 | 40 GB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
jupyter:gpu | 3 | 4 × Nvidia V100 | 32 GB | 2 × Skylake Xeon Gold 6148 | 768 GiB | 40 |
gpu-int | 2 | 4 × Nvidia GTX980 | 4 GB | 2 × Broadwell E5-2650v4 | 128 GiB | 24 |
vis | 3 | 4 × Nvidia GTX980 | 4 GB | 2 × Broadwell E5-2650v4 | 128 GiB | 24 |
*) The memory actually available per node is always less than what is installed in hardware.
Some is reserved by the BIOS, and on top of that, Slurm reserves around 20 GiB for the operating system and background services.
To be on the safe side, if you don’t reserve a full node, always deduct ~30 GiB, divide by the number of GPUs and round down to get a number per GPU you can safely request with --mem when submitting jobs.
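The rule of thumb in the note above can be written out as a short calculation. This is a minimal sketch, not an official tool: the function name `safe_mem_per_gpu_gib` is hypothetical, the 30 GiB default is taken from the note, and the example values (512 GiB or 1 TiB per node, 4 or 8 GPUs) come from the hardware table.

```python
def safe_mem_per_gpu_gib(ram_per_node_gib: int, gpus_per_node: int,
                         reserve_gib: int = 30) -> int:
    """Return a conservative per-GPU memory request in GiB.

    Follows the note above: deduct ~30 GiB from the installed RAM,
    divide by the number of GPUs in the node, and round down.
    """
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    usable_gib = ram_per_node_gib - reserve_gib
    if usable_gib <= 0:
        raise ValueError("node has no usable memory after the deduction")
    # Integer division rounds down, as the rule requires.
    return usable_gib // gpus_per_node


if __name__ == "__main__":
    # (installed RAM in GiB, GPUs per node), taken from the hardware table
    for ram_gib, gpus in [(512, 4), (1024, 4), (1024, 8)]:
        print(f"{ram_gib} GiB, {gpus} GPUs -> "
              f"{safe_mem_per_gpu_gib(ram_gib, gpus)} GiB per GPU")
```

For a 512 GiB node with 4 GPUs this yields 120 GiB per GPU, so a single-GPU job on such a node could request, for example, --mem=120G.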
How to pick the right partition for your job
If you have access to multiple partitions, it can be important to choose one that fits your use case.
As a rule of thumb, if you need (or can scale your job to run on) multiple GPUs and you are not using a shared partition, always request a multiple of 4 GPUs, because you are billed for the whole node even if you request fewer GPUs via -G.
For jobs that need fewer than 4 GPUs, use a shared partition and make sure not to request more than your fair share of RAM (see the note above).
If you need your job to start quickly, e.g. to test whether your scripts work or to tweak hyperparameters interactively, use an interactive partition (most users have access to grete:interactive).
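The whole-node billing rule above can be illustrated with a small calculation. This is a sketch under stated assumptions: the helper `billed_gpus` is hypothetical, it assumes 4 GPUs per node (true for most nodes in the hardware table, but not for the 8-GPU nodes), and it assumes that shared partitions bill only the GPUs actually requested.

```python
import math

# Assumption: most nodes in the hardware table have 4 GPUs each.
GPUS_PER_NODE = 4


def billed_gpus(requested_gpus: int, shared: bool,
                gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Estimate how many GPUs a job is billed for.

    On non-shared partitions whole nodes are billed, so the request is
    rounded up to the next multiple of gpus_per_node. On shared
    partitions the request is assumed to be billed as-is.
    """
    if requested_gpus <= 0:
        raise ValueError("requested_gpus must be positive")
    if shared:
        return requested_gpus
    nodes = math.ceil(requested_gpus / gpus_per_node)
    return nodes * gpus_per_node


if __name__ == "__main__":
    for gpus, shared in [(1, False), (6, False), (8, False), (2, True)]:
        kind = "shared" if shared else "non-shared"
        print(f"{gpus} GPU(s) on a {kind} partition -> "
              f"billed for {billed_gpus(gpus, shared)}")
```

Under these assumptions, requesting 6 GPUs on a non-shared partition is billed like 8, which is why scaling the job to a multiple of 4 GPUs avoids paying for idle hardware.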
The CPUs and GPUs
For partitions that have heterogeneous hardware, you can give Slurm options to request the particular hardware you want.
For CPUs, you can specify the kind of CPU you want by passing the -C/--constraint option to Slurm.
For GPUs, you can specify the name of the GPU when you pass the -G/--gpus option (or --gpus-per-task), and request larger VRAM with the -C/--constraint option.
See Slurm and GPU Usage for more information.
The GPUs, the options to request them, and some of their properties are given in the table below.
GPU | VRAM | FP32 cores | Tensor cores | -G option | -C option | Compute Cap. |
---|---|---|---|---|---|---|
Nvidia A100 | 40 GB | 6912 | 432 | A100 | | 80 |
Nvidia A100 | 80 GB | 6912 | 432 | A100 | 80gb | 80 |
Nvidia H100 | 94 GB | 8448 | 528 | H100 | 96gb | 90 |
2g.10gb slice of Nvidia A100 | 10 GB | 1728 | 108 | 2g.10gb | | 80 |
3g.20gb slice of Nvidia A100 | 20 GB | 2592 | 162 | 3g.20gb | | 80 |
Nvidia V100 | 32 GB | 5120 | 640 | V100 | | 70 |
Nvidia Quadro RTX 5000 | 16 GB | 3072 | 384 | RTX5000 | | 75 |
Nvidia GeForce GTX 1080 | 8 GB | 2560 | | GTX1080 | | 61 |
Nvidia GeForce GTX 980 | 4 GB | 2048 | | GTX980 | | 52 |
The CPUs, the options to request them, and some of their properties are given in the table below.
CPU | Cores | -C option | Architecture |
---|---|---|---|
AMD Zen3 EPYC 7513 | 32 | zen3 or milan | zen3 |
AMD Zen2 EPYC 7662 | 64 | zen2 or rome | zen2 |
Intel Sapphire Rapids Xeon Platinum 8468 | 48 | sapphirerapids | sapphirerapids |
Intel Cascadelake Xeon Gold 6252 | 24 | cascadelake | cascadelake |
Intel Cascadelake Xeon Gold 6242 | 16 | cascadelake | cascadelake |
Intel Skylake Xeon Gold 6148 | 20 | skylake | skylake_avx512 |
Intel Broadwell Xeon E5-2650 V4 | 12 | broadwell | broadwell |
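The -G and -C values from the tables above combine into a job request as sketched below. This is an illustration only: the helper `gpu_request_args` and the script name `job.sh` are hypothetical, and the sketch assumes that the GPU name and count are joined as name:count in the -G option. Check the Slurm and GPU Usage pages for the authoritative syntax on this system.

```python
import shlex
from typing import List, Optional


def gpu_request_args(gpu_name: str, count: int,
                     constraint: Optional[str] = None) -> List[str]:
    """Build Slurm options requesting `count` GPUs of type `gpu_name`.

    `gpu_name` is a value from the "-G option" column (e.g. "A100") and
    `constraint` a value from a "-C option" column (e.g. "80gb").
    """
    if count <= 0:
        raise ValueError("count must be positive")
    args = ["-G", f"{gpu_name}:{count}"]
    if constraint:
        args += ["-C", constraint]
    return args


if __name__ == "__main__":
    # Two 80 GB A100s on a shared partition; job.sh is a placeholder script.
    cmd = ["sbatch", "-p", "grete:shared",
           *gpu_request_args("A100", 2, "80gb"), "job.sh"]
    print(shlex.join(cmd))
```

Building the options as a list rather than a single string keeps each value a separate argument, which avoids quoting mistakes if the list is later passed to a process launcher.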
Hardware Totals
The total nodes, cores, GPUs, RAM, and VRAM for each cluster and sub-cluster are given in the table below.
Cluster | Sub-cluster | Nodes | GPUs | VRAM (TiB) | Cores | RAM (TiB) |
---|---|---|---|---|---|---|
NHR | Grete Phase 1 | 3 | 12 | 0.375 | 120 | 2.1 |
NHR | Grete Phase 2 | 103 | 420 | 27.1 | 6,720 | 47.6 |
NHR | Grete Phase 3 | 16 | 64 | 6.0 | 1,536 | 15.7 |
NHR | TOTAL | 122 | 496 | 33.5 | 8,376 | 65.4 |
SCC | TOTAL | 32 | 128 | 2.4 | 2,048 | 19.5 |