GPU Partitions
Nodes in these partitions provide GPUs for parallelizing calculations. See GPU Usage for more details on how to use the GPU partitions, particularly those where GPUs are split into MiG slices.
Islands
The islands with a brief overview of their hardware are listed below.
Island | GPUs | CPUs | Fabric |
---|---|---|---|
Grete Phase 1 | Nvidia V100 | Intel Skylake | Infiniband (100 Gb/s) |
Grete Phase 2 | Nvidia A100 | AMD Zen 3, AMD Zen 2 | Infiniband (2 × 200 Gb/s) |
Grete Phase 3 | Nvidia H100 | Intel Sapphire Rapids | Infiniband (2 × 200 Gb/s) |
SCC Legacy | Nvidia V100, Nvidia Quadro RTX 5000 | Intel Cascade Lake | Omni-Path (2 × 100 Gb/s), Omni-Path (100 Gb/s) |
See Logging In for the best login nodes for each island (other login nodes will often work, but may have access to different storage systems and their hardware will be less of a match).
See Cluster Storage Map for the storage systems accessible from each island and their relative performance characteristics.
See Software Stacks for the available and default software stacks for each island.
Legacy SCC users only have access to the SCC Legacy island unless they are also CIDBN, FG, or SOE users, in which case they also have access to those islands.
Partitions
The partitions are listed in the table below, grouped by which users can use them, without hardware details. See Types of User Accounts to determine which kind of user you are. Note that some users are members of multiple classifications (e.g. all CIDBN/FG/SOE users are also SCC users).
Users | Island | Partition | OS | Shared | Max. walltime | Max. nodes per job | Core-hours per GPU* |
---|---|---|---|---|---|---|---|
NHR | Grete P3 | grete-h100 | Rocky 8 | | 48 hr | 16 | 262.5 |
NHR | Grete P3 | grete-h100:shared | Rocky 8 | yes | 48 hr | 16 | 262.5 |
NHR | Grete P2 | grete | Rocky 8 | | 48 hr | 16 | 150 |
NHR | Grete P2 | grete:shared | Rocky 8 | yes | 48 hr | 1 | 150 |
NHR | Grete P2 | grete:preemptible | Rocky 8 | yes | 48 hr | 1 | 47 per slice |
NHR, KISSKI, REACT | Grete P2 | grete:interactive | Rocky 8 | yes | 48 hr | 1 | 47 per slice |
NHR, KISSKI, REACT | Grete P1 | jupyter:gpu (jupyter) | Rocky 8 | yes | 24 hr | 1 | 47 |
KISSKI | Grete P3 | kisski-h100 | Rocky 8 | | 48 hr | 16 | 262.5 |
KISSKI | Grete P2 | kisski | Rocky 8 | | 48 hr | 16 | 150 |
REACT | Grete P2 | react | Rocky 8 | yes | 48 hr | 16 | 150 |
SCC | Grete P2 & P3 | scc-gpu | Rocky 8 | yes | 48 hr | max | 24 |
SCC | SCC Legacy | jupyter (jupyter) | Rocky 8 | yes | 24 hr | 1 | |
JupyterHub sessions run on the partitions marked with jupyter in the table above.
These partitions are oversubscribed (multiple jobs share resources).
Additionally, the jupyter
partition is composed of both GPU nodes and CPU nodes (CPU nodes are available to more than just SCC users).
The hardware for the different nodes in each partition is listed in the table below. Note that some partitions are heterogeneous, having nodes with different hardware. Additionally, many nodes are in more than one partition.
Partition | Nodes | GPU + slices | VRAM each | CPU | RAM per node | Cores |
---|---|---|---|---|---|---|
grete | 35 | 4 × Nvidia A100 | 40 GiB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
grete | 14 | 4 × Nvidia A100 | 80 GiB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
grete | 2 | 4 × Nvidia A100 | 80 GiB | 2 × Zen3 EPYC 7513 | 1 TiB | 64 |
grete:shared | 35 | 4 × Nvidia A100 | 40 GiB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
grete:shared | 18 | 4 × Nvidia A100 | 80 GiB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
grete:shared | 2 | 4 × Nvidia A100 | 80 GiB | 2 × Zen3 EPYC 7513 | 1 TiB | 64 |
grete:shared | 2 | 8 × Nvidia A100 | 80 GiB | 2 × Zen2 EPYC 7662 | 1 TiB | 128 |
grete:interactive | 3 | 4 × Nvidia A100 (1g.10gb, 1g.20gb, 2g.10gb) | 10/20 GiB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
grete:preemptible | 3 | 4 × Nvidia A100 (1g.10gb, 1g.20gb, 2g.10gb) | 10/20 GiB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
grete-h100 | 5 | 4 × Nvidia H100 | 94 GiB | 2 × Xeon Platinum 8468 | 1 TiB | 96 |
grete-h100:shared | 5 | 4 × Nvidia H100 | 94 GiB | 2 × Xeon Platinum 8468 | 1 TiB | 96 |
kisski | 34 | 4 × Nvidia A100 | 80 GiB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
kisski-h100 | 15 | 4 × Nvidia H100 | 94 GiB | 2 × Xeon Platinum 8468 | 1 TiB | 96 |
react | 22 | 4 × Nvidia A100 | 80 GiB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
scc-gpu | 1 | 4 × Nvidia H100 | 94 GiB | 2 × Xeon Platinum 8468 | 1 TiB | 96 |
scc-gpu | 23 | 4 × Nvidia A100 | 80 GiB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
scc-gpu | 6 | 4 × Nvidia A100 | 80 GiB | 2 × Zen3 EPYC 7513 | 1 TiB | 64 |
scc-gpu | 2 | 4 × Nvidia A100 | 40 GiB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
jupyter:gpu | 3 | 4 × Nvidia V100 | 32 GiB | 2 × Skylake Xeon Gold 6148 | 768 GiB | 40 |
jupyter | 2 | 8 × Nvidia V100 | 32 GiB | 2 × Cascadelake Xeon Gold 6252 | 384 GiB | 48 |
jupyter | 5 | 4 × Nvidia Quadro RTX 5000 | 16 GiB | 2 × Cascadelake Xeon Gold 6242 | 192 GiB | 32 |
The memory actually available per node is always less than what is installed in hardware.
Some is reserved by the BIOS, and on top of that, Slurm reserves around 20 GiB for the operating system and background services.
To be on the safe side, if you don’t reserve a full node, deduct ~30 GiB from the installed RAM, divide by the number of GPUs, and round down to get the amount per GPU you can safely request with --mem when submitting jobs.
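For example, a minimal sketch assuming one of the 512 GiB nodes with 4 A100 GPUs from the table above (the job script name is hypothetical):

```bash
# (512 GiB - 30 GiB) / 4 GPUs = 120.5 GiB per GPU, rounded down to 120 GiB.
# A single-GPU job on such a node can therefore safely request:
sbatch -p grete:shared -G A100:1 --mem=120G my_job.sh   # my_job.sh is a placeholder
```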
How to pick the right partition for your job
If you have access to multiple partitions, it can be important to choose one that fits your use case.
As a rule of thumb, if you need (or can scale your job to run on) multiple GPUs and you are not using a shared partition, make sure to always request a multiple of 4 GPUs (i.e. whole nodes), as you will be billed for the whole node regardless of the smaller number of GPUs you requested via -G.
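For example, a minimal sketch of a non-shared job spanning two full Grete P2 nodes (2 × 4 A100 GPUs); the script and program names are hypothetical:

```bash
#!/bin/bash
#SBATCH --partition=grete
#SBATCH --nodes=2
#SBATCH -G A100:8            # 8 GPUs = a multiple of 4, i.e. two whole nodes
#SBATCH --time=24:00:00
srun ./my_training_program   # hypothetical program name
```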
For jobs that need fewer than 4 GPUs, use a shared partition and make sure not to request more than your fair share of RAM (see the note above).
If you need your job to start quickly, e.g. for testing whether your scripts work or for interactive tweaking of hyperparameters, use an interactive partition (most users have access to grete:interactive).
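A minimal sketch of starting an interactive session on one MiG slice for an hour (adjust the slice type and time to your needs):

```bash
# Request one 1g.10gb A100 slice for one hour and open a shell on the node.
srun -p grete:interactive -G 1g.10gb:1 --time=1:00:00 --pty bash
```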
The CPUs and GPUs
For partitions that have heterogeneous hardware, you can give Slurm options to request the particular hardware you want.
For CPUs, you can request the kind of CPU you want by passing a -C/--constraint option to Slurm.
For GPUs, you can specify the name of the GPU when you pass the -G/--gpus option (or --gpus-per-task), and request larger VRAM using a -C/--constraint option.
See Slurm and GPU Usage for more information.
The GPUs, the options to request them, and some of their properties are given in the table below.
GPU | VRAM | FP32 cores | Tensor cores | -G option | -C option | Compute Cap. |
---|---|---|---|---|---|---|
Nvidia H100 | 94 GiB | 8448 | 528 | H100 | 96gb | 90 |
Nvidia A100 | 40 GiB | 6912 | 432 | A100 | | 80 |
Nvidia A100 | 80 GiB | 6912 | 432 | A100 | 80gb | 80 |
1g.10gb slice of Nvidia A100 | 10 GiB | 864 | 54 | 1g.10gb | | 80 |
1g.20gb slice of Nvidia A100 | 20 GiB | 864 | 54 | 1g.20gb | | 80 |
2g.10gb slice of Nvidia A100 | 10 GiB | 1728 | 108 | 2g.10gb | | 80 |
Nvidia V100 | 32 GiB | 5120 | 640 | V100 | | 70 |
Nvidia Quadro RTX 5000 | 16 GiB | 3072 | 384 | RTX5000 | | 75 |
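For example, a hedged sketch combining the -G name with the -C VRAM constraint from the table to get an 80 GiB A100 rather than a 40 GiB one (the job script name is hypothetical):

```bash
# One A100 with 80 GiB of VRAM on the shared partition.
sbatch -p grete:shared -G A100:1 -C 80gb my_job.sh   # my_job.sh is a placeholder
```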
The CPUs, the options to request them, and some of their properties are given in the table below.
CPU | Cores | -C option | Architecture |
---|---|---|---|
AMD Zen3 EPYC 7513 | 32 | zen3 or milan | zen3 |
AMD Zen2 EPYC 7662 | 64 | zen2 or rome | zen2 |
Intel Sapphire Rapids Xeon Platinum 8468 | 48 | sapphirerapids | sapphirerapids |
Intel Cascadelake Xeon Gold 6252 | 24 | cascadelake | cascadelake |
Intel Cascadelake Xeon Gold 6242 | 16 | cascadelake | cascadelake |
Intel Skylake Xeon Gold 6148 | 20 | skylake | skylake_avx512 |
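For example, a sketch using the constraint names from the table above to keep a job on Zen3 nodes only (the job script name is hypothetical):

```bash
# Pin the job to nodes with Zen3 (Milan) CPUs in the heterogeneous scc-gpu partition.
sbatch -p scc-gpu -C zen3 -G A100:4 my_job.sh   # my_job.sh is a placeholder
```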
Hardware Totals
The total nodes, cores, GPUs, RAM, and VRAM for each island are given in the table below.
Island | Nodes | GPUs | VRAM (TiB) | Cores | RAM (TiB) |
---|---|---|---|---|---|
Grete Phase 1 | 3 | 12 | 0.375 | 120 | 2.1 |
Grete Phase 2 | 103 | 420 | 27.1 | 6,720 | 47.6 |
Grete Phase 3 | 21 | 84 | 7.9 | 2,016 | 21 |
SCC Legacy | 7 | 36 | 0.81 | 176 | 1.7 |
TOTAL | 134 | 552 | 36.2 | 9,032 | 72.4 |