GPU Partitions
Nodes in these partitions provide GPUs for parallelizing calculations. See GPU Usage for more details on how to use the GPU partitions, particularly those where GPUs are split into MiG slices.
Islands
The islands with a brief overview of their hardware are listed below.
Island | GPUs | CPUs | Fabric |
---|---|---|---|
Grete Phase 1 | Nvidia V100 | Intel Skylake | Infiniband (100 Gb/s) |
Grete Phase 2 | Nvidia A100 | AMD Zen 3, AMD Zen 2 | Infiniband (2 × 200 Gb/s) |
Grete Phase 3 | Nvidia H100 | Intel Sapphire Rapids | Infiniband (2 × 200 Gb/s) |
SCC Legacy | Nvidia V100, Nvidia Quadro RTX 5000 | Intel Cascade Lake | Omni-Path (2 × 100 Gb/s), Omni-Path (100 Gb/s) |
See Logging In for the best login nodes for each island (other login nodes will often work, but may have access to different storage systems and their hardware will be less of a match).
See Cluster Storage Map for the storage systems accessible from each island and their relative performance characteristics.
See Software Stacks for the available and default software stacks for each island.
Legacy SCC users only have access to the SCC Legacy island unless they are also CIDBN, FG, or SOE users, in which case they also have access to those islands.
Partitions
The partitions are listed in the table below, grouped by which users can use them, without hardware details. See Types of User Accounts to determine which kind of user you are. Note that some users are members of multiple classifications (e.g. all CIDBN/FG/SOE users are also SCC users).
Users | Island | Partition | OS | Shared | Max. walltime | Max. nodes per job | Core-hours per GPU* |
---|---|---|---|---|---|---|---|
NHR | Grete P3 | grete-h100 | Rocky 8 | | 48 hr | 16 | 262.5 |
NHR | Grete P3 | grete-h100:shared | Rocky 8 | yes | 48 hr | 16 | 262.5 |
NHR | Grete P2 | grete | Rocky 8 | | 48 hr | 16 | 150 |
NHR | Grete P2 | grete:shared | Rocky 8 | yes | 48 hr | 1 | 150 |
NHR | Grete P2 | grete:preemptible | Rocky 8 | yes | 48 hr | 1 | 47 per slice |
NHR, KISSKI, REACT | Grete P2 | grete:interactive | Rocky 8 | yes | 48 hr | 1 | 47 per slice |
NHR, KISSKI, REACT | Grete P1 | jupyter:gpu (jupyter) | Rocky 8 | yes | 24 hr | 1 | 47 |
KISSKI | Grete P3 | kisski-h100 | Rocky 8 | | 48 hr | 16 | 262.5 |
KISSKI | Grete P2 | kisski | Rocky 8 | | 48 hr | 16 | 150 |
REACT | Grete P2 | react | Rocky 8 | yes | 48 hr | 16 | 150 |
SCC | Grete P2 & P3 | scc-gpu | Rocky 8 | yes | 48 hr | max | 24 |
SCC | SCC Legacy | jupyter (jupyter) | Rocky 8 | yes | 24 hr | 1 | |
JupyterHub sessions run on the partitions marked with jupyter in the table above.
These partitions are oversubscribed (multiple jobs share resources).
Additionally, the jupyter
partition is composed of both GPU nodes and CPU nodes (CPU nodes are available to more than just SCC users).
The hardware for the different nodes in each partition is listed in the table below. Note that some partitions are heterogeneous, having nodes with different hardware. Additionally, many nodes are in more than one partition.
Partition | Nodes | GPU + slices | VRAM each | CPU | RAM per node | Cores |
---|---|---|---|---|---|---|
grete | 35 | 4 × Nvidia A100 | 40 GiB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
grete | 14 | 4 × Nvidia A100 | 80 GiB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
grete | 2 | 4 × Nvidia A100 | 80 GiB | 2 × Zen3 EPYC 7513 | 1 TiB | 64 |
grete:shared | 35 | 4 × Nvidia A100 | 40 GiB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
grete:shared | 18 | 4 × Nvidia A100 | 80 GiB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
grete:shared | 2 | 4 × Nvidia A100 | 80 GiB | 2 × Zen3 EPYC 7513 | 1 TiB | 64 |
grete:shared | 2 | 8 × Nvidia A100 | 80 GiB | 2 × Zen2 EPYC 7662 | 1 TiB | 128 |
grete:interactive | 3 | 4 × Nvidia A100 (1g.10gb, 1g.20gb, 2g.10gb) | 10/20 GiB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
grete:preemptible | 3 | 4 × Nvidia A100 (1g.10gb, 1g.20gb, 2g.10gb) | 10/20 GiB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
grete-h100 | 5 | 4 × Nvidia H100 | 94 GiB | 2 × Xeon Platinum 8468 | 1 TiB | 96 |
grete-h100:shared | 5 | 4 × Nvidia H100 | 94 GiB | 2 × Xeon Platinum 8468 | 1 TiB | 96 |
kisski | 34 | 4 × Nvidia A100 | 80 GiB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
kisski-h100 | 15 | 4 × Nvidia H100 | 94 GiB | 2 × Xeon Platinum 8468 | 1 TiB | 96 |
react | 22 | 4 × Nvidia A100 | 80 GiB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
scc-gpu | 1 | 4 × Nvidia H100 | 94 GiB | 2 × Xeon Platinum 8468 | 1 TiB | 96 |
scc-gpu | 23 | 4 × Nvidia A100 | 80 GiB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
scc-gpu | 6 | 4 × Nvidia A100 | 80 GiB | 2 × Zen3 EPYC 7513 | 1 TiB | 64 |
scc-gpu | 2 | 4 × Nvidia A100 | 40 GiB | 2 × Zen3 EPYC 7513 | 512 GiB | 64 |
jupyter:gpu | 3 | 4 × Nvidia V100 | 32 GiB | 2 × Skylake Xeon Gold 6148 | 768 GiB | 40 |
jupyter | 2 | 8 × Nvidia V100 | 32 GiB | 2 × Cascadelake Xeon Gold 6252 | 384 GiB | 48 |
jupyter | 5 | 4 × Nvidia Quadro RTX 5000 | 16 GiB | 2 × Cascadelake Xeon Gold 6242 | 192 GiB | 32 |
The memory actually available per node is always less than what is installed in hardware.
Some is reserved by the BIOS, and on top of that, Slurm reserves around 20 GiB for the operating system and background services.
To be on the safe side, if you don’t reserve a full node, deduct ~30 GiB from the installed RAM, divide by the number of GPUs, and round down to get the amount per GPU you can safely request with --mem when submitting jobs.
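For example, a minimal sketch assuming one of the 512 GiB nodes with 4 A100 GPUs from the table above (the job script name is hypothetical):

```bash
# (512 GiB - 30 GiB) / 4 GPUs = 120.5 GiB per GPU, rounded down to 120 GiB.
# A single-GPU job on such a node can therefore safely request:
sbatch -p grete:shared -G A100:1 --mem=120G my_job.sh   # my_job.sh is a placeholder
```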
How to pick the right partition for your job
If you have access to multiple partitions, it can be important to choose one that fits your use case.
As a rule of thumb, if you need (or can scale your job to run on) multiple GPUs and you are not using a shared partition, make sure to always request a multiple of 4 GPUs (i.e. whole nodes), as you will be billed for the whole node regardless of the smaller number of GPUs you requested via -G.
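For example, a minimal sketch of a non-shared job spanning two full Grete P2 nodes (2 × 4 A100 GPUs); the script and program names are hypothetical:

```bash
#!/bin/bash
#SBATCH --partition=grete
#SBATCH --nodes=2
#SBATCH -G A100:8            # 8 GPUs = a multiple of 4, i.e. two whole nodes
#SBATCH --time=24:00:00
srun ./my_training_program   # hypothetical program name
```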
For jobs that need fewer than 4 GPUs, use a shared partition and make sure not to request more than your fair share of RAM (see the note above).
If you need your job to start quickly, e.g. for testing whether your scripts work or for interactive tweaking of hyperparameters, use an interactive partition (most users have access to grete:interactive).
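A minimal sketch of starting an interactive session on one MiG slice for an hour (adjust the slice type and time to your needs):

```bash
# Request one 1g.10gb A100 slice for one hour and open a shell on the node.
srun -p grete:interactive -G 1g.10gb:1 --time=1:00:00 --pty bash
```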
The CPUs and GPUs
For partitions that have heterogeneous hardware, you can give Slurm options to request the particular hardware you want.
For CPUs, you can request the kind of CPU you want by passing a -C/--constraint option to Slurm.
For GPUs, you can specify the name of the GPU when you pass the -G/--gpus option (or --gpus-per-task), and request larger VRAM using a -C/--constraint option.
See Slurm and GPU Usage for more information.
The GPUs, the options to request them, and some of their properties are given in the table below.
GPU | VRAM | FP32 cores | Tensor cores | -G option | -C option | Compute Cap. |
---|---|---|---|---|---|---|
Nvidia H100 | 94 GiB | 8448 | 528 | H100 | 96gb | 90 |
Nvidia A100 | 40 GiB | 6912 | 432 | A100 | | 80 |
Nvidia A100 | 80 GiB | 6912 | 432 | A100 | 80gb | 80 |
1g.10gb slice of Nvidia A100 | 10 GiB | 864 | 54 | 1g.10gb | | 80 |
1g.20gb slice of Nvidia A100 | 20 GiB | 864 | 54 | 1g.20gb | | 80 |
2g.10gb slice of Nvidia A100 | 10 GiB | 1728 | 108 | 2g.10gb | | 80 |
Nvidia V100 | 32 GiB | 5120 | 640 | V100 | | 70 |
Nvidia Quadro RTX 5000 | 16 GiB | 3072 | 384 | RTX5000 | | 75 |
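For example, a hedged sketch combining the -G name with the -C VRAM constraint from the table to get an 80 GiB A100 rather than a 40 GiB one (the job script name is hypothetical):

```bash
# One A100 with 80 GiB of VRAM on the shared partition.
sbatch -p grete:shared -G A100:1 -C 80gb my_job.sh   # my_job.sh is a placeholder
```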
The CPUs, the options to request them, and some of their properties are given in the table below.
CPU | Cores | -C option | Architecture |
---|---|---|---|
AMD Zen3 EPYC 7513 | 32 | zen3 or milan | zen3 |
AMD Zen2 EPYC 7662 | 64 | zen2 or rome | zen2 |
Intel Sapphire Rapids Xeon Platinum 8468 | 48 | sapphirerapids | sapphirerapids |
Intel Cascadelake Xeon Gold 6252 | 24 | cascadelake | cascadelake |
Intel Cascadelake Xeon Gold 6242 | 16 | cascadelake | cascadelake |
Intel Skylake Xeon Gold 6148 | 20 | skylake | skylake_avx512 |
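For example, a sketch using the constraint names from the table above to keep a job on Zen3 nodes only (the job script name is hypothetical):

```bash
# Pin the job to nodes with Zen3 (Milan) CPUs in the heterogeneous scc-gpu partition.
sbatch -p scc-gpu -C zen3 -G A100:4 my_job.sh   # my_job.sh is a placeholder
```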
Hardware Totals
The total nodes, cores, GPUs, RAM, and VRAM for each island are given in the table below.
Island | Nodes | GPUs | VRAM (TiB) | Cores | RAM (TiB) |
---|---|---|---|---|---|
Grete Phase 1 | 3 | 12 | 0.375 | 120 | 2.1 |
Grete Phase 2 | 103 | 420 | 27.1 | 6,720 | 47.6 |
Grete Phase 3 | 21 | 84 | 7.9 | 2,016 | 21 |
SCC Legacy | 7 | 36 | 0.81 | 176 | 1.7 |
TOTAL | 134 | 552 | 36.2 | 9,032 | 72.4 |