NHR Storage

Diagram of the connections between each NHR node group and the storage systems, with the style of each connection indicating its performance: all frontend nodes (glogin[1-13]) have a very slow connection to PERM; all nodes have a slow-medium connection to HOME, Software, and the Project Map; the Emmy Phase 1 and 2 nodes (glogin[1-8] and g[cfs]nXXXX) have a very fast connection to SCRATCH MDC (formerly known as SCRATCH Emmy); and the Grete and Emmy Phase 3 nodes (glogin[9-13], ggpuXX, ggpuXXX, cXXXX, and cmXXXX) have a very fast connection to SCRATCH RZG (formerly known as SCRATCH Grete) and a medium connection to SCRATCH MDC.

The main shared storage systems (ignoring the local SSDs that many nodes have) are shown in the diagram above, along with the performance of the connection between each group of nodes and the storage.

Note

One of the most important things to keep in mind is that the NHR cluster itself is split between two computing centers, with the tape archive in a third. While they are physically close to each other, the inter-center latency is higher (speed-of-light issues) and the inter-center bandwidth lower (fewer fibers) than for intra-center connections. This is why there are two different SCRATCH/WORK filesystems, one for each computing center. It is usually best to use the one closest to the nodes you are running on.

The two centers are the MDC (Modular Data Center) and the RZG (Rechenzentrum Göttingen). The PERM storage (tape archive) is at the FMZ (Fernmeldezentral). The sites for each sub-cluster are listed in the table below, along with which SCRATCH the symlink /scratch points to:

Sub-cluster   | Site (Computing Center) | Target of /scratch symlink
Emmy Phase 1  | MDC                     | SCRATCH MDC (/scratch-emmy)
Emmy Phase 2  | MDC                     | SCRATCH MDC (/scratch-emmy)
Emmy Phase 3  | RZG                     | SCRATCH RZG (/scratch-grete)
Grete Phase 1 | RZG                     | SCRATCH RZG (/scratch-grete)
Grete Phase 2 | RZG                     | SCRATCH RZG (/scratch-grete)

SCRATCH MDC and SCRATCH RZG used to be known as “SCRATCH Emmy” and “SCRATCH Grete” respectively, because all of Emmy used to be in the MDC and all of Grete in the RZG, which is no longer the case. This historical legacy can still be seen in the names of their mount points.
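
For example, to check which SCRATCH filesystem /scratch points to on the node you are currently logged into, you can resolve the symlink:

# Should print /scratch-emmy on MDC nodes and /scratch-grete on RZG nodes (see the table above)
readlink -f /scratch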

Shared Storage Systems

The shared file systems with their capacities and storage technologies are listed in the table below. Many nodes have local SSDs that are also usable as temporary storage during Slurm jobs.

Storage                                | Capacity                  | Storage Technology
HOME (GPFS)                            | 340 TiB HDD               | GPFS (IBM Spectrum Scale) on HDD exported via NFS
HOME (VAST)                            |                           | VAST on SSD exported by NFS
PERM                                   | growable PiBs of tape     | Tape storage with HDD cache
SCRATCH MDC (formerly “SCRATCH Emmy”)  | 8.4 PiB HDD (110 TiB SSD) | Lustre (HDD data + SSD metadata) and extra SSD pool
SCRATCH RZG (formerly “SCRATCH Grete”) | 110 TiB SSD               | Lustre on SSD
Software (read-only)                   | read-only                 | GPFS (IBM Spectrum Scale) on HDD exported via NFS
Project Map (read-only)                | read-only                 | VAST on SSD exported via NFS

Their performance characteristics are described in the table below.

Storage     | Site | MDC Performance (metadata) | MDC Performance (data) | RZG Performance (metadata) | RZG Performance (data)
HOME        | MDC  | slow-medium                | slow-medium            | slow-medium                | slow-medium
PERM        | FMZ  | very slow                  | very slow              | very slow                  | very slow
SCRATCH MDC | MDC  | very fast                  | medium to fast         | medium to fast             | slow to medium
SCRATCH RZG | RZG  |                            |                        | very fast                  | very fast
Software    | MDC  | slow-medium                | slow-medium            | slow-medium                | slow-medium
Project Map | RZG  | slow-medium                | slow-medium            | slow-medium                | slow-medium
Info

There are two special storage systems that are read-only for users: the Software storage, which contains the shared software (see Software), and the Project Map, which provides a tree of all projects created in or migrated to the Project Portal, with symlinks to each project’s storage directories.

Project Storage

Every project gets a storage directory in one or more of the shared filesystems. The provided storage locations depend on the history of the project.

Note

Note that the actual underlying mount and storage paths can and will change, but symlinks will generally be provided for compatibility so that old paths will still work for a while. For example, it is likely that the emmy and grete parts of any path will be renamed to mdc and rzg respectively.

HLRN and NHR projects created before 2024/Q2 (i.e. before the Project Portal) get one set of storage locations. New projects starting 2024/Q2 (which use the Project Portal) get a different set of storage locations. When the HLRN and NHR projects created before 2024/Q2 are migrated to the Project Portal, they will get the new storage locations just as new projects do, while keeping their existing storage locations for a transition period for compatibility.

Legacy Projects (before Project Portal)

The storage locations for each project can be found at the paths in the table below, where PROJECT is the name of the project (same as its POSIX group and Slurm account). Deprecated storage locations may be retired when replacement storage systems come online or after migration to the Project Portal, but there will be a long warning period to migrate data.

Storage     | Filesystem Path                 | Notes
HOME (GPFS) | /home/projects/PROJECT          | Deprecated
PERM        | /perm/projects/PROJECT          | Path will change
SCRATCH MDC | /scratch-emmy/projects/PROJECT  | Deprecated
SCRATCH RZG | /scratch-grete/projects/PROJECT | Deprecated

New Projects (after Project Portal)

Each project gets a read-only directory in the Project Map storage, which we will refer to as PROJECT_DIR, that contains symlinks to the project’s directories in the other storage systems. If the other storage systems move, these symlinks will be updated. The storage locations for each project can be found at the paths in the table below, where PROJECT is the name of the project (same as its hpcProjectId and Slurm account, and the tail/suffix of its POSIX group) and PARENT_PROJECTS is the full tree of the project’s parent projects (e.g. A/B/C if project A has subproject B, which has subproject C, which has subproject PROJECT). See the Project Map page for more information about the Project Map storage. Deprecated storage locations may be retired when replacement storage systems come online, but there will be a long warning period to migrate data.

Storage                   | Filesystem Path                   | Notes
Project Map (PROJECT_DIR) | /projects/PARENT_PROJECTS/PROJECT | Read-only
HOME (GPFS)               | PROJECT_DIR/dir.project           | Deprecated
PERM                      | Probably PROJECT_DIR/dir.perm     | Coming soon
SCRATCH MDC (HDD)         | PROJECT_DIR/dir.lustre-emmy-hdd   |
SCRATCH MDC (SSD)         | PROJECT_DIR/dir.lustre-emmy-ssd   |
SCRATCH RZG               | PROJECT_DIR/dir.lustre-grete      |
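
As a sketch of how these paths fit together (the parent tree A/B and the project name myproject below are purely illustrative placeholders), you can list a project’s Project Map directory and follow one of its symlinks:

# List the symlinks in the project's read-only Project Map directory
ls -l /projects/A/B/myproject
# Follow the symlink to the project's SCRATCH MDC (HDD) directory
cd /projects/A/B/myproject/dir.lustre-emmy-hdd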

User Storage

Every user gets a storage directory in one or more of the shared filesystems. This includes directories for temporary data (see Temporary Storage for more information).

The provided storage locations depend on how a user account was created. HLRN and NHR user accounts created by the HLRN/NHR user application process get one set of storage locations. User accounts created by the Project Portal (a requirement for all users starting 2024/Q2) get another set of storage locations. Users will steadily be migrated to the new system by getting new accounts on project renewal or when the PI(s) decide that users in their project(s) must migrate, after which there will be a period of time to migrate data before the old account is closed.

Note

Note that the actual underlying paths can and will change, but symlinks will generally be provided for compatibility so that old paths will still work for a while. For example, it is likely that the emmy and grete parts of any path will be renamed to mdc and rzg respectively.

Legacy Users (before Project Portal)

These are user accounts NOT created by the Project Portal. The storage locations and temporary directories for each user can be found at the paths in the table below, where USER is the username and HOME_BASE is one of the base directories for user home directories.

Storage     | Filesystem Path         | Temporary Directory Path | Notes
HOME (GPFS) | HOME_BASE/USER          |                          | Will possibly move
PERM        | /perm/USER              |                          | Path will change
SCRATCH MDC | /scratch-emmy/usr/USER  | /scratch-emmy/tmp/USER   | Deprecated
SCRATCH RZG | /scratch-grete/usr/USER | /scratch-grete/tmp/USER  | Deprecated

Project Portal Users

These are user accounts created by the Project Portal, which will be all users starting 2024/Q2. For accounts created by the Project Portal, there is a primary user account (one’s Academic Cloud account) and a project-specific user account; the two share the same primary POSIX group and login credentials (see Upload SSH Key for how to upload credentials). The storage locations and temporary directories for each user can be found at the paths in the table below, where PRIMARYUSER is the username of the primary account, USER is the username of the project-specific user account, and HOME_BASE is one of the base directories for user home directories.

Storage           | Filesystem Path               | Temporary Directory Path      | Notes
HOME (VAST)       | HOME_BASE/PRIMARYUSER/USER    |                               | Will possibly move
PERM              | To be decided                 |                               | Coming soon
SCRATCH MDC (HDD) | /mnt/lustre-emmy-hdd/usr/USER | /mnt/lustre-emmy-hdd/tmp/USER | Will possibly move
SCRATCH MDC (SSD) | /mnt/lustre-emmy-ssd/usr/USER | /mnt/lustre-emmy-ssd/tmp/USER | Will possibly move
SCRATCH RZG       | /mnt/lustre-grete/usr/USER    | /mnt/lustre-grete/tmp/USER    | Will possibly move
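
To find your own directories without hard-coding any of the paths above, you can rely on the WORK and TMPDIR environment variables described in the SCRATCH/WORK section below (a minimal check; the exact targets depend on your account type):

# Print your SCRATCH user directory and temporary directory
echo "$WORK"
echo "$TMPDIR"
# Confirm they exist and are owned by you
ls -ld "$WORK" "$TMPDIR"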

Data Lifetime after Project/User Expiration

In general, we store all data for an extra year after the end of a project or user account. If not extended, the standard term of a project is 1 year. The standard term for a user account is the lifetime of the project it is a member of (for the primary account, the lifetime of its last project).

HOME

Each user has a HOME directory which is at HOME_BASE/USER for legacy users (accounts not made with the Project Portal) and HOME_BASE/PRIMARYUSER/USER for users whose accounts were made with the Project Portal, where HOME_BASE is one of the base directories for home directories (there is more than one and they may change in the future). The HOME directory is for a user’s

  • configuration files
  • source code
  • self-built software

Currently, projects also get a HOME directory at /home/projects/PROJECT meant for configuration files and software, but this is deprecated and will eventually be retired after a replacement comes into operation.

The HOME storage system has the following characteristics:

  • Optimized for a high number of files
  • Has limited disk space
  • Has backups
  • Has a quota

The HOME filesystems are mounted via NFS and share the underlying storage with a few other filesystems (e.g. Software), so they have slow-medium performance. We take daily snapshots of the filesystem, which can be used to restore a former state of a file or directory. These snapshots can be accessed through the path HOME_BASE/.snapshots. There are additional regular backups to restore the filesystem in case of a catastrophic failure.
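
As an illustrative sketch (the snapshot name and file names below are placeholders, and the exact layout under HOME_BASE depends on your account), an accidentally changed or deleted file can be copied back from one of the daily snapshots:

# List the available snapshots
ls HOME_BASE/.snapshots
# Copy a file from a snapshot back into your HOME directory
cp HOME_BASE/.snapshots/SNAPSHOT/USER/myfile.txt ~/myfile.txt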

SCRATCH/WORK

The Lustre-based SCRATCH filesystems (also known as the WORK filesystems) are meant for active data and are configured for high performance at the expense of robustness (no backups). Each one has the following directories (a short usage sketch follows the list):

  • SCRATCH/usr/USER for each user’s files (the path is stored in the environment variable WORK)
  • SCRATCH/tmp/USER for each user’s temporary files (the default value of the TMPDIR environment variable). See Temporary Storage for more information.
  • SCRATCH/project/PROJECT for each project’s files, shared among all members of the project
  • SCRATCH/JOBTMP/JOB for temporary files for each job (automatically cleaned when a job ends). See Temporary Storage for more information.
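
The sketch below shows how these directories are typically used in a Slurm batch job; it assumes WORK and TMPDIR are set as described above, and the program name, file names, and resource requests are placeholders only:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=01:00:00

# Create a private temporary directory for this job under TMPDIR
JOB_TMP="$TMPDIR/$SLURM_JOB_ID"
mkdir -p "$JOB_TMP"

# Stage input from the WORK directory, run in the temporary directory,
# and copy the results back to WORK before the job ends
cp "$WORK/input.dat" "$JOB_TMP/"
cd "$JOB_TMP"
./my_program input.dat > output.dat
cp output.dat "$WORK/"

# Clean up the temporary files
rm -rf "$JOB_TMP"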

The characteristics of the SCRATCH file systems are

  • Optimized for good performance by the sub-clusters in the same computing center
  • Optimized for high input/output bandwidth by many nodes and jobs at the same time
  • Optimized for a moderate number of files
  • Meant for active data (data with a short lifetime)
  • Has a quota for users and projects (except for SCRATCH RZG at the present time, but this will change in the future)
  • Has NO backups
Warning

The SCRATCH filesystems have NO BACKUPS. Additionally, they are optimized for performance rather than robustness, which makes them quite fragile. This means there is a non-negligible risk of data on them being completely lost if more than one component in the underlying storage fails at the same time.

To get the best IO performance, it is important to use the SCRATCH filesystem in the same computing center as the sub-cluster you are using. A general recommendation for network filesystems is to keep the number of file metadata operations (opening, closing, stat-ing, truncating, etc.) and checks for file existence or changes as low as possible. These operations often become a bottleneck for the IO of your job, and if bad enough can reduce the performance for other users.

The best performance can be reached with sequential IO of large files that is aligned to the full stripe size of the underlying RAID6 (1 MiB), especially on SCRATCH MDC in the HDD pool (HDD for data, SSD for metadata). If you are accessing a large file (1+ GiB) from multiple nodes in parallel, please consider setting the striping of the file with the Lustre command lfs setstripe, using a sensible stripe count (we recommend up to 32) and a stripe size that is a multiple of the RAID6 full stripe size (1 MiB) and matches the IO sizes of your job. This can be done for a specific file or for a whole directory, but the changes apply only to new files, so applying a new striping to an existing file requires copying it. An example of setting the stripe size and count is given below (run man lfs-setstripe for more information about the command).

lfs setstripe --stripe-size 1M --stripe-count 16 PATH
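
You can also set a default striping on a directory so that files created in it afterwards inherit the layout, and check the resulting layout with lfs getstripe (the directory and file names below are placeholders):

# Set a default layout on a directory; new files created in it inherit it
lfs setstripe --stripe-size 1M --stripe-count 16 "$WORK/big_runs"
# Show the layout of a file or directory
lfs getstripe "$WORK/big_runs"
# An existing file only picks up the new layout when it is rewritten, e.g. by copying it into the directory
cp "$WORK/old_data.dat" "$WORK/big_runs/old_data.dat"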

SCRATCH MDC provided a peak performance of around 65 GiB/s streaming bandwidth during its acceptance test. As more processes and nodes try to use it at the same time, the effective (write) streaming bandwidth is reduced.

PERM Tape Archive

The magnetic tape archive provides additional storage for inactive data for short-term archival and to free up space on the HOME and SCRATCH/WORK filesystems. It is only accessible from the frontend nodes. Its capacity grows as more tapes are added. The user directories at the present time are at /perm/USER and the project directories are at /perm/projects/PROJECT. PERM will be moved soon and will then have a different path. Its characteristics are

  • Secure file system location on magnetic tapes
  • Extremely high latency per IO operation, especially for reading data not in the HDD cache (minutes to open a file)
  • Optimized for a small number of large files
  • Short-term archival
Warning

PERM is a SHORT-TERM archive, NOT a long-term archive. Thus,

  • It is not a solution for long-term data archiving.
  • There is no guarantee of 10-year retention as required by the rules for good scientific practice.

For reasons of efficiency and performance, small files and/or complex directory structures should not be transferred to the archive directly. Please aggregate your data into compressed tarballs or other archive containers with a maximum size of 5.5 TiB before copying your data to the archive. For large data, a good target size is 1-2 TiB per file because such files will usually not be split across more than one tape.
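
As an illustrative example (directory and file names are placeholders), data can be bundled into a compressed tarball on a SCRATCH filesystem and then copied to the archive:

# Bundle a finished results directory from WORK into a compressed tarball
tar -czf "$WORK/results_2024.tar.gz" -C "$WORK" results_2024
# Copy the tarball to your PERM directory (legacy user path shown; see the tables above)
cp "$WORK/results_2024.tar.gz" /perm/USER/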