NHR Storage
The main shared storage systems (ignoring the local SSDs that many nodes have) are shown in the diagram above with the performance of the connection between each group of nodes and the storage.
One of the most important things to keep in mind is that the NHR cluster itself is split between two computing centers, with the tape archive in a third. While they are physically close to each other, the inter-center latency is higher (speed-of-light limits) and the inter-center bandwidth lower (fewer fibers) than for intra-center connections. This is why there are two different SCRATCH/WORK filesystems, one for each computing center. It is usually best to use the closest one.
The two centers are the MDC (Modular Data Center) and the RZG (Rechenzentrum Göttingen).
The PERM storage (tape archive) is at the FMZ (Fernmeldezentral).
The sites for each sub-cluster are listed in the table below, along with which SCRATCH filesystem the symlink `/scratch` points to.
Sub-cluster | Site (Computing Center) | Target of `/scratch` symlink |
---|---|---|
Emmy Phase 1 | MDC | SCRATCH MDC (`/scratch-emmy`) |
Emmy Phase 2 | MDC | SCRATCH MDC (`/scratch-emmy`) |
Emmy Phase 3 | RZG | SCRATCH RZG (`/scratch-grete`) |
Grete Phase 1 | RZG | SCRATCH RZG (`/scratch-grete`) |
Grete Phase 2 | RZG | SCRATCH RZG (`/scratch-grete`) |
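Which filesystem `/scratch` resolves to can be checked directly on a node with `readlink -f /scratch`. The sketch below is runnable anywhere: it rebuilds the symlink layout in a temporary directory as a stand-in for the real mounts (the Emmy/MDC layout shown is illustrative):

```shell
# On a cluster node you would simply run:  readlink -f /scratch
# Local stand-in for the layout on an Emmy (MDC) node:
demo=$(mktemp -d)
mkdir "$demo/scratch-emmy"                 # stand-in for SCRATCH MDC
ln -s "$demo/scratch-emmy" "$demo/scratch" # /scratch -> SCRATCH MDC
target=$(readlink "$demo/scratch")
echo "SCRATCH in use: ${target##*/}"
```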
SCRATCH MDC and SCRATCH RZG used to be known as “SCRATCH Emmy” and “SCRATCH Grete” respectively because it used to be that all of Emmy was in the MDC and all of Grete was in the RZG, which is no longer the case. This historical legacy can still be seen in the names of their mount points.
Shared Storage Systems
The shared file systems with their capacities and storage technologies are listed in the table below. Many nodes have local SSDs that are also usable as temporary storage during Slurm jobs.
Storage | Capacity | Storage Technology |
---|---|---|
HOME (GPFS) | 340 TiB HDD | GPFS (IBM Spectrum Scale) on HDD exported via NFS |
HOME (VAST) | | VAST on SSD exported via NFS |
PERM | growable PiBs of tape | Tape storage with HDD cache |
SCRATCH MDC (formerly “SCRATCH Emmy”) | 8.4 PiB HDD (110 TiB SSD) | Lustre (HDD data + SSD metadata) and extra SSD pool |
SCRATCH RZG (formerly “SCRATCH Grete”) | 110 TiB SSD | Lustre on SSD |
Software (read-only) | read-only | GPFS (IBM Spectrum Scale) on HDD exported via NFS |
Project Map (read-only) | read-only | VAST on SSD exported via NFS |
Their performance characteristics are described in the table below.
Storage | Site | MDC Performance (metadata) | MDC Performance (data) | RZG Performance (metadata) | RZG Performance (data) |
---|---|---|---|---|---|
HOME | MDC | slow-medium | slow-medium | slow-medium | slow-medium |
PERM | FMZ | very slow | very slow | very slow | very slow |
SCRATCH MDC | MDC | very fast | medium to fast | medium to fast | slow to medium |
SCRATCH RZG | RZG | | | very fast | very fast |
Software | MDC | slow-medium | slow-medium | slow-medium | slow-medium |
Project Map | RZG | slow-medium | slow-medium | slow-medium | slow-medium |
There are two special storage systems which are read-only for users: the Software storage, which contains the shared software (see Software), and the Project Map, which provides a tree of all projects created by or migrated to the Project Portal, with symlinks to the different storage directories each project has.
Project Storage
Every project gets a storage directory in one or more of the shared filesystems. The provided storage locations depend on the history of the project.
Note that the actual underlying mount and storage paths can and will change, but symlinks will generally be provided for compatibility so that old paths will still work for a while.
For example, it is likely that the `emmy` and `grete` parts of any path will be renamed to `mdc` and `rzg` respectively.
HLRN and NHR projects created before 2024/Q2 (created before the Project Portal) get one set of storage locations. New projects starting 2024/Q2 (which use the Project Portal) get a different set of storage locations. When the HLRN and NHR projects created before 2024/Q2 are migrated to the Project Portal, they will get the new storage locations just as new projects do while keeping their existing storage locations for compatibility for a transition period.
Legacy Projects (before Project Portal)
The storage locations for each project can be found at the paths in the table below, where `PROJECT` is the name of the project (same as its POSIX group and Slurm account).
Deprecated storage locations may be retired when replacement storage systems come on line or after migration to the Project Portal, but this will have a long warning period to migrate data.
Storage | Filesystem Path | Notes |
---|---|---|
HOME (GPFS) | /home/projects/PROJECT | Deprecated |
PERM | /perm/projects/PROJECT | Path will change |
SCRATCH MDC | /scratch-emmy/projects/PROJECT | Deprecated |
SCRATCH RZG | /scratch-grete/projects/PROJECT | Deprecated |
New Projects (after Project Portal)
Each project gets a read-only directory in the Project Map storage, which we will refer to as `PROJECT_DIR`, that contains symlinks to the project’s directories in the other storage systems.
If the other storage systems move, these symlinks will be updated.
The storage locations for each project can be found at the paths in the table below, where `PROJECT` is the name of the project (same as its hpcProjectId and Slurm account, and the tail/suffix of its POSIX group) and `PARENT_PROJECTS` is the full tree of all the projects that are parents of `PROJECT` (e.g. `A/B/C` if project A has subproject B, which has subproject C, which has subproject `PROJECT`).
See the Project Map page for more information about the Project Map storage.
Deprecated storage locations may be retired when replacement storage systems come on line, but this will have a long warning period to migrate data.
Storage | Filesystem Path | Notes |
---|---|---|
Project Map (`PROJECT_DIR`) | /projects/PARENT_PROJECTS/PROJECT | Read-only |
HOME (GPFS) | PROJECT_DIR/dir.project | Deprecated |
PERM | Probably PROJECT_DIR/dir.perm | Coming soon |
SCRATCH MDC (HDD) | PROJECT_DIR/dir.lustre-emmy-hdd | |
SCRATCH MDC (SSD) | PROJECT_DIR/dir.lustre-emmy-ssd | |
SCRATCH RZG | PROJECT_DIR/dir.lustre-grete | |
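Because the Project Map symlinks are updated when storage moves, scripts can address a project’s data through the stable `PROJECT_DIR` path and resolve it only when needed. A runnable sketch, with a hypothetical project name `myproj` and a temporary directory standing in for the real mounts:

```shell
demo=$(mktemp -d)
# Stand-ins for the real storage and the read-only Project Map entry:
mkdir -p "$demo/lustre-emmy-hdd/projects/myproj" "$demo/projects/myproj"
ln -s "$demo/lustre-emmy-hdd/projects/myproj" \
      "$demo/projects/myproj/dir.lustre-emmy-hdd"
# Resolving the symlink yields the current real storage path, so the
# Project Map path keeps working even if the underlying mount moves:
real=$(readlink -f "$demo/projects/myproj/dir.lustre-emmy-hdd")
echo "$real"
```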
User Storage
Every user gets a storage directory in one or more of the shared filesystems. This includes directories for temporary data (see Temporary Storage for more information).
The provided storage locations depend on how a user account was created. HLRN and NHR user accounts created by the HLRN/NHR user application process get one set of storage locations. User accounts created by the Project Portal (a requirement for all users starting 2024/Q2) get another set of storage locations. Users will steadily be migrated to the new system by getting new accounts on project renewal, or when the PI(s) decide that users in their project(s) must migrate, after which there will be a period of time to migrate data before the old account is closed.
Note that the actual underlying paths can and will change, but symlinks will generally be provided for compatibility so that old paths will still work for a while.
For example, it is likely that the `emmy` and `grete` parts of any path will be renamed to `mdc` and `rzg` respectively.
Legacy Users (before Project Portal)
These are user accounts NOT created by the Project Portal.
The storage locations and temporary directories for each user can be found at the paths in the table below, where `USER` is the username and `HOME_BASE` is one of the base directories for user home directories.
Storage | Filesystem Path | Temporary Directory Path | Notes |
---|---|---|---|
HOME (GPFS) | HOME_BASE/USER | | Will possibly move |
PERM | /perm/USER | | Path will change |
SCRATCH MDC | /scratch-emmy/usr/USER | /scratch-emmy/tmp/USER | Deprecated |
SCRATCH RZG | /scratch-grete/usr/USER | /scratch-grete/tmp/USER | Deprecated |
Project Portal Users
These are user accounts created by the Project Portal, which will be all users starting 2024/Q2.
For accounts created by the Project Portal, there is a primary user account (one’s Academic Cloud account) and then a project-specific user account; the two share the same primary POSIX group and login credentials (see Upload SSH Key for how to upload credentials).
The storage locations and temporary directories for each user can be found at the paths in the table below, where `PRIMARYUSER` is the username of the primary account, `USER` is the username of the project-specific user account, and `HOME_BASE` is one of the base directories for user home directories.
Storage | Filesystem Path | Temporary Directory Path | Notes |
---|---|---|---|
HOME (VAST) | HOME_BASE/PRIMARYUSER/USER | | Will possibly move |
PERM | To be decided | | Coming soon |
SCRATCH MDC (HDD) | /mnt/lustre-emmy-hdd/usr/USER | /mnt/lustre-emmy-hdd/tmp/USER | Will possibly move |
SCRATCH MDC (SSD) | /mnt/lustre-emmy-ssd/usr/USER | /mnt/lustre-emmy-ssd/tmp/USER | Will possibly move |
SCRATCH RZG | /mnt/lustre-grete/usr/USER | /mnt/lustre-grete/tmp/USER | Will possibly move |
Data Lifetime after Project/User Expiration
In general, we store all data for an extra year after the end of a project or user account. If not extended, the standard term of a project is 1 year. The standard term for a user account is the lifetime of the project it is a member of (for the primary account, the lifetime of the last project).
HOME
Each user has a HOME directory, which is at `HOME_BASE/USER` for legacy users (accounts not made with the Project Portal) and `HOME_BASE/PRIMARYUSER/USER` for users whose accounts were made with the Project Portal, where `HOME_BASE` is one of the base directories for home directories (there is more than one, and they may change in the future).
The HOME directory is for a user’s
- configuration files
- source code
- self-built software
Currently, projects also get a HOME directory at `/home/projects/PROJECT` meant for configuration files and software, but this is deprecated and will eventually be retired after a replacement comes into operation.
The HOME storage system has the following characteristics:
- Optimized for a high number of files
- Has limited disk space
- Has backups
- Has a quota
The HOME filesystems are mounted via NFS and share the underlying storage with a few other filesystems (e.g. Software), so they have slow-medium performance.
We take daily snapshots of the filesystem, which can be used to restore a former state of a file or directory.
These snapshots can be accessed through the path `HOME_BASE/.snapshots`.
There are additional regular backups to restore the filesystem in case of a catastrophic failure.
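Restoring an accidentally deleted or overwritten file is then a plain copy out of the (read-only) snapshot tree. The snapshot directory name below (`daily-2024-01-01`) and the username are hypothetical stand-ins, and the layout is simulated in a temporary directory so the sketch is runnable:

```shell
home_base=$(mktemp -d)
# Simulated snapshot of user alice's HOME containing notes.txt:
mkdir -p "$home_base/.snapshots/daily-2024-01-01/alice" "$home_base/alice"
echo "old contents" > "$home_base/.snapshots/daily-2024-01-01/alice/notes.txt"
# The live copy was deleted; restore it from the snapshot:
cp "$home_base/.snapshots/daily-2024-01-01/alice/notes.txt" "$home_base/alice/"
cat "$home_base/alice/notes.txt"
```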
SCRATCH/WORK
The Lustre based SCRATCH filesystems (also known as the WORK filesystems) are meant for active data and are configured for high performance at the expense of robustness (no backups). Each one has the following directories:
- `SCRATCH/usr/USER` for each user’s files (stored in the environment variable `WORK`)
- `SCRATCH/tmp/USER` for each user’s temporary files (default value for the `TMPDIR` environment variable). See Temporary Storage for more information.
- `SCRATCH/project/PROJECT` for each project’s files, shared among all members of the project
- `SCRATCH/JOBTMP/JOB` for temporary files for each job (automatically cleaned when a job ends). See Temporary Storage for more information.
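A job script would normally rely on the `WORK` and `TMPDIR` variables rather than hard-coded paths. The sketch below sets stand-in values (on the cluster both are already set, per the list above) so it is runnable anywhere; the file names are illustrative:

```shell
WORK=$(mktemp -d)    # on the cluster: SCRATCH/usr/USER, already set
TMPDIR=$(mktemp -d)  # on the cluster: SCRATCH/tmp/USER, already set
# Short-lived intermediate files go to $TMPDIR ...
echo "intermediate result" > "$TMPDIR/stage1.out"
# ... and anything worth keeping is copied into $WORK before the job ends:
mkdir -p "$WORK/run-001"
cp "$TMPDIR/stage1.out" "$WORK/run-001/"
```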
The characteristics of the SCRATCH file systems are
- Optimized for good performance by the sub-clusters in the same computing center
- Optimized for high input/output bandwidth by many nodes and jobs at the same time
- Optimized for a moderate number of files
- Meant for active data (data with a short lifetime)
- Has a quota for users and projects (except for SCRATCH RZG at the present time, but this will change in the future)
- Has NO backups
The SCRATCH filesystems have NO BACKUPS. Additionally, they are optimized for performance rather than robustness, making them quite fragile. This means there is a non-negligible risk of data on them being completely lost if more than one component in the underlying storage fails at the same time.
To get the best IO performance, it is important to use the SCRATCH filesystem in the same computing center as the sub-cluster you are using. A general recommendation for network filesystems is to keep the number of file metadata operations (opening, closing, stat-ing, truncating, etc.) and checks for file existence or changes as low as possible. These operations often become a bottleneck for the IO of your job, and if bad enough can reduce the performance for other users.
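As an illustration of the metadata advice, compare appending to a file record by record (one open/close pair per record) with writing all records through a single open file; on a network filesystem the second form generates far fewer metadata operations. A minimal runnable sketch:

```shell
out=$(mktemp)
# Anti-pattern: each >> re-opens and closes the file, so every record
# costs a metadata round-trip:
for i in 1 2 3; do echo "record $i" >> "$out.slow"; done
# Better: one open, a stream of writes, one close:
for i in 1 2 3; do echo "record $i"; done > "$out"
wc -l < "$out"
```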
The best performance can be reached with sequential IO of large files that is aligned to the full-stripe size of the underlying RAID6 (1 MiB), especially on SCRATCH MDC in the HDD pool (HDD for data, SSD for metadata).
If you are accessing a large file (1+ GiB) from multiple nodes in parallel, please consider setting the striping of the file with the Lustre command `lfs setstripe`, using a sensible stripe count (up to 32 is recommended) and a stripe size which is a multiple of the RAID6 full-stripe size (1 MiB) and matches the IO sizes of your job. This can be done for a specific file or for a whole directory, but the changes apply only to new files, so applying a new striping to an existing file requires a file copy. An example of setting the stripe size and count is given below (run `man lfs-setstripe` for more information about the command).
lfs setstripe --stripe-size 1M --stripe-count 16 PATH
SCRATCH MDC provided a peak performance of around 65 GiB/s streaming bandwidth during its acceptance test. With a higher number of processes and nodes trying to use it, the effective (write) streaming bandwidth is reduced.
PERM Tape Archive
The magnetic tape archive provides additional storage for inactive data for short-term archival and to free up space on the HOME and SCRATCH/WORK filesystems.
It is only accessible from the frontend nodes.
Its capacity grows as more tapes are added.
The user directories are currently at `/perm/USER` and the project directories at `/perm/projects/PROJECT`. PERM will be moved soon and will then have a different path.
Its characteristics are
- Secure file system location on magnetic tapes
- Extremely high latency per IO operation, especially for reading data not in the HDD cache (minutes to open a file)
- Optimized for a small number of large files
- Short-term archival
PERM is a SHORT-TERM archive, NOT a long-term archive. Thus,
- It is not a solution for long-term data archiving.
- There is no guarantee of 10-year retention as required by the rules for good scientific practice.
For reasons of efficiency and performance, small files and/or complex directory structures should not be transferred to the archive directly. Please aggregate your data into compressed tarballs or other archive containers with a maximum size of 5.5 TiB before copying your data to the archive. For large data, a good target size is 1-2 TiB per file because such files will usually not be split across more than one tape.
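For example, a directory of many small result files can be bundled into one compressed tarball before it is copied to PERM (names are illustrative; on the cluster you would then copy the tarball into your `/perm` directory):

```shell
src=$(mktemp -d)
mkdir -p "$src/results"
for i in 1 2 3; do echo "data $i" > "$src/results/part$i.dat"; done
# One tarball -> one large object on tape instead of many small files:
tar -czf "$src/results.tar.gz" -C "$src" results
tar -tzf "$src/results.tar.gz"   # lists results/ and the three .dat files
```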