2026, May - Upgrade to Rocky 10 (Poweruser testing phase)
Our upcoming hardware generation (planned to go online around the end of summer 2026) will run Rocky 10. When it is ready for production (or close to it), the rest of the HPC compute and login nodes will also be upgraded to Rocky 10.
This means we will be upgrading from Rocky 8 straight to Rocky 10, skipping Rocky 9, to reduce the number of changes everyone has to handle. We encourage all users to test their workloads to ensure a smooth transition to Rocky 10 and to benefit from the new hardware generation as soon as it is put into operation. In addition, we are grateful to receive feedback - both success stories and any obstacles encountered - via the ticket system (please include “Rocky 10” in the subject of your ticket).
Test Nodes
We have prepared test nodes that are already running Rocky 10. At first, these nodes are open only to powerusers, who can test their codes and provide technical feedback; later we will widen the testing phase to all users. We will send a separate announcement when the test nodes open for regular users, so that everyone has the chance to get their problems (missing essential software, MPI problems, etc.) addressed before the new operating system is rolled out cluster-wide. The test nodes have their own partitions:
| Partition | User groups | Hardware |
|---|---|---|
| standard96:el10 | NHR, SCC | 10 Emmy Phase 2 nodes with SSDs |
| medium96s:el10 | NHR, SCC | 4 Emmy Phase 3 nodes |
| grete:el10 | KISSKI, NHR, SCC | 2 Grete nodes, each having 4 x A100 80 GiB VRAM |
NHR and KISSKI jobs on these partitions will be “free of charge” (will be accounted as 0 core-hours) during the test phase.
These nodes are shared so multiple users can try them out at once. When the time to upgrade the rest of the system draws closer, more test nodes will be made available.
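A minimal sketch of a test job on one of these partitions is shown below. The partition name is taken from the table above; the task count, time limit, and script contents are placeholder assumptions, and your project may need further options (for example an account).

```shell
#!/bin/bash
#SBATCH --partition=standard96:el10   # one of the test partitions listed above
#SBATCH --ntasks=1                    # placeholder resource request
#SBATCH --time=00:10:00               # placeholder time limit

# Confirm that the job landed on a Rocky 10 node before running the real workload.
os_name="unknown"
if [ -r /etc/os-release ]; then
    os_name="$(. /etc/os-release && echo "${PRETTY_NAME:-unknown}")"
fi
echo "Job ${SLURM_JOB_ID:-none} on $(uname -n): ${os_name}"

# Replace this comment with the actual workload, e.g. ./my-computation.sh
```

Saved under a name of your choice (here hypothetically test-el10.sh), the script would be submitted with sbatch test-el10.sh.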
The Emmy P3 login node glogin13.hpc.gwdg.de will be rebooted into Rocky 10 on the 15th of May and will have the convenience alias glogin-el10.hpc.gwdg.de.
It will be removed from the alias glogin-p3.hpc.gwdg.de for the duration of the testing phase.
Please note that the nodes will likely be rebooted quite often, as new software or features are added, and their configurations are fixed and/or adjusted.
Software Changes
Software highlights of the Rocky 10 upgrade:
- System GCC 8 ⇒ 14
- glibc 2.28 ⇒ 2.39
- System Python 3.6 ⇒ 3.12
- Linux kernel (with heavy backports) 4.18 ⇒ 6.12
- OS packages compiled for x86-64-v1 ⇒ x86-64-v3 (AVX2 baseline)
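To compare these versions between a Rocky 8 node and a Rocky 10 test node, a small helper along the following lines can be run on both. This is only a sketch: the function name report_versions is made up for illustration, and tools that are not installed are reported as "not found".

```shell
#!/bin/bash
# Print the versions of the core system toolchain on the current node.
report_versions() {
    echo "kernel: $(uname -r)"
    echo "glibc:  $(getconf GNU_LIBC_VERSION 2>/dev/null || echo 'not found')"
    echo "gcc:    $(gcc -dumpfullversion 2>/dev/null || echo 'not found')"
    echo "python: $(python3 --version 2>/dev/null || echo 'not found')"
}

report_versions
```

Binaries built against the newer glibc will generally not run on the older system, so software that must work on both during the transition is best built on Rocky 8.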
A new software revision is available on the test nodes using the module system. Some of the highlights:
- Updates to many packages
- Main software compiled with
  - GCC 13.4 (GPU nodes)
  - GCC 15.2 (CPU nodes)
  - Intel OneAPI Compilers 2025.3.2 (Intel CPU nodes)
  - AMD Optimizing Compilers (AOCC) 5.1.0 (AMD CPU nodes)
- MPI
  - Intel OneAPI
    - Update 2021.14.0 ⇒ 2021.17.2
  - OpenMPI
    - Update 4.1.7 ⇒ 5.0.10
    - Switch from PSM2 to OPX provider
      - Required by the upcoming hardware generation
      - Better latency and bandwidth for small transfers
      - Slight degradation of bandwidth for large transfers
- Intel OneAPI
- Python 3.13.12
- CUDA 12.9.1 and 13.1.1
FUSE and Mount Namespaces
Slurm jobs will run in their own private mount namespaces.
Each job now has private, initially-empty /tmp, /var/tmp, and /dev/shm.
FUSE is now set up to allow FUSE mounts onto any directory you have write permission to, without the need to first enter a user namespace using unshare.
Note that this does not apply to login nodes. There, you will still have to create user and mount namespaces with unshare -Um.
The nodes have several packages for mounting various things via FUSE ready to go:
- ratarmount for mounting many forms of archives (tar, zip, 7z, SquashFS, bind mounting directories, FAT filesystems, ext4, etc.)
- bindfs for bind-mounting directories to other directories
- erofsfuse for erofs
- fuse2fs for ext2/3/4
- fuse-overlayfs for OverlayFS
- squashfuse for SquashFS
- sshfs
- s3fs
There are two important caveats/issues with FUSE in Slurm jobs you must be careful with:
Note
Do not mount onto directories that are on a network filesystem mounted by NFS (all HOME directories, and the PROJECT directories of NHR and KISSKI), because such mounts cannot be unmounted cleanly with fusermount -u due to root_squash on the NFS.
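One way to guard against this is to check the filesystem type of the intended mount point before mounting. The helper below is a sketch; the function name safe_fuse_mountpoint is made up for illustration, and it relies on stat -f -c %T reporting a type that starts with "nfs" for NFS mounts.

```shell
#!/bin/bash
# Return 0 if DIR is a writable directory that is not on NFS, non-zero otherwise.
safe_fuse_mountpoint() {
    local dir="$1" fstype
    if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
        echo "$dir: not a writable directory" >&2
        return 1
    fi
    # Query the type of the filesystem that holds DIR.
    fstype="$(stat -f -c %T "$dir")" || return 1
    case "$fstype" in
        nfs*)
            echo "$dir: on NFS ($fstype), choose another mount point" >&2
            return 1
            ;;
    esac
    return 0
}
```

In a job script, the mount command would then only run if the check succeeds, for example: safe_fuse_mountpoint /tmp/foo/bar && squashfuse image.sqfs /tmp/foo/bar (image.sqfs being a placeholder).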
Warning
Job termination will hard close FUSE mounts, giving no chance for a clean unmount.
For read-only mounts, this doesn’t matter.
But for read-write mounts, this could lead to corruption of the filesystem you mounted.
Make sure to include the unmount commands at the end of your jobs!
You can use #SBATCH --signal=... and a signal handler/trap to handle unmounting a specified time before the job would terminate due to reaching its walltime limit.
An example job script set up to receive SIGUSR1 (signal 10) 300 seconds before running out of time would be:
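...
# B: sends the signal to the batch shell only
#SBATCH --signal=B:10@300
...
# Do FUSE mounts at /tmp/foo/bar and /tmp/this/that
...
# On SIGUSR1: stop the computation, wait for it to exit, then unmount
trap 'pkill -15 -f my-computation.sh ; wait ; fusermount -u /tmp/foo/bar ; fusermount -u /tmp/this/that' SIGUSR1
# Run in the background and wait: bash only runs a trap once the current foreground command has returned
./my-computation.sh [ARGS] &
wait
# Normal end of the job (these fail harmlessly if the trap has already unmounted)
fusermount -u /tmp/foo/bar
fusermount -u /tmp/this/that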
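The signal-handling pattern can be tried out without Slurm or FUSE. In the sketch below, sleep stands in for the computation, a background subshell stands in for Slurm sending SIGUSR1, and a variable stands in for the unmount commands; all of these are assumptions for illustration only. The work runs in the background and the script waits for it, because bash only runs a trap once the current foreground command has returned.

```shell
#!/bin/bash
cleanup_done=0

# Handler: stop the computation, wait until it has exited, then clean up.
cleanup() {
    kill "$work_pid" 2>/dev/null || true
    wait "$work_pid" 2>/dev/null || true
    cleanup_done=1     # a real job would run its fusermount -u commands here
}
trap cleanup USR1

sleep 30 &             # stand-in for ./my-computation.sh
work_pid=$!

# Stand-in for Slurm: deliver SIGUSR1 to this shell after one second.
( sleep 1; kill -USR1 $$ ) &

# wait returns early when the trapped signal arrives; the handler runs right after.
wait "$work_pid" || true
echo "cleanup_done=${cleanup_done}"
```

If the computation were started in the foreground instead, the handler would only run after the computation had finished, which is too late when the job is about to hit its walltime limit.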