Temporary Storage

Temporary storage can be used to reduce the IO load on slower data stores, which improves job performance and reduces the impact on other users. A common workflow is

  1. Copy files that must be read multiple times, particularly in random order, into the temporary storage.
  2. Do the computations, keeping the most IO-intensive operations in the temporary storage (e.g. output files that have to be overwritten many times or written in random order).
  3. Copy the output files that need to be kept from the temporary storage to some data store.

In this workflow, the temporary storage is used as a staging area for the high-intensity IO operations, while the other storage locations only see low-intensity ones (e.g. reading a file from beginning to end once in large chunks).
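A minimal sketch of this workflow as a Slurm batch script might look like the following, using the TMPDIR variable described in the next section (the paths and my_program are placeholders for your own data and application):

```bash
#!/bin/bash
#SBATCH --time=02:00:00
#SBATCH --mem=8G

# 1. Stage in: copy the input that will be read many times (or in random order)
#    into the temporary storage.
cp /path/to/datastore/input.dat "$TMPDIR/"

# 2. Compute: keep the IO-intensive reads and writes inside the temporary directory.
my_program --input "$TMPDIR/input.dat" --output "$TMPDIR/result.dat"

# 3. Stage out: copy only the results that need to be kept back to the data store.
cp "$TMPDIR/result.dat" /path/to/datastore/
```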

Temporary Storage in a Job

Each batch job has several temporary storages available, which are listed in the table below along with their characteristics. A temporary directory is created in each available one, and its path is put into an environment variable for convenient use. These directories are cleaned up (deleted) when the job ends, so there is no need to clean up manually.

| Kind                       | Shared | Performance | Capacity | Environment Variable |
|----------------------------|--------|-------------|----------|----------------------|
| Shared Memory (RAM)        | local  | max         | tiny     | SHM_TMPDIR           |
| Local SSD (not all nodes)  | local  | high        | small    | LOCAL_TMPDIR         |
| SSD SCRATCH/WORK           | global | medium-high | medium   | SHARED_SSD_TMPDIR    |
| HDD SCRATCH/WORK           | global | medium      | large    | SHARED_TMPDIR        |
Info

The Local SSD temporary storage is only available on nodes with their own SSD and is thus not available everywhere. To ensure that your job only gets nodes with an internal SSD, you need to add a constraint to your job: -C local on the SCC and -C ssd everywhere else. See CPU Partitions and GPU Partitions for which partitions have nodes with their own SSD and how many.
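For example, to request only nodes with an internal SSD on a system where the constraint is named ssd (use local instead on the SCC):

```bash
# In the job script:
#SBATCH -C ssd          # on the SCC: #SBATCH -C local

# Or when submitting on the command line:
sbatch -C ssd jobscript.sh
```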

The environment variable TMPDIR will be set to one of these. You can override that choice by setting it yourself to the value of one of them. For example, to use the Local SSD, you would run the following in Bash:

export TMPDIR=$LOCAL_TMPDIR

The local temporary stores have the best performance because all operations stay within the node and don’t have to go over the network, but that inherently means other nodes can’t access them. The global temporary stores are accessible by all nodes in the same job, at the expense of performance. It is possible to use more than one of these in the same job (e.g. local ones for the most IO-intensive files that only one node needs, and a global one for files that must be shared between nodes), as in the sketch below. We recommend choosing the local ones when you can (for files that don’t have to be shared between nodes and aren’t too large).
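A rough sketch of such a mixed setup in a multi-node job is shown below; my_program and its options are placeholders, and the exact layout depends on what your application needs to share:

```bash
# Copy the read-only input to the node-local SSD on every node of the job
# (one copy task per node, so each node gets its own local copy).
srun --ntasks-per-node=1 cp /path/to/datastore/input.dat "$LOCAL_TMPDIR/"

# Heavy per-node IO goes to the node-local SSD, while files that all nodes
# must access are written to the globally visible temporary directory.
srun my_program --input "$LOCAL_TMPDIR/input.dat" \
                --scratch "$LOCAL_TMPDIR" \
                --shared-output "$SHARED_TMPDIR/results"
```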

Shared Memory

The fastest temporary storage is local shared memory (stored under /dev/shm), which is a filesystem in RAM. It has the lowest latency, and bandwidth can exceed 10 GiB/s in many cases. However, it also has the smallest capacity.

The size of the files you create in it counts against the memory requested by your job, so make sure to request enough memory for your shared memory usage (e.g. if you plan on using 10 GiB in it, you need to increase the --mem amount you request from Slurm by 10 GiB).
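For example (the numbers are purely illustrative), if the computation itself needs about 4 GiB of RAM and you plan to place about 10 GiB of files in shared memory, request the sum:

```bash
#SBATCH --mem=14G     # ~4 GiB for the program itself + ~10 GiB of files in $SHM_TMPDIR

# Files created here live in RAM and count against the job's memory request.
cp /path/to/datastore/hot_file.dat "$SHM_TMPDIR/"
```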
