Optimizing IO Performance

It is important to remember that the data stores are shared with other users, and bad IO patterns can hurt not only the performance of your own jobs but also that of other users. A general recommendation for distributed network filesystems is to keep the number of file metadata operations (opening, closing, stat-ing, truncating, etc.) and checks for file existence or changes as low as possible. These operations often become a bottleneck for the IO of your job and, if bad enough, can reduce the performance for other users. For example, if jobs issue hundreds of thousands of metadata operations such as open, close, and stat, this can cause a "slow" (unresponsive) filesystem for everyone, even when the metadata is stored on SSDs.

Therefore, we provide here some general advice to avoid making the data stores unresponsive:

  • Use the local temporary storage of the nodes in your jobs when possible (see Temporary Storage for more information).
  • Write intermediate results and checkpoints as seldom as possible.
  • Try to write/read larger data volumes (>1 MiB) and reduce the number of files concurrently open.
  • For inter-process communication use proper protocols (e.g. MPI) instead of using files to communicate between processes.
  • If you want to control your jobs externally, consider using POSIX signals instead of files that are frequently opened/read/closed by your program. You can send signals to jobs with scancel --signal=SIGNALNAME JOBID (see the sketch after this list).
  • Use MPI-IO to coordinate your I/O instead of each MPI task doing individual POSIX I/O (HDF5 and netCDF may help you with this).
  • Instead of using recursive chmod/chown/chgrp, please use a combination of find (note that Lustre has its own optimized lfs find) and xargs. For example, lfs find /path/to/folder | xargs chgrp PROJECTGROUP creates less stress than chgrp -R PROJECTGROUP /path/to/folder and is much faster.
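
As an illustration of the signal-based job control mentioned above, below is a minimal sketch of a batch script that forwards SIGUSR1 to the running job step. The signal name and the assumption that myexample.bin catches it and writes a checkpoint are hypothetical and must be adapted to your application.

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=01:00:00
#SBATCH --partition=standard96

srun ./myexample.bin &     # run the step in the background so the shell can react to signals
STEP_PID=$!

# Forward SIGUSR1 to srun (which passes it on to the tasks); the application is
# assumed to catch it and write a checkpoint instead of polling a control file.
trap 'kill -USR1 "$STEP_PID"' USR1

wait "$STEP_PID"   # returns early if the trap fires ...
wait "$STEP_PID"   # ... so wait once more for the application to finish

The signal can then be sent from a login node with scancel --signal=USR1 --batch JOBID (to deliver it to the batch shell), or with scancel --signal=USR1 JOBID to send it directly to the running tasks.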

Analysis of Metadata Operations

An existing application can be investigated with respect to metadata operations. Let us assume an example job script for the parallel application myexample.bin with 16 MPI tasks.

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=01:00:00
#SBATCH --partition=standard96

srun ./myexample.bin

The Linux command strace can be used to trace IO operations by prepending it to the command that runs another program. strace then traces that program and creates two files per process (MPI task) with the results. For this example, 32 trace files are created. Large MPI jobs can create a huge number of trace files, e.g. a 128-node job with 128 x 96 MPI tasks created 24576 files. That is why we strongly recommend reducing the number of MPI tasks as far as possible when doing such a performance analysis.

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=01:00:00
#SBATCH --partition=standard96

# -ff: one output file per process, -t: timestamps, -o trace: output prefix, -e: trace only open/openat
srun strace -ff -t -o trace -e open,openat ./myexample.bin

Analysing one trace file shows all file open activity of one process (MPI task).

> ls -l trace.*
-rw-r----- 1 bzfbml bzfbml 21741 Mar 10 13:10 trace.445215
...
> wc -l trace.445215
258 trace.445215
> cat trace.445215
13:10:37 open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
13:10:37 open("/lib64/libfabric.so.1", O_RDONLY|O_CLOEXEC) = 3
...
13:10:38 open("/scratch/usr/bzfbml/mpiio_zxyblock.dat", O_RDWR) = 8

For the interpretation of the trace file you need to differentiate between the calls originating from your code and those that are independent of it (e.g. every shared library your code uses, and their shared libraries and so on, has to be opened at least once). The example code myexample.bin creates only one file, named mpiio_zxyblock.dat. Of the 258 open statements in the trace file, only one comes from the application itself, which indicates very desirable (i.e. low) metadata activity.
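
To quickly separate your application's own file activity from the unavoidable shared-library opens, a simple filter over the trace files is often enough. A minimal sketch, assuming your data files live under /scratch (adjust the path pattern to your own directories):

# Total number of open/openat calls per trace file
grep -c open trace.*

# Only the opens that touch your own data directories
grep /scratch/ trace.*

If the first count is large but the second list is short, most of the metadata activity comes from loading libraries at startup rather than from your own IO.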

Known Issues

Some codes have well-known issues:

  • OpenFOAM: always set runTimeModifiable false and fileHandler collated with a sensible value for purgeWrite and writeInterval (see the OpenFOAM page)
  • NAMD: special considerations during replica-exchange runs (see the NAMD page)

If you have questions or you are unsure regarding your individual scenario, please get in contact with your consultant or start a support request.

Filesystem Specific Tips

Lustre

Best practices for using the Lustre SCRATCH/WORK data stores can be found at https://www.nas.nasa.gov/hecc/support/kb/lustre-best-practices_226.html