Slurm
This page contains the most important information about the batch system Slurm that you will need to run software. It does not cover every feature Slurm has to offer. For that, please consult the official documentation and the man pages.
Submission of jobs mainly happens via the sbatch command using a job script, but interactive jobs and node allocations are also possible using srun or salloc. Resource selection (e.g. number of nodes or cores) is handled via command parameters, or may be specified in the job script.
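For example, a batch job can be submitted with sbatch, while srun and salloc give you interactive access. The following is a minimal sketch; the script name, partition, and time limits are placeholder values:
# submit a batch job described by a job script
sbatch myjob.sbatch
# run an interactive shell on one node for 30 minutes
srun -p standard96 -N 1 -t 00:30:00 --pty bash
# allocate two nodes for one hour and work inside the allocation
salloc -p standard96 -N 2 -t 01:00:00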
Partitions
To match your job requirements to the hardware, you can choose among various partitions. Each partition has its own job queue. All available partitions and their corresponding walltime, core number, memory, and CPU/GPU types are listed in Compute node partitions.
Parameters
Parameter | SBATCH flag | Comment |
---|---|---|
# nodes | -N <#> | |
# tasks | -n <#> | |
# tasks per node | --tasks-per-node <#> | Different defaults between mpirun and srun |
partition | -p <name> | example: standard96; check Compute Partitions |
# CPUs per task | -c <#> | interesting for OpenMP/hybrid jobs |
Wall time limit | -t hh:mm:ss | |
Mail notifications | --mail-type=ALL | See the sbatch man page for the different types |
Project/Account | -A <project> | Specify the project for NPL accounting |
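Put together, a job script header using these flags could look like the following sketch (partition, node count, time limit, and account name are placeholder values to adapt to your project):
#!/bin/bash
#SBATCH -p standard96           # partition
#SBATCH -N 2                    # number of nodes
#SBATCH --tasks-per-node 96     # tasks per node
#SBATCH -c 1                    # CPUs per task
#SBATCH -t 12:00:00             # wall time limit
#SBATCH --mail-type=ALL         # mail notifications
#SBATCH -A myproject            # placeholder project/account for NPL accounting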
Job Scripts
A job script can be any script that contains special instructions for Slurm. The most commonly used form is a shell script, such as bash or plain sh, but other scripting languages (e.g. Python, Perl, R) are also possible.
#!/bin/bash
#SBATCH -p medium40
#SBATCH -N 16
#SBATCH -t 06:00:00
module load impi
srun mybinary
A job script has to start with a shebang line, followed by the #SBATCH options. These #SBATCH comments have to be at the top, as Slurm stops scanning for them after the first non-comment, non-whitespace line (e.g. an echo or a variable declaration).
More examples can be found in the How To Use section.
Important Slurm commands
The commands normally used for job control and management are:
Job submission:
- sbatch <jobscript>
- srun (interactive jobs)
Job status of a specific job:
- squeue -j jobID for queued/running jobs
- scontrol show job jobID for full job information (even after the job has finished)
Job cancellation:
- scancel jobID
- scancel -i -u $USER cancels all your jobs (-u $USER) but asks for confirmation for every job (-i)
- scancel -9 sends SIGKILL instead of SIGTERM
Job overview:
- squeue -l --me
Job start (estimated):
- squeue --start -j jobID
Workload overview of the whole system:
- sinfo (esp. sinfo --format="%25C %A"), squeue -l
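A typical sequence of these commands might look like the following sketch (job.sbatch is a placeholder job script):
jobid=$(sbatch --parsable job.sbatch)   # submit and capture the job ID
squeue -j $jobid                        # queue status of the job
squeue --start -j $jobid                # estimated start time
scontrol show job $jobid                # full information, also after the job finished
scancel $jobid                          # cancel the job if necessary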
Using the Shared Nodes
We provide a varying number of nodes from the large40 and large96 partitions as post-processing nodes in a shared mode, so that multiple jobs can run at once on a single node. You can request CPUs and memory and should take care that you do not exceed your limits. For each CPU/hyperthread, there is about 9.6 GB of memory on the large40:shared partition and about 4 GB on the large96:shared partition.
The maximum walltime on the shared partitions is 2 days.
This is an example of a job script using 10 cores. As this is not an MPI job, srun/mpirun is not needed. This job's memory usage should not exceed 10 * 4096 MB = 40960 MB.
#!/bin/bash
#SBATCH -p large96:shared
#SBATCH -t 1-0 #one day
#SBATCH -n 10
#SBATCH -N 1
python postprocessing.py
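If your post-processing program is multithreaded rather than MPI-parallel, a sketch along the same lines could request the cores via -c and make the memory limit explicit with --mem (the value below simply restates the 10 * 4096 MB calculated above):
#!/bin/bash
#SBATCH -p large96:shared
#SBATCH -t 1-0 #one day
#SBATCH -n 1
#SBATCH -c 10          # 10 CPUs for one multithreaded process
#SBATCH --mem=40960M   # stay within 10 * 4096 MB
python postprocessing.py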
Job Walltime
The maximum runtime is set per partition and can be viewed either on the system with sinfo or here. There is no minimum walltime (we cannot stop your jobs from finishing, obviously), but a walltime of at least 1 hour is encouraged. A large number of small, short jobs can cause problems with our accounting system. The occasional short job is fine, but if you submit large amounts of jobs that finish (or crash) quickly, we might have to intervene and temporarily suspend your account. If you have lots of small workloads, please consider combining them into a single job that runs for at least 1 hour, as sketched below.
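For example, many short tasks can be bundled into a single job by looping over them inside the job script. This is only a minimal sketch; the partition, the input files, and the program name are placeholders:
#!/bin/bash
#SBATCH -p standard96
#SBATCH -N 1
#SBATCH -t 02:00:00
# run many short tasks one after another within a single job
for input in task_*.cfg; do
    ./short_computation "$input"
done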
Advanced Options
Slurm offers a lot of options for job allocation, process placement, job dependencies, arrays, and much more. We cannot exhaustively cover all topics here. As mentioned at the top of the page, please consult the official documentation and the man pages for an in-depth description of all parameters.
>48 Hour Jobs & beyond
Most Compute node partitions have a maximum wall time of 48 hours. Under exceptional circumstances, it is possible to get a time extension for individual jobs (past 48 hours) by writing a ticket during normal business hours. To apply for an extension, please write a ticket to hpc-support@gwdg.de, containing the Job ID, the username, the project and the reason why the extension is necessary. Alternatively - under even more exceptional circumstances and also via mail request, including username, project ID and reason - permanent access to Slurm Quality-Of-Service (QoS) levels can be granted, which permit a longer runtime for jobs but have additional restrictions regarding job size (e.g. number of nodes).
However, we recommend permanent access to the long-running QoS only as a last resort. We do not guarantee to refund your NPL on the long-running QoS if something fails. Before requesting it, you should exploit all possibilities to parallelize/speed up your code or make it restartable (see below).
Dependent & Restartable Jobs - How to pass the wall time limit
If your simulation is restartable, it might be handy to automatically trigger a follow-up job. Simply provide the ID of the previous job as an additional sbatch argument:
# submit a first job, extract job id
jobid=$(sbatch --parsable job1.sbatch)
# submit a second job with dependency: starts only if the previous job terminates successfully
jobid=$(sbatch --parsable --dependency=afterok:$jobid job2.sbatch)
# submit a third job with dependency: starts only if the previous job terminates successfully
jobid=$(sbatch --parsable --dependency=afterok:$jobid job3.sbatch)
squeue -l -u $USER will mark all your dependent jobs with “(Dependency)” in the column “NODELIST(REASON)”.
Please note: as soon as a follow-up job script (sbatch file) is submitted, you cannot change its content any more. Lines starting with #SBATCH are evaluated immediately. The remaining content of the job script is evaluated as soon as the dependency condition is fulfilled and compute nodes are available. Besides afterok, there are other dependency types (see the sbatch man page).
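If a simulation needs more restarts than you want to submit by hand, a whole chain of dependent jobs can be submitted in a loop. The sketch below uses afterany (the follow-up starts once the previous job has ended, regardless of its exit status) and assumes a restartable job script named job.sbatch:
# submit a chain of 5 restartable jobs, each one waiting for its predecessor
jobid=$(sbatch --parsable job.sbatch)
for i in $(seq 2 5); do
    jobid=$(sbatch --parsable --dependency=afterany:$jobid job.sbatch)
done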
Job Arrays
Job arrays are the preferred way to submit many similar jobs, for instance if you need to run the same program on a number of input files, with different settings, or with a range of parameters. Arrays are created with the -a start-finish sbatch parameter. E.g. sbatch -a 0-19 will create 20 jobs, indexed from 0 to 19. There are different ways to index the arrays, which are described below.
The behavior of the jobs can then be tied to Slurm environment variables, which tell each job which part of the array it is.
Job Array Indexing, Stepsize and more
Slurm supports a number of ways to set up the indexing in job arrays.
- Range: -a 0-5
- Multiple values: -a 1,5,12
- Step size: -a 0-5:2 (same as -a 0,2,4)
- Combined: -a 0-5:2,20 (same as -a 0,2,4,20)
Additionally, you can limit the number of simultaneously running jobs by appending %<limit>:
- -a 0-11%4 only four jobs at once
- -a 0-11%1 run all jobs sequentially
- -a 0-5:2,20%2 everything combined. Run IDs 0,2,4,20, but only two at a time.
You can read everything on array indexing in the sbatch man page.
Slurm Array Environment Variables
The most commonly used environment variable in Slurm arrays is $SLURM_ARRAY_TASK_ID. It contains the index of the job in the array and is different in every job of the array. Other variables are:
- SLURM_ARRAY_TASK_COUNT: total number of tasks in the array
- SLURM_ARRAY_TASK_ID: job array ID (index) number
- SLURM_ARRAY_TASK_MAX: job array's maximum ID (index) number
- SLURM_ARRAY_TASK_MIN: job array's minimum ID (index) number
- SLURM_ARRAY_TASK_STEP: job array's index step size
- SLURM_ARRAY_JOB_ID: job array's master job ID number
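These variables can be used directly inside a job script, for example to hand each array task one line of a parameter file. This is a minimal sketch; params.txt and my_program are placeholders:
#!/bin/bash
#SBATCH -p standard96
#SBATCH -t 01:00:00
#SBATCH -a 1-10
# pick the line of params.txt that corresponds to this array task
param=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)
./my_program "$param"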
Example Job Array
This is an example of a job array that creates a job for every file ending in ".inp" in the current working directory.
#!/bin/bash
#SBATCH -p standard96
#SBATCH -t 12:00:00 #twelve hours
#SBATCH -N 16
#SBATCH --tasks-per-node 96
# replace X below with the number of .inp files you have minus 1 (since bash arrays start counting from 0)
# ls *.inp | wc -l
#SBATCH -a 0-X
#SBATCH -o arrayjob-%A_%a #"%A" is replaced by the job ID and "%a" with the array index.
#for safety reasons
shopt -s nullglob
#create a bash array with all files
arr=(./*.inp)
#put your command here. This just runs the fictional "big_computation" program with one of the files as input
./big_computation ${arr[$SLURM_ARRAY_TASK_ID]}
In this case, you have to get the number of files beforehand (to fill in the X). You can also do this automatically by removing the #SBATCH -a line and adding that information when submitting the job:
sbatch -a 0-$(($(ls ./*.inp | wc -l)-1)) jobarray.sh
The part in the parentheses uses ls to list all .inp files, counts them with wc, and then subtracts 1, since bash arrays start counting at 0.