Slurm

This page contains the most important information about the batch system Slurm that you will need to run software. It does not explain every feature Slurm has to offer. For that, please consult the official documentation and the man pages.

Jobs are mainly submitted via the sbatch command using a job script, but interactive jobs and node allocations are also possible using srun or salloc. Resources (e.g. the number of nodes or cores) are selected via command-line parameters or specified in the job script.
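
For example, a batch job script (here with the placeholder name jobscript.sh) is submitted with sbatch, while salloc and srun can be used for interactive work; the resource values below are only illustrative:

sbatch jobscript.sh                      # submit a batch job script
salloc -N 1 -t 01:00:00                  # allocate one node interactively for one hour
srun -N 1 -t 00:30:00 --pty /bin/bash    # start an interactive shell on a compute node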

Partitions

To match your job requirements to the hardware, you can choose among various partitions. Each partition has its own job queue. All available partitions, along with their corresponding walltime limits, core counts, memory, and CPU/GPU types, are listed in Compute node partitions.

Parameters

The most important parameters, their SBATCH flags, and a short description:

  • # nodes (-N <minNodes[,maxNodes]>): Minimum and maximum number of nodes that the job should be executed on. If only one number is specified, it is used as the exact node count.
  • # tasks (-n <tasks>): The number of tasks for this job. The default is one task per node.
  • # tasks per node (--ntasks-per-node=<ntasks>): Number of tasks per node. If both -n and --ntasks-per-node are specified, this option specifies the maximum number of tasks per node. Note that mpirun and srun use different defaults.
  • partition (-p <name>): Specifies the partition the job should run in. Multiple partitions can be given as a comma-separated list. Example: standard96; check Compute Partitions.
  • # CPUs per task (-c <cpus per task>): The number of CPUs per task. The default is one CPU per task.
  • Wall time limit (-t hh:mm:ss): Maximum runtime of the job. If this time is exceeded, the job is killed. Acceptable formats include “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds” (example: 1-12:00:00 requests 1 day and 12 hours).
  • Memory per node (--mem=<size[units]>): Required memory per node. The unit can be one of [K|M|G|T]; the default is M. If a process exceeds the limit, it is killed.
  • Memory per CPU (--mem-per-cpu=<size[units]>): Required memory per CPU instead of per node. --mem and --mem-per-cpu are mutually exclusive.
  • Memory per GPU (--mem-per-gpu=<size[units]>): Required memory per GPU instead of per node. --mem and --mem-per-gpu are mutually exclusive.
  • Mail (--mail-type=ALL): See the sbatch man page for the available types.
  • Project/Account (-A <project>): Specifies the project for NPL accounting. This option is mandatory for users who have access to special hardware and want to use the general partitions.
  • Output file (-o <file>): Store the job output in <file> (otherwise it is written to slurm-<jobid>). %J in the filename stands for the job ID.
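
For illustration, several of these flags can be combined directly on the sbatch command line; the script name and resource values here are just examples:

sbatch -p standard96 -N 2 -t 12:00:00 --mail-type=ALL -o job-%J.out jobscript.sh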

Job Scripts

A job script can be any script that contains special instructions for Slurm at the top. The most commonly used form is a shell script, such as bash or plain sh, but other scripting languages (e.g. Python, Perl, R) are also possible.

#!/bin/bash
 
#SBATCH -p medium
#SBATCH -N 16
#SBATCH -t 06:00:00
 
module load openmpi
srun mybinary

A job script has to start with a shebang line (e.g. #!/bin/bash), followed by the #SBATCH options. These #SBATCH comments have to come before any other commands, as Slurm stops scanning for them after the first non-comment, non-whitespace line (such as an echo, a variable declaration, or the module load line in this example).

Important Slurm Commands

The commands normally used for job control and management are:

  • Job submission:
    sbatch <jobscript>
    srun <arguments> <command>

  • Job status of a specific job:
    squeue -j <jobID> for queued/running jobs
    scontrol show job <jobID> for full job information (even after the job has finished).

  • Job cancellation:
    scancel <jobID>
    scancel -i --me to cancel all your jobs (--me), asking for confirmation for every job (-i)
    scancel -9 to send SIGKILL instead of SIGTERM

  • Job overview:
    squeue --me to show all your jobs. Some useful options are -u <user>, -p <partition>, -j <jobID>. For example, squeue -p standard96 will show all jobs currently running or queued in the standard96 partition.

  • Estimated job start time:
    squeue --start -j <jobID>

  • Workload overview of the whole system:
    sinfo (especially sinfo -p <partition> --format="%25C %A")
    squeue -l
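
Putting these together, a typical job lifecycle might look as follows; the job ID 123456 is just an example of the ID printed by sbatch:

sbatch jobscript.sh           # prints: Submitted batch job 123456
squeue -j 123456              # check whether the job is pending or running
scontrol show job 123456      # show detailed job information
scancel 123456                # cancel the job if necessary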

Job Walltime

It is recommended to always specify a walltime limit for your jobs using the -t or --time parameters. If you don’t set a limit, a default value that is different for each partition is chosen. You can display the default and maximum time limit for a given partition by running:

$ scontrol show partition standard96
[...]
   DefaultTime=2-00:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=300 Hidden=NO
   MaxNodes=256 MaxTime=2-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED

In this example, the default as well as the maximum time limit is 2 days. As a rule of thumb, the shorter your job’s requested runtime, the easier it is to schedule and the less waiting time you will have in the queue (if the partition does not have idle nodes). Since it can be difficult or impossible to predict how long a given workload will actually take, you should add a bit of a buffer so your job is not killed prematurely, but it is still beneficial to request a walltime close to the time your job will actually need. For more information, and for how to increase the priority of short jobs or get access to longer walltimes, see Job runtimes and QoS.
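
For example, if a job is known from experience to finish in about 6 hours, requesting roughly 8 hours leaves a safety buffer without unnecessarily hurting scheduling; the numbers are purely illustrative:

#SBATCH -t 08:00:00   # job usually needs about 6 h; about 2 h buffer against early termination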

Using the Shared Nodes

We provide various partitions in shared mode, so that multiple smaller jobs can run on a single node at the same time. You can request a number of CPUs, GPUs and memory and should take care that you don’t block other users by reserving too much of one resource. For example, when you need all or most of the memory one node offers, but just a few CPU cores, the other cores become effectively unusable for other people. In those cases, please either use an exclusive (non-shared) partition, or request all resources a single node offers (and of course if possible, try to utilize all of them).

The maximum walltime on the shared partitions is 2 days.

This is an example of a job script using 10 cores. As this is not an MPI job, srun/mpirun is not needed.

#!/bin/bash
#SBATCH -p large96:shared
#SBATCH -t 1-0 #one day
#SBATCH -n 10
#SBATCH -N 1

module load python
python postprocessing.py

This job’s memory usage should not exceed 10 * 4096 MiB = 40 GiB. 4096 MiB is the default memory per CPU for the large96:shared partition; you can see this value (which differs between partitions) by running:

$ scontrol show partition large96:shared
[...]
   DefMemPerCPU=4096 MaxMemPerNode=UNLIMITED
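
If the default is not sufficient, the memory request can be stated explicitly. A variant of the 10-core example above, assuming (purely for illustration) that each task needs about 8 GiB:

#!/bin/bash
#SBATCH -p large96:shared
#SBATCH -t 1-0
#SBATCH -N 1
#SBATCH -n 10
#SBATCH --mem-per-cpu=8G   # request 8 GiB per CPU instead of the 4096 MiB default

module load python
python postprocessing.py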

Advanced Options

Slurm offers a lot of options for job allocation, process placement, job dependencies, job arrays, and much more. We cannot cover all of them exhaustively here. As mentioned at the top of the page, please consult the official documentation and the man pages for an in-depth description of all parameters.

Job Arrays

Job arrays are the preferred way to submit many similar jobs, for instance if you need to run the same program on several input files, or run it repeatedly with different settings or parameters. The behavior of your application inside these jobs can be tied to Slurm environment variables, e.g. to tell the program which part of the array it should process. More information can be found here.
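
A minimal sketch of an array job; the program name, the input file naming scheme, the partition and the array range are assumptions chosen for illustration:

#!/bin/bash
#SBATCH -p standard96
#SBATCH -t 01:00:00
#SBATCH --array=0-9

# SLURM_ARRAY_TASK_ID is set by Slurm to the index of the current array task (0 to 9 here)
./myprogram input_${SLURM_ARRAY_TASK_ID}.dat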

Internet Access within Jobs

It’s not recommended to use an internet connection on the compute nodes, but it is possible if required. Access can be enabled by specifying -C inet or --constraint=inet in your Slurm command line or in a batch script.

Interactive example:

srun --pty -p standard96s:test -N 1 -c 1 -C inet /bin/bash
curl www.gwdg.de

Batch script example:

#!/bin/bash
#SBATCH -p standard96s:test
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --constraint=inet

curl www.gwdg.de