Slurm
This page contains the most important information about the batch system Slurm that you will need to run your software. It does not explain every feature Slurm has to offer. For that, please consult the official documentation and the man pages.
Submission of jobs mainly happens via the sbatch command using a job script, but interactive jobs and node allocations are also possible using srun or salloc.
Resource selection (e.g. number of nodes or cores) is handled via command parameters, or may be specified in the job script.
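For illustration only (the partition name, node count, walltime and script name below are placeholders), the same request can be made entirely on the command line or left to the #SBATCH lines inside the script:
# Request 2 nodes in the standard96 partition for one hour directly on the command line ...
sbatch -p standard96 -N 2 -t 01:00:00 myjobscript.sh
# ... or put the equivalent #SBATCH lines into myjobscript.sh and simply run:
sbatch myjobscript.sh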
Partitions
To match your job requirements to the hardware you can choose among various partitions. Each partition has its own job queue. All available partitions and their corresponding walltime, core number, memory, CPU/GPU types are listed in Compute node partitions.
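For a quick overview on the command line, sinfo can list the partitions together with some of their limits; the format string below is just one possible selection of columns:
# Partition name, time limit, node count, CPUs per node and memory per node (in MB)
sinfo -o "%P %l %D %c %m"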
Parameters
Parameter | SBATCH flag | Comment |
---|---|---|
# nodes | -N <minNodes[,maxNodes]> | Minimum and maximum number of nodes that the job should be executed on. If only one number is specified, it is used as the precise node count. |
# tasks | -n <tasks> | The number of tasks for this job. The default is one task per node. |
# tasks per node | --ntasks-per-node=<ntasks> | Number of tasks per node. If both -n and --ntasks-per-node are specified, this option specifies the maximum number of tasks per node. Note that mpirun and srun use different defaults. |
partition | -p <name> | Specifies the partition the job should run in. Multiple partitions can be specified as a comma-separated list. Example: standard96 ; check Compute Partitions. |
# CPUs per task | -c <cpus per task> | The number of CPUs per task. The default is one CPU per task. |
Wall time limit | -t hh:mm:ss | Maximum runtime of the job. If this time is exceeded, the job is killed. Acceptable formats include “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds” (example: 1-12:00:00 requests 1 day and 12 hours). |
Memory per node | --mem=<size[units]> | Required memory per node. The unit can be one of “[K|M|G|T]”; the default is M. If a process exceeds the limit, it will be killed. |
Memory per CPU | --mem-per-cpu=<size[units]> | Required memory per CPU instead of per node. --mem and --mem-per-cpu are mutually exclusive. |
Memory per GPU | --mem-per-gpu=<size[units]> | Required memory per GPU instead of per node. --mem and --mem-per-gpu are mutually exclusive. |
Email notifications | --mail-type=ALL | See the sbatch manpage for different types. |
Project/Account | -A <project> | Specify the project for NPL accounting. This option is mandatory for users who have access to special hardware and want to use the general partitions. |
Output file | -o <file> | Store the job output in file (otherwise it is written to slurm-<jobid> ). %J in the filename stands for the jobid. |
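The following sketch shows how several of these flags can be combined in a job script header; the partition, project name, memory value and binary are placeholders, not a recommendation:
#!/bin/bash
#SBATCH -p standard96        # partition (placeholder)
#SBATCH -N 1                 # one node
#SBATCH -t 0-02:00:00        # 2 hours of walltime
#SBATCH --mem=16G            # memory for the node allocation (placeholder value)
#SBATCH -A myproject         # project/account for accounting (placeholder)
#SBATCH -o job-%J.out        # write output to job-<jobid>.out
#SBATCH --mail-type=ALL      # email notifications for all job state changes
./mybinary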
Job Scripts
A job script can be any script that contains special instructions for Slurm at the top. The most commonly used form is a shell script, such as bash or plain sh, but other scripting languages (e.g. Python, Perl, R) are also possible.
#!/bin/bash
#SBATCH -p medium
#SBATCH -N 16
#SBATCH -t 06:00:00
module load openmpi
srun mybinary
Job scripts have to start with a shebang line (e.g. #!/bin/bash), followed by the #SBATCH options. These #SBATCH comments have to be at the top, as Slurm stops scanning for them after the first non-comment, non-whitespace line (e.g. an echo, a variable declaration, or the module load in this example).
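To make this rule concrete, here is a minimal (deliberately broken) sketch in which the last #SBATCH line is silently ignored because a regular command precedes it; the partition and walltime values are arbitrary:
#!/bin/bash
#SBATCH -p medium        # parsed by Slurm
#SBATCH -N 1             # parsed by Slurm
module load openmpi      # first non-comment line: Slurm stops scanning here
#SBATCH -t 06:00:00      # ignored, this walltime request has no effect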
Important Slurm Commands
The commands normally used for job control and management are:

Job submission:
- sbatch <jobscript>
- srun <arguments> <command>

Job status of a specific job:
- squeue -j <jobID> for queued/running jobs
- scontrol show job <jobID> for full job information (even after the job has finished)

Job cancellation:
- scancel <jobID>
- scancel -i --me cancels all your jobs (--me) but asks for confirmation for every job (-i)
- scancel -9 sends SIGKILL instead of SIGTERM

Job overview:
- squeue --me shows all your jobs. Some useful options are -u <user>, -p <partition>, -j <jobID>. For example, squeue -p standard96 will show all jobs currently running or queued in the standard96 partition.

Estimated job start time:
- squeue --start -j <jobID>

Workload overview of the whole system:
- sinfo (esp. sinfo -p <partition> --format="%25C %A")
- squeue -l
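To illustrate the sinfo format string mentioned above (the partition name is a placeholder): %C prints the CPU counts by state as allocated/idle/other/total, and %A prints the node counts as allocated/idle.
# CPU usage (allocated/idle/other/total) and node usage (allocated/idle) of one partition
sinfo -p standard96 --format="%25C %A"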
Job Walltime
It is recommended to always specify a walltime limit for your jobs using the -t or --time parameters.
If you don’t set a limit, a default value that is different for each partition is chosen.
You can display the default and maximum time limit for a given partition by running:
$ scontrol show partition standard96
[...]
DefaultTime=2-00:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=300 Hidden=NO
MaxNodes=256 MaxTime=2-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
In this example, both the default and the maximum time limit are 2 days. As a rule of thumb, the shorter your job’s requested runtime, the easier it is to schedule and the less time it will wait in the queue (if the partition does not have idle nodes). Since it can be difficult or impossible to predict how long a given workload will actually take to compute, you should always add a bit of a buffer so your job is not killed prematurely, but it is beneficial to request a walltime close to the time your job will actually need. For more information, and for how to increase the priority of short jobs or get access to longer walltimes, see Job runtimes and QoS.
Using the Shared Nodes
We provide various partitions in shared mode, so that multiple smaller jobs can run on a single node at the same time. You can request a number of CPUs, GPUs and memory and should take care that you don’t block other users by reserving too much of one resource. For example, when you need all or most of the memory one node offers, but just a few CPU cores, the other cores become effectively unusable for other people. In those cases, please either use an exclusive (non-shared) partition, or request all resources a single node offers (and of course if possible, try to utilize all of them).
The maximum walltime on the shared partitions is 2 days.
This is an example of a job script using 10 cores. As this is not an MPI job, srun/mpirun is not needed.
#!/bin/bash
#SBATCH -p large96:shared
#SBATCH -t 1-0 #one day
#SBATCH -n 10
#SBATCH -N 1
module load python
python postprocessing.py
This job’s memory usage should not exceed 10 * 4096 MiB = 40 GiB. 4096 MiB is the default memory per CPU for the large96:shared partition. You can see this value (which is different for each partition) by running:
$ scontrol show partition large96:shared
[...]
DefMemPerCPU=4096 MaxMemPerNode=UNLIMITED
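If the default is not enough, the memory can be requested explicitly. The following is only a sketch that assumes the postprocessing job from above needs roughly 8 GiB per core; the partition and script names are taken from the example above:
#!/bin/bash
#SBATCH -p large96:shared
#SBATCH -t 1-0
#SBATCH -n 10
#SBATCH -N 1
#SBATCH --mem-per-cpu=8G   # raise the per-CPU memory from the 4096 MiB default to 8 GiB
module load python
python postprocessing.py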
Advanced Options
Slurm offers a lot of options for job allocation, process placement, job dependencies, job arrays, and much more. We cannot exhaustively cover all of these topics here. As mentioned at the top of the page, please consult the official documentation and the man pages for an in-depth description of all parameters.
Job Arrays
Job arrays are the preferred way to submit many similar jobs, for instance if you need to run the same program on several input files, or run it repeatedly with different settings or parameters. The behavior of your application inside these jobs can be tied to Slurm environment variables, e.g. to tell the program which part of the array it should process. More information can be found here.
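A minimal sketch of a job array script, assuming input files named input_0.dat through input_9.dat and a program mybinary (both placeholders); Slurm sets SLURM_ARRAY_TASK_ID to the index of the current array task:
#!/bin/bash
#SBATCH -p standard96
#SBATCH -N 1
#SBATCH -t 01:00:00
#SBATCH --array=0-9    # run 10 array tasks with indices 0..9
# Each array task processes the input file matching its index.
./mybinary input_${SLURM_ARRAY_TASK_ID}.dat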
Internet Access within Jobs
It’s not recommended to use an internet connection on the compute nodes, but it is possible if required.
Access can be enabled by specifying -C inet or --constraint=inet in your Slurm command line or in a batch script.
srun --pty -p standard96s:test -N 1 -c 1 -C inet /bin/bash
curl www.gwdg.de
The same can be done in a batch script:
#!/bin/bash
#SBATCH -p standard96s:test
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --constraint=inet
curl www.gwdg.de