Job runtimes and QoS

Job Walltime

The maximum runtime is set per partition and can be seen either on the system with sinfo or here. There is no minimum walltime (we cannot stop your jobs from finishing, obviously), but a walltime of at least 1 hour is strongly recommended. Our system is optimized for high performance, not high throughput. A large number of small, short jobs induces a lot of overhead: each job runs a prolog to set up the environment and an epilog for cleanup and bookkeeping, which can put a lot of load on the scheduler. The occasional short job is fine, but if you submit large amounts of jobs that finish (or crash) quickly, we might have to intervene and temporarily suspend your account. If you have lots of smaller workloads, please consider combining them into a single job that runs for at least 1 hour (see the sketch below). A tool often recommended to help with issues like this (among other useful features) is Jobber.
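
One way to do this is to bundle many small, independent tasks into a single batch script that works through them one after another. The following is only a minimal sketch, assuming the tasks are wrapped in hypothetical scripts named task_001.sh, task_002.sh, and so on; the partition name and resource requests are placeholders as well:

#!/bin/bash
#SBATCH --partition=medium      # placeholder partition name
#SBATCH --time=02:00:00         # one combined allocation instead of many tiny jobs
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

# run the small workloads back to back inside a single job
for task in task_*.sh; do
    bash "$task"
done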

>48 Hour Jobs & beyond

Most compute partitions have a maximum wall time of 48 hours. Under exceptional circumstances, it is possible to get a time extension for individual jobs (past 48 hours) by writing a ticket during normal business hours. To apply for an extension, please write a ticket to hpc-support@gwdg.de containing the Job ID, the username, the project, and the reason why the extension is necessary. Alternatively, under even more exceptional circumstances and also via mail request (including username, project ID, and reason), permanent access to Slurm Quality-of-Service (QoS) levels can be granted. These permit a longer runtime for jobs but come with additional restrictions on job size (e.g. the number of nodes).
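
Once such a QoS has been granted, it is requested in the jobscript like any other Slurm QoS via the --qos option. A minimal sketch, assuming a hypothetical QoS named long and a 96-hour runtime (the actual QoS name and its limits are communicated when access is granted):

#!/bin/bash
#SBATCH --partition=medium      # placeholder partition name
#SBATCH --qos=long              # hypothetical QoS name granted on request
#SBATCH --time=96:00:00         # runtime beyond the regular 48-hour limit
#SBATCH --nodes=1               # long-running QoS levels may restrict job size

srun ./my_simulation            # placeholder application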

Info

We recommend permanent access to the long-running QoS only as a last resort. We do not guarantee to refund your NPL on the long-running QoS if something fails. Before requesting it, you should exhaust all possibilities to parallelize or speed up your code, or to make it restartable (see below).

Dependent & Restartable Jobs - How to get past the wall time limit

If your simulation is restartable, it might be handy to automatically trigger a follow-up job. Simply provide the ID of the previous job as an additional sbatch argument:

# submit a first job, extract job id
jobid=$(sbatch --parsable job1.sbatch)
 
# submit a second job with dependency: starts only if the previous job terminates successfully
jobid=$(sbatch --parsable --dependency=afterok:$jobid job2.sbatch)
 
# submit a third job with dependency: starts only if the previous job terminates successfully
jobid=$(sbatch --parsable --dependency=afterok:$jobid job3.sbatch)
Note

As soon as a follow-up jobscript (sbatch file) is submitted, you cannot change its content anymore. Lines starting with #SBATCH are evaluated immediately. The remaining content of the jobscript is evaluated as soon as the dependency condition is fulfilled and compute nodes are available. Besides afterok, there are other dependency types (see the sbatch man page).
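
Instead of submitting each follow-up job by hand, the whole chain can be created up front in a small loop. A minimal sketch, assuming a restartable jobscript job.sbatch (hypothetical name) that picks up from its latest checkpoint on every run:

#!/bin/bash
# submit the first job in the chain and remember its ID
jobid=$(sbatch --parsable job.sbatch)

# queue four follow-up jobs, each starting only if the previous one finished successfully
for i in $(seq 4); do
    jobid=$(sbatch --parsable --dependency=afterok:$jobid job.sbatch)
done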