How scheduling works

General

cf: https://slurm.schedmd.com/priority_multifactor.html

On the GWDG HPC cluster, we use Slurm’s multifactor priority plugin, which most HPC clusters use, to calculate job priorities for scheduling. The alternative Slurm offers is the so-called “priority/basic” plugin, which does simple FIFO scheduling.

However, the priority is not the most important factor by which the scheduler decides which job to look at next. Instead, it considers jobs in the following order (a sketch of this ordering follows the list):

  1. Jobs that can preempt other jobs
    • Jobs that can preempt other jobs are considered first for scheduling. Job preemption means that certain jobs can be killed if other jobs need their resources. This is not relevant for us, since job preemption is disabled on the GWDG HPC cluster.
  2. Jobs with a reservation
    • Jobs in a reservation are scheduled before all others, since their resources can’t be used by other jobs anyway.
  3. Jobs on partitions with a higher priority tier
    • If the partition of the job has a higher priority tier, the job is considered before others. On the GWDG HPC cluster, the *:test partitions have a priority tier of 100, the partition large96 has a priority tier of 20, and all other partitions have a priority tier of 1; thus, jobs submitted to the higher-tier partitions are considered before all others.
  4. Job priority
    • At this point, the next job considered is the one with the highest priority (see below). Crucially, this means that the job with the highest priority is not necessarily the next job considered for scheduling if jobs from categories (1-3) are present.
  5. Job submit time
    • If two jobs have the same priority, their submit time is used to order them. The older job will be considered first.
  6. Job ID
    • If two jobs have the same priority and the exact same submit time, Slurm orders them by their job ID, which is guaranteed to be unique. The lower job ID will be considered first.
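
This ordering can be pictured as a simple sort over all pending jobs. The following Python sketch only illustrates the rules above; it is not Slurm code, and the job fields are made up for the example.

```python
from dataclasses import dataclass

@dataclass
class PendingJob:
    job_id: int            # unique and ever-increasing
    submit_time: float     # Unix timestamp of submission
    priority: int          # multifactor priority (see "Priorities" below)
    partition_tier: int    # e.g. 100 for *:test, 20 for large96, 1 otherwise
    has_reservation: bool  # job runs inside a reservation
    can_preempt: bool      # always False here, preemption is disabled

def scheduling_order(job: PendingJob):
    """Sort key reproducing the order described above.
    Python sorts ascending, so booleans are negated and priorities
    are sign-flipped to put the "better" jobs first."""
    return (
        not job.can_preempt,      # 1. jobs that can preempt others
        not job.has_reservation,  # 2. jobs with a reservation
        -job.partition_tier,      # 3. higher partition priority tier
        -job.priority,            # 4. higher job priority
        job.submit_time,          # 5. earlier submit time
        job.job_id,               # 6. lower job ID as the final tie-breaker
    )

# queue = sorted(pending_jobs, key=scheduling_order)
```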

The three scheduling loops:

cf: https://slurm.schedmd.com/sched_config.html

Slurm has three individual scheduling loops:

  • Direct scheduling: Runs on job submission and only considers up to default_queue_depth=500 jobs
    • Only runs for srun/salloc jobs, since scheduling of batch jobs at submit time is deferred (defer_batch)
  • Main scheduling loop: Runs periodically; walks the queue in priority order and schedules jobs until the first one that has to stay pending, considering at most partition_job_depth=100 jobs per partition
  • Backfill scheduling loop: More comprehensive; fills the gaps between larger jobs with smaller ones and also runs periodically (bf_interval=45 seconds); the basic idea is sketched after this list
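
The core idea of backfill is that a lower-priority job may start right away if it fits into the currently idle resources and is guaranteed to finish before the highest-priority, currently blocked job is expected to start. The code below is a toy model, not Slurm’s implementation; it reduces resources to a single node count and uses made-up job fields.

```python
import time
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    nodes: int       # number of nodes requested
    time_limit: int  # requested walltime in seconds

def backfill(pending, free_nodes, blocked_job_start):
    """Start lower-priority jobs in the gap before the blocked top-priority
    job, as long as they fit into the free nodes and end early enough."""
    now = time.time()
    started = []
    for job in pending:  # pending jobs, already sorted by priority
        fits_now = job.nodes <= free_nodes
        ends_in_time = now + job.time_limit <= blocked_job_start
        if fits_now and ends_in_time:
            started.append(job)
            free_nodes -= job.nodes
    return started

# Example: 4 nodes are idle for the next 2 hours until a large job starts.
gap_end = time.time() + 2 * 3600
small_jobs = [Job("a", 2, 3600), Job("b", 4, 7200), Job("c", 1, 10800)]
print([j.name for j in backfill(small_jobs, 4, gap_end)])
# -> ['a']  (after "a", "b" no longer fits; "c" would run past the gap)
```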

Since Slurm’s scheduling loops are very resource intensive and hold mutexes at various points in the code, Slurm is unable to answer pending RPCs (Remote Procedure Calls to the Slurm controller, such as sinfo / squeue, job step creation, etc.) while it is scheduling. To keep the system responsive, the scheduling loops have periodic time-outs.

  • Main scheduling loop: (max_sched_time) 8 seconds
  • Backfill scheduling: (bf_max_time) 75 seconds

Furthermore, the scheduling loops are interrupted if too many RPCs are pending (max_rpc_cnt=50).

Since backfill scheduling, and multifactor scheduling in general, is an NP-hard problem, the Slurm scheduler can only approximate a solution. To avoid running indefinitely, the backfill scheduler applies several limits in each iteration (their combined effect on which jobs even get tested is sketched after the list):

  • Maximum runtime of the backfill scheduler: (bf_max_time) 75 seconds
  • How far the backfill scheduler plans into the future: (bf_window) 2880 minutes
  • How many jobs the backfill scheduler considers at all: (bf_max_job_test) 2000
  • How many jobs the backfill scheduler considers per user: (bf_max_job_user) 15
  • How many jobs the backfill scheduler considers per partition: (bf_max_job_part) 200
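
Taken together, these limits decide which pending jobs the backfill loop even looks at in one iteration. The following sketch uses the values quoted above; the dictionary keys and the overall logic are simplifications for illustration, not Slurm’s actual implementation.

```python
from collections import Counter

BF_MAX_JOB_TEST = 2000  # jobs considered at all per iteration
BF_MAX_JOB_USER = 15    # jobs considered per user
BF_MAX_JOB_PART = 200   # jobs considered per partition

def jobs_backfill_will_test(pending):
    """Walk the pending queue in priority order and keep only the jobs
    that stay within the global, per-user and per-partition limits."""
    per_user, per_part = Counter(), Counter()
    tested = []
    for job in pending:
        if len(tested) >= BF_MAX_JOB_TEST:
            break
        if per_user[job["user"]] >= BF_MAX_JOB_USER:
            continue  # a user's 16th and later pending jobs are skipped
        if per_part[job["partition"]] >= BF_MAX_JOB_PART:
            continue
        per_user[job["user"]] += 1
        per_part[job["partition"]] += 1
        tested.append(job)
    return tested
```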

Priorities

cf: https://slurm.schedmd.com/priority_multifactor.html

The multifactor priority plugin uses multiple “factors” to construct a “priority” for a job (hence the name), which, in turn, is one of the criteria by which the scheduler selects the next job to consider for scheduling. The factors are:

  • Age:
    • The time the job has already waited in the queue. Older jobs get a higher value here. Crucially, jobs reach the maximum age priority after 5 days and stop accruing age priority after that. Jobs that are on hold, unable to run due to limits, or waiting for a dependency also do not accrue age priority during that time.
  • Association:
    • Associations (i.e. users) can have their own priority factor. This is disabled on the GWDG HPC cluster.
  • Fair-Share:
    • A calculated value, dependent on the jobs a user/account has already submitted compared to other users in the tree around them (see below). Users/Accounts with many previous jobs get a lower value here.
  • Job size:
    • The number of nodes a job will allocate. Larger jobs get a higher value here.
  • Nice:
    • Users can give their jobs an arbitrary (positive) nice value on submit, which will be subtracted from their priority. The higher the value, the lower the priority (the “nicer” one is to other users). Defaults to 0.
  • Partition:
    • Partitions can have a priority factor. On the GWDG HPC cluster, the *:test partitions have a factor of 10 here, all other partitions a factor of 1.
  • QOS:
    • The QOS a job is submitted with can have a priority factor. On the GWDG HPC cluster, the interactive QOS has a factor of 1000, the 2h QOS a factor of 100. All other QOS have a factor of 0.
  • Site:
    • A global site priority factor can be configured to be calculated via a script for each job. This is disabled on the GWDG HPC cluster.
  • TRES:
    • Different TRES (Trackable RESources, i.e. CPUs, RAM, GPUs, etc) can have different priority factors configured. This is disabled on the GWDG HPC cluster.

The different priority factors are each normalized to the highest available value of that factor (thus becoming a number between 0.0 and 1.0) and then multiplied by their corresponding weights:

  • PriorityWeightJobSize: 1,000
  • PriorityWeightAge: 10,000
  • PriorityWeightQOS: 10,000
  • PriorityWeightFairshare: 100,000
  • PriorityWeightPartition: 1,000,000

The priority factors of a job, multiplied by their corresponding weights, are then summed up, the nice value is subtracted, and the result becomes the job’s priority. Since some of the underlying values, such as Age or Fairshare, change over time, this calculation is updated periodically. One can view a job’s priority, as well as the individual factors, with the command sprio.
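
Putting the pieces together, the calculation can be sketched as follows. The weights are the ones listed above; the normalized factor values in the example are made up.

```python
# Weights as configured on the cluster (see the list above).
WEIGHTS = {
    "job_size":  1_000,
    "age":       10_000,
    "qos":       10_000,
    "fairshare": 100_000,
    "partition": 1_000_000,
}

def job_priority(factors, nice=0):
    """Each factor is already normalized to 0.0..1.0, multiplied by its
    weight, summed up, and the nice value is subtracted at the end."""
    weighted_sum = sum(WEIGHTS[name] * value for name, value in factors.items())
    return int(weighted_sum) - nice

# Example: a job on a *:test partition with full age priority and an
# average fairshare (all factor values here are invented).
example = {
    "job_size":  0.01,
    "age":       1.0,   # has waited 5 days or more
    "qos":       0.0,
    "fairshare": 0.5,
    "partition": 1.0,   # *:test partitions have the highest factor
}
print(job_priority(example))  # -> 1060010
```

The configured weights themselves can be displayed with sprio -w.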

Fairshare

cf: https://slurm.schedmd.com/fair_tree.html

Fairshare is a mechanism that tries to prioritise jobs of users who have, so far, been under-served by the machine. For that, it calculates a fairshare value between 0.0 and 1.0 for each job, which then goes into the priority calculation of the job (see above).

On the GWDG HPC cluster, the “Fair-Tree” algorithm is used. This algorithm ensures that if an entire tree branch is under-served compared to a different branch (e.g. NHR users compared to SCC users), all entries of that tree branch get a higher priority.

In the “Fair-Tree” algorithm, a Level FS value is calculated for each user association, based on the usage of this association compared to its siblings in the tree. The Level FS value is then used to rank all associations in a list, with the highest Level FS value first. The highest-ranked association receives a Fairshare value of 1.0, and every other association a Fairshare value of its rank divided by the total number of associations.

To calculate the Level FS value for an association, the formula is LF = S / U, where S is the normalized share value (i.e. the shares of the current association divided by the sum of the shares of the current association and its siblings in the tree) and U is the normalized usage value (i.e. the usage of the current association divided by the sum of the usage of the current association and its siblings in the tree).
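
A minimal sketch of this calculation for the children of one node in the tree, using invented share and usage numbers (the real Fair Tree algorithm ranks users across the whole tree, level by level, and handles zero usage specially):

```python
def level_fs(assocs):
    """LF = S / U per association, where S and U are the shares and the
    usage normalized over the association and its siblings."""
    total_shares = sum(a["shares"] for a in assocs)
    total_usage = sum(a["usage"] for a in assocs)
    for a in assocs:
        s = a["shares"] / total_shares
        u = a["usage"] / total_usage if total_usage else 0.0
        a["level_fs"] = s / u if u else float("inf")
    return assocs

def fairshare_values(assocs):
    """Rank associations by Level FS (highest first); the best gets 1.0,
    the others get rank divided by the total number of associations."""
    ranked = sorted(assocs, key=lambda a: a["level_fs"], reverse=True)
    n = len(ranked)
    return {a["name"]: (n - i) / n for i, a in enumerate(ranked)}

siblings = [
    {"name": "alice", "shares": 1, "usage": 500.0},
    {"name": "bob",   "shares": 1, "usage": 100.0},
    {"name": "carol", "shares": 2, "usage": 400.0},
]
print(fairshare_values(level_fs(siblings)))
# -> {'bob': 1.0, 'carol': 0.666..., 'alice': 0.333...}
```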

For more information on the algorithm, see https://slurm.schedmd.com/fair_tree.html

One can view the Level FS value, the shares (RawShares and NormShares), and the resulting Fairshare value with the command sshare -l.