What does "QOSGrpNodeLimit" mean?
When submitting a job to certain partitions, e.g. scc-cpu, it will likely wait in the queue with the reason QOSGrpNodeLimit.
This is normal.
It means essentially the same as a plain Pending reason: your job is waiting until resources are available to run it.
The SCC partitions are “full” most of the year, and your jobs will have to wait in the queue until it is their turn.
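You can check the state and pending reason of your own jobs with squeue; for example (the job ID and output below are purely illustrative):

    $ squeue -u $USER -o "%.12i %.12P %.10T %.20r"
           JOBID    PARTITION      STATE               REASON
        12345678      scc-cpu    PENDING      QOSGrpNodeLimit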
But sinfo -p scc-cpu shows there are lots of idle nodes!
Some partitions, such as scc-cpu, are configured as so-called “floating partitions”.
In practice this means that scc-cpu is a smaller subset of the larger partition medium96s, limited to a certain number of nodes at a time.
While a large number of nodes are logically, or rather potentially, part of these floating partitions, and are displayed as such by sinfo, the total number of nodes that can be in use at any one time is restricted via a Quality of Service (QOS) setting attached to these partitions.
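If you want to see this limit yourself, you can look up which QOS is attached to a partition and what its group limits are. A sketch, assuming the QOS happens to be named scc-cpu; the node count shown is made up:

    $ scontrol show partition scc-cpu | grep -o "QoS=[^ ]*"
    QoS=scc-cpu
    $ sacctmgr show qos scc-cpu format=Name,GrpTRES
          Name       GrpTRES
    ---------- -------------
       scc-cpu       node=50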
The reason scc-cpu, for example, is configured this way is very practical.
The nodes purchased from SCC funding for the “Sapphire Rapids” generation of the cluster are identical to the nodes purchased from NHR funding for the much larger Emmy Phase 3 partitions.
It is a lot more efficient to use the same rack space, cooling system, networks, deployment system, images, and software stack than to duplicate all of them.
The same goes for the node configuration, which allows these nodes to also be part of the NHR partitions.
The “floating” means that the individual nodes from medium96s that can be part of scc-cpu, for instance, are not fixed, but can change. Only the maximum number of nodes in use at any one time is constant.
This allows the set of nodes in use to adapt to current maintenance schedules, current load, and demand from different groups of users, but it is also why the generic Pending reason is replaced by QOSGrpNodeLimit.
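You can see the overlap between the two partitions directly by comparing their node lists; the node names below are hypothetical, but on the real system sinfo shows the scc-cpu list drawn from the medium96s nodes:

    $ sinfo -p scc-cpu,medium96s -o "%P %N"
    PARTITION NODELIST
    scc-cpu c[1001-1200]
    medium96s c[1001-1200]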
As stated above, it is absolutely normal and expected that jobs have to wait for resources when the HPC system is busy. You can see how busy the partitions available to your current username are in the cluster load table shown on login via SSH.
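Besides that table, a quick way to gauge current demand is to count the pending jobs in a partition (the number below is illustrative):

    $ squeue -p scc-cpu -t PENDING -h | wc -l
    1342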