Since the introduction of the current PIK supercomputing cluster in 2015, there have been several periods of extremely high utilisation, which caused difficulty for users whose workflows require fast turnaround times.
A process was initiated by the Modelers Council in 2017 in order to address this problem. The outcome has been a decision to extend the cluster job queuing system to allow almost immediate access to a limited, but substantial, amount of cluster resources per user during periods of high cluster load.
To this end, we have updated the queuing system by introducing two new queues (QOSs) called "priority" and "standby".
"priority" provides a maximum of 64 CPU cores in up to 5 concurrent jobs per user.
Queue time for jobs in this QOS is expected to be under 5 minutes.
"standby" is a preemptable QOS. This means that the job queuing system may cancel jobs running here (with SIGTERM and SIGKILL) if the resources are required for new jobs arriving in the "priority" QOS.
To make best use of this QOS, your code should be able to restart from an arbitrary point in its execution, though this is not essential.
Limits for this QOS are 512 CPU cores per job and a 7-day runtime limit; there is no limit on the number of concurrent jobs.
Preempted jobs will be automatically requeued, but this can be disabled on a per-job basis by users.
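As an illustration of what "able to restart" can look like in practice, here is a minimal sketch in Python. It assumes that the SIGTERM sent on preemption actually reaches the Python process and that the work can be split into numbered steps. The file name checkpoint.json and all function names are hypothetical and not part of the cluster configuration.

```python
import json
import os
import signal

CHECKPOINT = "checkpoint.json"  # hypothetical checkpoint file name

stop_requested = False


def request_stop(signum, frame):
    """Signal handler: only set a flag, the main loop reacts to it."""
    global stop_requested
    stop_requested = True


def install_handler():
    """Register the handler for SIGTERM (call once, from the main thread)."""
    signal.signal(signal.SIGTERM, request_stop)


def load_step(path=CHECKPOINT):
    """Return the last completed step, or 0 if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)["step"]
    return 0


def save_step(step, path=CHECKPOINT):
    """Write the checkpoint to a temporary file, then rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump({"step": step}, fh)
    os.replace(tmp, path)


def run(total_steps, path=CHECKPOINT):
    """Resume from the checkpoint and work until done or asked to stop."""
    step = load_step(path)
    while step < total_steps and not stop_requested:
        # One unit of real work would go here.
        step += 1
        save_step(step, path)
    return step
```

A job would call install_handler() once at startup and then run(...). When the job is requeued and started again, run() picks up at the last saved step instead of starting from zero.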
The existing "short", "medium" and "long" QOSs remain in place, so
existing job submission scripts do not need to be modified.
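Using one of the new QOSs is therefore a choice made per submission. The following is a minimal sketch in Python of how a submission command could be assembled, assuming the standard Slurm sbatch options --qos and --no-requeue apply unchanged on this cluster. The helper name sbatch_command and the script name job.sh are hypothetical; please check the documentation linked below for the authoritative settings.

```python
import shlex

# QOS names taken from this announcement.
KNOWN_QOS = {"short", "medium", "long", "priority", "standby"}


def sbatch_command(script, qos, requeue=True):
    """Build an sbatch argument list that selects a QOS.

    requeue=False adds --no-requeue, so that a preempted job is not
    automatically put back into the queue.
    """
    if qos not in KNOWN_QOS:
        raise ValueError(f"unknown QOS: {qos}")
    cmd = ["sbatch", f"--qos={qos}"]
    if not requeue:
        cmd.append("--no-requeue")
    cmd.append(script)
    return cmd


# Show the resulting command lines without submitting anything.
print(shlex.join(sbatch_command("job.sh", "priority")))
print(shlex.join(sbatch_command("job.sh", "standby", requeue=False)))
```

The sketch only prints the commands. Passing the returned list to subprocess.run would submit the job on a machine where sbatch is available.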
The cluster queuing system documentation is being updated to reflect this new
layout, and is available at:
https://www.pik-potsdam.de/services/it/hpc/user-guides/slurm
If you have technical questions or problems relating to this new
configuration, please send an email to cluster-support@pik-potsdam.de
Please submit any issues concerning the overall policy and QOS layout to the Modelers Council (via frank.hellmann@pik-potsdam.de).