Cluster QOS changes for GPU partition


As some of you are aware, high load on the cluster can lead to GPU nodes being unavailable for jobs even when those nodes are idle. We have therefore been planning some changes to the QOS layout on those nodes, which will require changes to your job submission scripts if you use the GPU partition.

The modifications will be carried out on 1st December 2022. Because QOSs are being removed, we cannot avoid blocking access to these nodes for at least part of the day. Access for long jobs has been blocked since 1st November, medium jobs will be blocked from 24th November, and short jobs will no longer be accepted from 30th November. (As usual, these QOSs can still be used if you specify a time-limit which ensures the job completes before 1st December; see the example below.)
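For example, a medium job submitted towards the end of November can still run if its time-limit keeps it clear of the switchover. This is a minimal sketch; the script name and exact time-limit are placeholders:

    # Submitted on 27th November: a 3-day limit finishes by 30th November,
    # so the existing medium QOS will still accept the job.
    sbatch --partition=gpu --qos=medium --time=3-00:00:00 my_job.sh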

Here is a brief rundown of the expected changes:

  • We will replace the short, medium and long QOSs on the GPU nodes (partition gpu) with gpushort, gpumedium and gpulong, keeping the same time-limits. The relative QOS sizes will also be unchanged, i.e. 100% of the partition can be used for gpushort, 50% for gpumedium and 25% for gpulong. A sample job script using the new names is shown after this list.
  • We will add gpupreempt, a QOS for long-running jobs which may be preempted (i.e. cancelled) by jobs submitted to the three main QOSs. This QOS lets you use the GPU nodes for very long-running jobs without blocking access if demand for GPUs for shorter jobs increases. Check the Slurm documentation on preemption for more information.
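As a concrete sketch, a job script for the new layout might look like the following. The GPU request, time-limit and command are placeholders, so adapt them to your own jobs:

    #!/bin/bash
    #SBATCH --partition=gpu
    #SBATCH --qos=gpushort           # or gpumedium / gpulong / gpupreempt
    #SBATCH --gres=gpu:1             # placeholder GPU request
    #SBATCH --time=01:00:00          # must fit within the QOS time-limit

    srun ./my_gpu_program            # placeholder command

Jobs submitted to gpupreempt should be prepared to be cancelled part-way through, e.g. by checkpointing regularly and resubmitting.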

Required actions:

  • Replace the QOS names (short, medium, long) in your job submission scripts with the new names (gpushort, gpumedium, gpulong or gpupreempt); the commands below may help if you have many scripts to update.
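Something along the following lines can locate and rewrite the old names. It assumes the QOS is set via an --qos= option in scripts under ~/jobs (an example path) and that GNU grep/sed are available; check the matches before editing in place:

    # List scripts that still use the old QOS names
    grep -rlE -e '--qos=(short|medium|long)\b' ~/jobs

    # Rewrite them, prefixing "gpu" to the old name (keeps a .bak copy of each file)
    grep -rlE -e '--qos=(short|medium|long)\b' ~/jobs | \
        xargs sed -i.bak -E 's/--qos=(short|medium|long)\b/--qos=gpu\1/'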

The documentation will be updated once the changes have been completed.

Tip: If you accidentally submit with the wrong QOS, you can modify the queued job (and avoid cancelling and resubmitting) like this:

    scontrol update job=<jobid> qos=<new QOS name>
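To check that the change has been applied, the QOS field in the scontrol output for the job should now show the new name:

    scontrol show job <jobid> | grep -o 'QOS=[^ ]*'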