Minimizing job wait times and getting more done on the cluster


During the recent discussions about job wait times in the cluster queue, it has come to my attention that not all of our users are aware of the options that already exist on our current configuration to improve them.

A minority of users certainly is aware of these options and uses them. To give everyone else a more level playing field, I have compiled here some practices that will improve the wait time of your jobs in the cluster queue. Following these practices will give you an advantage over users who do not.

1. Estimate the time-limit for your jobs

The time-limit you ask for when you submit a job is among the strongest factors influencing its wait time when the cluster is at full load. Each time you submit a SLURM job you are asking for a specific amount of run-time on the machine. When you do not specify it, SLURM assumes the default QOS run-time: 24 hours (for short and io) or more (for medium and long).

Among the 197 users who had at least one completed job in 2017, only 76 ever set a non-default time-limit. These users had, on average, 9x lower wait times in the queue. Considering that the average run-time of a completed job in 2017 was only 70 minutes, there is ample room for improved run-time estimates. Looking at completed jobs instead of users, 77% in 2017 were submitted with the default run-times. Even among COMPLETED jobs (in all QOSes) that ran for more than 1 hour, the average run-time was 8 hours, still significantly lower than any of the default time-limits.

This situation illustrates why even the most basic attempts at estimating the run-time of your jobs can noticeably lower wait times. When you ask for as much run-time as the next user, you compete with all of them and your wait time is determined by your Fairshare factor. When you ask for (a lot) less, you only compete with jobs that fit inside the time allocations that suddenly become available when other jobs end sooner than expected.

Now, of course, you do not want your jobs to be killed because you asked for a time-limit that is too small (unless you have a very reliable restart mechanism). Also, not all job run-times are predictable: IO throughput depends on the activity of other users on the system, and not all experiments scale in a predictable way. But if you do have a reasonable way to predict the run-time of your jobs with, let's say, a 25% margin for error, it is in your interest to do so.

You can ask SLURM for a custom time-limit in 3 ways:

1) Adding --time to your job launch scripts as in this example:

# request for 1.5 hours of run-time
#SBATCH --time=01:30:00

2) Adding a command-line parameter to the sbatch call without changing the existing job script:

# myscript.sh stands for your existing job launch script
sbatch --time=01:30:00 myscript.sh

This is the same as if you modified the job launch script, but it allows you to tweak the time-limit from the command line each time you call sbatch.

3) You can also use --time with srun to launch quick tests as SLURM jobs. It is better to attach a small time-limit to such tests so that they can be scheduled with the smallest wait time:

# ask for 1 CPU and 1 second of time
srun --time=00:00:01 hostname

Using srun to directly start SLURM jobs is an entirely valid way to work in many simpler cases. What happens when you do this is that SLURM creates a job in your name, named after the program you call after srun, and pipes stdout and stderr to your terminal instead of the .out and .err files you normally specify in an sbatch launch script. Closing the terminal will not cancel jobs started this way, but you will lose the stdout and stderr logs.

Using srun is especially handy for testing how to launch programs with lots of command-line parameters that you want to tweak repeatedly. It is also very useful if you just want to check whether your program starts successfully as part of the development process. For such tests you may benefit from specifying very small time-limits (in seconds). Even at 100% cluster load you can usually find a spare second: there are on average about 15 jobs ending every minute (data from the busiest month, August 2017), and the vast majority of them allow the backfill scheduler to free up some resources for your tests between starting jobs with longer time-limits.

1.1. How to estimate meaningful time-limits and see basic profiling data 

The most straightforward way to do this is to use the SLURM sacct command with -S to set the start of the time period. You may additionally wish to restrict the output to jobs that were logged as COMPLETED.

# all your COMPLETED jobs since beginning of March 2018
sacct -S2018-03-01 --state COMPLETED

# or just jobs COMPLETED in the last 14 days 
sacct -S`date -d "14 days ago" --iso` -s COMPLETED 

If you have a lot of jobs in your account you can use grep to filter the results further by job names used in the launch scripts and other characteristics. If you consider them representative samples for estimating future run-times, you may wish to add another 5-25% safety margin for extra peace of mind and use that as the default time-limit for similar future jobs.
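As a sketch of that safety-margin arithmetic, here is a small shell snippet that pads an observed run-time (in the HH:MM:SS format sacct's Elapsed column uses) by 25% and prints it in a form suitable for --time. The observed value below is an illustrative example, not real accounting data:

```shell
#!/bin/bash
# Illustrative helper: pad an observed run-time (HH:MM:SS, as in sacct's
# Elapsed column) by a 25% safety margin and print it in --time format.
observed="01:36:00"   # example value; substitute your own measurement

# convert HH:MM:SS to seconds
secs=$(echo "$observed" | awk -F: '{ print $1*3600 + $2*60 + $3 }')

# add a 25% margin, rounding up to a whole second
padded=$(( (secs * 125 + 99) / 100 ))

# format back to HH:MM:SS
printf -v limit '%02d:%02d:%02d' $((padded/3600)) $((padded%3600/60)) $((padded%60))
echo "$limit"   # for 01:36:00 this prints 02:00:00
```

You could then pass the result straight on the command line, e.g. sbatch --time=$limit myscript.sh.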

You can also use --parsable2 and write the results to a text file to import as a table into your preferred analysis environment:

sacct -S2018-03-01 -s COMPLETED --parsable2 --long > myjobs.txt
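Once you have such a file, even plain awk can summarize it. The snippet below is a sketch that operates on an embedded sample instead of real sacct output; the pipe-delimited layout with a header line is what --parsable2 produces, but the columns shown are only a subset (as selected with --format), and the job IDs and names are made up:

```shell
#!/bin/bash
# Sketch: average Elapsed time over completed jobs from --parsable2 output.
# The sample below stands in for a real file; it mimics the output of e.g.
#   sacct -s COMPLETED --parsable2 --format=JobID,JobName,Elapsed > myjobs.txt
# (job IDs and names here are fabricated for illustration)
cat > myjobs.txt <<'EOF'
JobID|JobName|Elapsed
1001|model_run|01:10:00
1002|model_run|00:50:00
1003|postproc|00:30:00
EOF

# skip the header, convert HH:MM:SS to minutes, and average
summary=$(awk -F'|' 'NR > 1 { split($3, t, ":"); total += t[1]*60 + t[2] + t[3]/60; n++ }
                     END { printf "%d jobs, average %.0f minutes", n, total/n }' myjobs.txt)
echo "$summary"   # prints: 3 jobs, average 50 minutes
```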

The --long parameter will also output basic profiling information stored in the SLURM accounting database. Among the data SLURM collects from your jobs, you may find these columns interesting: AveCPUFreq, MaxDiskRead, MaxDiskWrite, MaxRSS, MaxVMSize, and others. Please consult the SLURM sacct documentation on these "job accounting fields" to learn more. Please note that the ConsumedEnergy field is not reported on our cluster due to the negative performance impact of polling it.

2. Specify correct account and QOS for your jobs

2.1. Account (your group's use of resources) affects your priority

The --account parameter in launch scripts actually refers to your group. It is possible to belong to several groups: our users belong to three on average, and some belong to more than 10.

If you belong to multiple groups and do not use the correct --account line, you may end up waiting longer. Your group's (SLURM account's) past usage of compute resources affects how your Fairshare factor is calculated: members of groups that are major resource users wait longer when the cluster is 100% utilized. For each group that you belong to, you will have a different Fairshare factor (see the output of the "sshare -U" command). If you are actually working for a group that is a minor compute resource user but do not specify the correct --account parameter, you could end up waiting as long as if the job belonged to a major compute resource user group that you are also a member of (for collaboration reasons, for example).
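To make this concrete, here is a sketch that picks the account with the highest Fairshare factor from parsable sshare output. The table below is a fabricated example standing in for real `sshare -U --parsable2` output, and the column positions are assumptions; check the header line of your own sshare output before relying on them:

```shell
#!/bin/bash
# Sketch: find which of your accounts currently has the highest Fairshare
# factor. The sample below is fabricated; it imitates the pipe-delimited
# output of `sshare -U --parsable2` (column layout assumed, verify locally).
cat > myshares.txt <<'EOF'
Account|User|RawShares|NormShares|RawUsage|EffectvUsage|FairShare
biggroup|jdoe|1|0.002000|91326|0.084000|0.102341
smallgroup|jdoe|1|0.002000|1326|0.000900|0.891203
EOF

# print the account name with the highest FairShare value (assumed column 7)
best=$(awk -F'|' 'NR > 1 && $7 > max { max = $7; acct = $1 } END { print acct }' myshares.txt)
echo "$best"   # prints: smallgroup
```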

You can specify which group (at PIK, for statistical purposes only) is "charged" the compute time with:

#SBATCH --account=mygroup

In any case, you should always correctly identify the account that actually benefits from the jobs you are running; it is up to the group managers to enforce this.

2.2. QOS also affects your scheduling priority

Just as your group's use of resources affects your priority, the QOS you submit your job to also influences the priority calculation (this has always been the case). The logic behind this is that if you ask for a week of run-time, it makes sense that such jobs can wait a bit longer than those submitted to short.

Jobs submitted to the medium QOS have a 50% lower priority factor than the short QOS, and long has just 10% of short's priority.

This matters if you wish to do a lot of quick testing using scripts intended to run in the medium or long QOS. For such test cases it makes sense to also set the QOS to short. Likewise, if you already know that your job will finish in less than 24 hours, it is in your interest to make sure you actually submit it to short, instead of leaving the QOS parameter at whatever you used last time.
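For example, to do a quick test of a script that normally runs in medium, you can override both the QOS and the time-limit on the command line without editing the script itself (the script name here is a placeholder):

```shell
# quick test of an existing job script with the short QOS and a tiny time-limit
sbatch --qos=short --time=00:05:00 my_medium_job.sh
```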

3. Do not wait for cluster load to decrease – submit your jobs ASAP

The maximum load on the cluster tends to occur toward the middle of the week, and loads tend to be lighter on Monday mornings, Friday afternoons, evenings, and weekends. While it is clear that many people test their jobs during working hours, we also observe that some of our users feel discouraged from submitting their jobs when they see a long batch queue in front of them. Waiting for cluster utilization to decrease may seem intuitive, but it is not a good decision.

When you decide to wait for a less busy period, that decision typically happens during the mid-week time of high loads. So if you just check the cluster load occasionally and decide to submit your job sometime later when the load appears lower, you could end up submitting it a day or two later, and you have now lost that time. It is useful to remember that in 2017, 90% of the jobs in the short QOS (to which 77% of jobs are submitted) started in under 5 minutes. In other words, it is not worth waiting for another time to submit a job even if the load appears to be high.

Note: Unless otherwise stated, the statistics mentioned in this article were compiled for the Jan 2017 - Oct 2017 time period. Many, though not all, of the stats are available here. Feel free to contact me for details. Janis