New broadwell hardware partition on PIK cluster

Quickstart

The new broadwell partition has 1536 CPUs in total; each node has 32 CPUs and 128 GB of RAM (3781 MB per core allocatable in SLURM). You will need the following directive in your SLURM launch scripts to run jobs on the new machines:

#SBATCH --partition=broadwell

The same QOSes with the same limits are supported as on the standard partition (512 CPUs per job on short, 256 on medium and 8 on long). You can also use:

srun --partition=broadwell

for starting quick jobs from the login console.
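
For example, a short interactive test on the new partition could be started like this (the program name, task count and time limit are placeholders to adjust to your own run):

srun --partition=broadwell --qos=short --ntasks=1 --time=00:10:00 ./my_program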

If you do not specify --partition anywhere, your job goes on the standard partition, as always. 

If you specify --partition=broadwell but no --qos, your job goes on the short QOS.

If you use --exclusive anywhere in your job launch, you reserve a whole node - 16 CPUs on standard, but 32 on broadwell. Please make sure you understand the implications of using --exclusive, as you may otherwise unintentionally reserve double the usual number of CPUs.
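
For example, an exclusive job on broadwell always reserves the full 32-CPU node:

#SBATCH --partition=broadwell
#SBATCH --exclusive # reserves the whole node: 32 CPUs here, 16 on a standard node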

Since the new broadwell machines have 32 CPUs each, MPI users may want to adjust their launch scripts to reflect that:

--tasks-per-node=32

This also means that the maximum number of nodes you can use on broadwell is 16 (instead of 32 as on standard) when running 32 MPI tasks per node, since the short QOS is limited to 512 CPUs per job.
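
Put together, a minimal launch script for a 512-CPU MPI job on broadwell might look like the sketch below (job name, time limit and program name are placeholders - adjust them to your own job):

#!/bin/bash
#SBATCH --job-name=my_mpi_job # placeholder job name
#SBATCH --partition=broadwell
#SBATCH --qos=short           # short allows up to 512 CPUs per job
#SBATCH --nodes=16            # 16 nodes x 32 tasks = 512 CPUs
#SBATCH --tasks-per-node=32   # one MPI task per core
#SBATCH --time=01:00:00       # placeholder time limit

srun ./my_mpi_program         # placeholder binary; srun picks up the allocation above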

Please also note that jobs cannot dynamically go on one or the other partition depending on the queue state. If the partition that you specify in your job script is busy, your job may wait in the queue even if another partition is not fully busy.

Builds that users have optimized for our previous Haswell architecture machines should work very well on the new machines, so there is no strict requirement to rebuild anything. 

What to expect from the new machines

The new machines have faster memory (2400 MHz vs 2133 MHz) and higher locality (16 cores per CPU connected through 2 NUMA memory interconnects per CPU die).

They also have a lower base clock speed - 2.6 GHz on broadwell vs 3.5 GHz on standard. Our early testing indicates that some workloads may experience slowdowns in the 2-15% range, while others may see speed-ups of up to 12%.

Overall, we expect workloads that are limited by memory and process-communication speed to benefit from having more cores per machine and per CPU die, which minimizes inter-node network communication. In the same way, we expect hybrid jobs (MPI+OpenMP/other) to benefit from being able to run more parallel threads on each node. This also implies that workloads relying heavily on local parallelism - such as OpenMP, MKL threads, or the multiprocessing paradigms in R, Python and Octave - will benefit from having more CPUs and double the memory on each broadwell machine. We would especially encourage single-node jobs using non-MPI parallelization to consider broadwell machines by default, as having more threads per node may increase their efficiency.

We would also like to encourage jobs with higher per-node memory requirements to go on the broadwell partition. The new broadwell machines have a slightly higher memory-per-CPU allocation defined in the SLURM configuration: instead of 3500 MB per core as on standard, users on broadwell get 3781 MB per core. On a per-node basis, this results in a maximum allocation of 120 992 MB available to user processes; the rest of the memory is used by the operating system and the parallel filesystem daemon. Only if the total memory per node on broadwell is not sufficient should jobs go on the ram_gpu partition nodes, which have even more memory (245 GB per node) available to user processes.
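
If a job needs most of a node's memory, the request can be made explicit, for example (a sketch - choose values matching your actual needs):

#SBATCH --partition=broadwell
#SBATCH --mem=120992 # MB per node; the full user-available memory on broadwell

Alternatively, --mem-per-cpu=3781 requests memory per allocated core; note that --mem and --mem-per-cpu are mutually exclusive.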

More complex launch configurations

Users running jobs with hybrid parallelism (MPI + OpenMP or other local parallelism) may need to further adjust options such as --cpus-per-task and the OpenMP or MKL thread-count environment variables. The same applies to Python multiprocessing: you can distribute workloads to different nodes using mpi4py and then use multiprocessing-based parallelism on each node. In that case, you need to size the worker pool to match the number of CPUs allocated per task (--cpus-per-task).

An example of a hybrid MPI+OpenMP or MKL job using 512 CPUs may look like this:

#SBATCH --nodes=16 # for a 512 CPU job on broadwell
#SBATCH --tasks-per-node=1 # put 1 MPI task on each node
#SBATCH --cpus-per-task=32 # with 32 OpenMP threads running locally

export OMP_NUM_THREADS=32 # make OpenMP aware of your allocation
export MKL_NUM_THREADS=32 # if using the MKL library - overrides OMP_NUM_THREADS for MKL

Please note that many third-party libraries use OpenMP parallelization internally and you may not always be aware of this (SciPy, for example). The NumPy library in Python is a notable user of MKL, as are the newest R module builds on our cluster.
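
One way to keep such implicit thread pools consistent with your allocation is to derive the thread counts from SLURM's environment instead of hard-coding them - a small sketch:

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1} # SLURM sets this when --cpus-per-task is given
export MKL_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1} # keep MKL in line with OpenMP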

Compiler flags

The Broadwell architecture represents a die-shrink of Haswell – the architecture all our machines had until now. This means that the two architectures are very similar. However, there are some new instructions available on Broadwell which may benefit some rather specific use-cases (ADX, RDSEED and PREFETCHW, the last of which was introduced by AMD a long time ago). Notably, ADX can improve addition of large unsigned integers and further extends large-integer arithmetic beyond the MULX instruction that was already available on Haswell.

While compiler tuning will always depend heavily on the workload, if you are looking for the simplest default, I would recommend always using the latest Intel compilers together with the CORE-AVX2 flags like this:

export CFLAGS="-xCORE-AVX2 -mtune=CORE-AVX2" # Intel C
export CXXFLAGS="-xCORE-AVX2 -mtune=CORE-AVX2" # Intel C++
export FFLAGS="-xCORE-AVX2 -mtune=CORE-AVX2" # Intel Fortran

This will build software that is suitably optimized and compatible with all our machines. Applying these flags enables your software to use modern CPU instruction sets; if you do not apply them, your software is typically built for CPUs that are more than 15 years old (the first generation of AMD Athlon / Opteron). These flags are separate from "-O3" or "-O2", which are better chosen by the developers most familiar with the software being built, as some optimizations enabled by -O3 may break builds in rare circumstances.
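
As a concrete illustration, a single-file build with the Intel compilers could then look like this, assuming the exports above (source and output names are placeholders, and -O2 is only an example optimization level):

icc $CFLAGS -O2 -o my_tool my_tool.c # Intel C
ifort $FFLAGS -O2 -o my_model my_model.f90 # Intel Fortran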

If you need to use GNU (GCC) compilers, then the equivalent architecture flag is “-march=native” when building on login nodes. This enables instruction sets that are compatible with, and suited for, all our compute nodes in all partitions.
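
The equivalent exports for the GNU compilers would then look like this:

export CFLAGS="-march=native" # GNU C, built on a login node
export CXXFLAGS="-march=native" # GNU C++
export FFLAGS="-march=native" # GNU Fortran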

If you choose to use highly specific tuning flags on either Intel or GNU compilers, please note that “-mtune=broadwell” may not bring the best performance on Haswell machines.

Finally, if you explicitly specify “-march=broadwell” when building with GNU compilers, the program will fail on Haswell machines (including the login nodes) - that is, everywhere except the broadwell partition.