Cluster Update January 2018 notes


New broadwell hardware partition in SLURM

The new broadwell partition has 1536 CPUs in total; each node has 32 CPUs and 128 GB RAM, or 3781 MB per core. To run jobs on the new machines, add the following line to your SLURM launch scripts:

#SBATCH --partition=broadwell

The same QOSes are supported with the same limits as on the standard partition (512 CPUs per job on short, 256 on medium and 8 on long). Please read the instructions here for more details.
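For orientation, a minimal sketch of such a launch script (the job name and my_program are placeholders to adapt):

#!/bin/bash
#SBATCH --partition=broadwell
#SBATCH --qos=short               # short/medium/long as described above
#SBATCH --ntasks=32               # one full 32-CPU broadwell node
#SBATCH --job-name=my_job

module load intel/2018.1
srun ./my_program                 # placeholder executable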

Intel Parallel Studio XE Cluster Edition 2018 Update 1  
All users are encouraged to test new builds of their software using the newest Intel toolchain and to adopt it as the default.

module load intel/2018.1

The newest Intel tools release represents four upgrade steps since 2017.4; you can find the full release notes here.

Some of the updates described in the release notes that may be of most interest to our users are: 1) full C++14 and initial C++ 2017 draft support; 2) full Fortran 2008 and initial Fortran 2015 draft support; 3) MKL now sets the PKG_CONFIG_PATH environment variable, which should greatly simplify its inclusion in many builds.
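As a hedged sketch of what the new MKL pkg-config support enables (the exact package names vary between MKL releases; my_prog.c is a placeholder):

module load intel/2018.1
pkg-config --list-all | grep -i mkl        # list the MKL .pc files now visible
icc my_prog.c $(pkg-config --cflags --libs mkl-dynamic-lp64-seq) -o my_prog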

Please keep in mind that since the first 2017 release, our intel modules contain not just the compilers, but also the newest Intel MPI tool-chain, the MKL math libraries and every other Intel XE Tools Cluster Edition component. This means you do not need to load Intel compilers and MPI separately, as everything is available through this single module.

Modules from the compiler/intel and parallel/mpi paths are obsolete and maintained only for backwards compatibility. All users are encouraged to update their launch scripts, build recipes and applications to use the newest Intel toolchain at their earliest convenience.

The only part of the Intel tools that is not actually invoked using the intel modules is the Intel Distribution for Python, which is distributed through anaconda and available with the anaconda modules described below.

Anaconda/5.0 with Intel Distribution for Python

Older versions of the anaconda modules remain operational. However, the anaconda/5.0.0_py3 module has been available since November 2017 and contains recent Python libraries as well as the most recent Intel Distribution for Python, with performance-optimized builds for many components of the Python ecosystem, including the Python interpreter itself. To use it do:

module load anaconda/5.0.0_py3 # Anaconda with Python 3
source activate idp # intel distribution for Python environment
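A quick way to verify that the Intel build is active (the version banner should identify the Intel distribution):

python -c "import sys; print(sys.version)"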

IBM Spectrum Scale (GPFS) parallel file-system updated

After the upgrade, the IOR benchmark shows read/write performance that is almost double the last measurements performed in October 2017. We observe typical IOR speeds of ~60 GB/s read and ~30 GB/s write using 256 MPI tasks on 16 nodes. IO-bound applications can be expected to see notable performance improvements.

SLURM 17.11.2-1

SLURM 17.11.2-1 has been installed, replacing the previous 16.05.8, and brings many important bug fixes. No changes to existing user launch scripts are foreseen. You can find the detailed SLURM release notes for our current version here.

One notable feature in the new SLURM is the ability to create heterogeneous jobs. This means that job steps can have separate configurations, including the ability to run on different partitions. Expert users can review this functionality in SLURM documentation here.
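As a sketch of the 17.11 syntax (the two executables are placeholders): components of a batch script are separated by #SBATCH packjob lines and launched together with a “:”-separated srun:

#!/bin/bash
#SBATCH --partition=standard --ntasks=1
#SBATCH packjob
#SBATCH --partition=broadwell --ntasks=32

srun ./pre_processor : ./solver   # one job, two differently configured components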

CUDA 9.0 

Users interested in the development of GPGPU applications may benefit from the newer NVIDIA CUDA toolkit version available with

module load cuda/9.0

One of the more interesting changes in the new CUDA is that the nvcc compiler adds support for the C++14 language specification.

Please keep in mind that you should only use this module on ram_gpu partition nodes, which are equipped with GPUs and which you request using

#SBATCH --gres=gpu:1 

in your launch scripts, or alternatively from the vis[01-02] nodes, which are also equipped with GPUs. On any other machine, a program attempting to load the NVIDIA driver will crash on start because no NVIDIA GPU is present (this has always been the case and does not represent a change, simply a reminder).
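Putting the pieces together, a minimal sketch of a GPU launch script (my_cuda_app is a placeholder binary):

#!/bin/bash
#SBATCH --partition=ram_gpu       # GPU-equipped nodes
#SBATCH --gres=gpu:1              # request one GPU
#SBATCH --ntasks=1

module load cuda/9.0              # safe here: the node has an NVIDIA GPU
./my_cuda_app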

SLES 12 SP1 to SP2

The most notable change is the kernel upgrade from 3.12 to 4.4.103, along with updates to many system libraries such as OpenSSL. The full release notes of SLES 12 SP2 can be found here.

Common libraries

curl/7.58.0 zlib/1.2.11 libiconv/1.15  libssh2/1.8.0 texlive/2017 json-c/0.13 fftw/3.3.7 gsl/2.4

The more recent zlib release fixes some bugs that could cause data corruption.

The most recent version of curl contains hundreds of security fixes and updates. Loading the newest curl module will also load the newest zlib and brotli.

GNU libiconv is a very common dependency that is used for converting between different character set encodings. It is one of the requirements for building R. 

Libssh2 has been rebuilt with the new OpenSSL from the SUSE system libraries instead of the older openssl module, which causes conflicts if loaded.

You may wish to choose these newer versions when building software on our cluster.

GCC 7.3 and GNU toolchain

Available together with dependencies gmp/6.1.2 mpfr/4.0.0 mpc/1.1.0 isl/0.18 binutils/2.30 cmake/3.10.2 automake/1.16

We recommend that users always use the latest available Intel toolchain, including compilers, MPI, the MKL libraries and the rest, with “module load intel”.

For those who require GCC, it is better to use the newest stable GCC 7.3. The GCC 7 series is a much more evolved compiler than the 5 and 6 series, and the popular Phoronix benchmarks point to 7.x compilers as generating faster binaries in many scientific benchmarks. If in doubt but you really need GCC, use this module:

module load compiler/gnu/7.3.0

This GCC module includes support for the c, c++, fortran, lto, jit, objc and obj-c++ front ends, but not ada or go. Let me know if you require the Ada or Go languages. MPI programs are strongly encouraged to use the Intel MPI and toolchain.
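For example (my_prog.c and my_model.f90 are placeholder sources):

module load compiler/gnu/7.3.0
gcc -O3 -march=native my_prog.c -o my_prog
gfortran -O3 -march=native my_model.f90 -o my_model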

Building GCC also requires the GMP, MPFR, MPC and ISL libraries. These have been built and set up as new modules, since they provide some low-level math APIs and may occasionally come in handy for users building other programs, not just GCC. All have been “make check”ed before installation.

Updating the compiler toolchain is usually a good time to upgrade to the most recent binutils 2.30, automake 1.16, make 4.2.1 and the newest cmake 3.10.2. The make module also allows running build jobs with SLURM on compute nodes.

Please make sure you load the cmake module with the explicit 3.10.2 version, as otherwise the older cmake module is loaded due to the alphanumeric sorting of version numbers. Like the previous cmake module, the new one depends on the GNU compiler toolchain (7.3.0 in this case) as well as curl, zlib and brotli.
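In short:

# module load cmake              # may resolve to the older module via sorting
module load cmake/3.10.2         # explicit version gets the new release
cmake --version                  # verify that 3.10.2 is in use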

Valgrind 3.13.0

The most recent valgrind release has been built using GCC 7.3.0 and is available with:

module load valgrind/3.13.0
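A typical invocation (my_prog is a placeholder binary):

valgrind --leak-check=full ./my_prog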

HDF5 1.8.20 and HDF5 1.10.2 built using Intel 2018.1

For those wishing to use a highly optimized build of the long-term stable HDF5 branch, version 1.8.20 (November 28, 2017) has been built with the Intel 2018.1 toolchain and is available as the hdf5/1.8.20/intel/parallel or hdf5/1.8.20/intel/serial module.

Since HDF5 can only be built with Intel 2018.1 starting from version 1.10.2, it has been added to the modules only recently (April 2018). The build is heavily performance optimized and has also passed all make checks. MPI users will need to add this to their modules to use it:

# MPI users need this 
module load hdf5/1.10.2/intel/parallel
# OpenMP and all other non-MPI users need this
module load hdf5/1.10.2/intel/serial
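A hedged compile sketch, assuming the standard HDF5 compiler wrappers are part of these builds (my_h5_prog.c is a placeholder):

h5pcc my_h5_prog.c -o my_h5_prog  # parallel wrapper; use h5cc with the serial module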

NetCDF-C 4.6.1 built using Intel 2018.1

The most recent NetCDF-C release from March 20, 2018, is available as a serial and parallel build in these modules:

# MPI users need to load these modules
module load hdf5/1.10.2/intel/parallel
module load netcdf-c/4.6.1/intel/parallel
# OpenMP and all other non-MPI users need these modules
module load hdf5/1.10.2/intel/serial
module load netcdf-c/4.6.1/intel/serial
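A sketch for the serial case; nc-config ships with netcdf-c and reports the required flags (my_netcdf_prog.c is a placeholder):

module load hdf5/1.10.2/intel/serial
module load netcdf-c/4.6.1/intel/serial
icc my_netcdf_prog.c $(nc-config --cflags --libs) -o my_netcdf_prog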

GAMS 25.0.2  

The newest version of GAMS, from the January 31, 2018 release, is installed and available as

module load gams/25.0.2

PROJ 5.0.1

After more than 25 years, the project formerly known as PROJ4 has reached its next major release. While the first release was notorious for failing tests, the first fix release passes all tests and has been built with the native instruction sets for our cluster using GCC 7.3.0. All the newest datum grids are included in this module.

module load proj4/5.0.1

GDAL 2.2.4 performance optimized build

The most recently available version of gdal has been built using the Intel toolchain with full architecture optimizations, interprocedural optimizations, Intel MKL, libm and OpenMP. Unlike previous builds on our cluster, the Intel-built HDF5 and recent versions of most other libraries have been used for this build:

module load gdal/2.2.4

Loading this module will also load Intel and curl modules as dependencies.

LLVM and Clang versions 6.0.0 and 5.0.1

As mentioned previously, we recommend the Intel compilers and toolchain for our users by default.

However, many recent advances in scientific computing and performance improvements require technologies from LLVM and Clang. Clang is a comparatively young compiler, but it is rapidly becoming more popular and therefore more likely to be encountered by our users as a build dependency for some applications, which is why both are being made available as cluster modules:

module load compiler/llvm/6.0.0 # Clang is part of LLVM
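A quick usage sketch (my_prog.cpp is a placeholder source):

clang --version                   # verify which release is loaded
clang++ -std=c++14 -O2 my_prog.cpp -o my_prog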

LLVM and Clang are also starting to be more widely used in industry due to their flexibility, multiplatform support, faster tracking of C++ standards and JIT features, which are becoming part of many performance acceleration approaches.

If this is the first time you hear about LLVM or Clang, you may wish to read this. You may also find these Phoronix benchmarks (first and second) informative.

Julia 0.6.2

The Julia language is one such novel technology: a language gaining momentum in analytics and scientific programming that depends on the LLVM toolchain.

module load julia/0.6.2

The Julia library DifferentialEquations.jl may be of particular interest to users at PIK. A deeper introduction to Julia can be found here.
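A quick sanity check and package install from the shell (Julia 0.6 still uses the pre-1.0 package manager syntax):

julia -e 'println(VERSION)'                   # should print 0.6.2
julia -e 'Pkg.add("DifferentialEquations")'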

Known issues and changes

Intermittent login node failures

We are experiencing login node crashes every 1-2 days. This issue emerged suddenly after we returned the machines to production, and IBM support has promised a fix in February. If you suddenly lose connection to the cluster, please try again in a few minutes; we are aware of the issue and reboot the nodes. Please do NOT regenerate your keys or assume any other type of problem.

You can also use this network-filesystem manual to configure mounts to the parallel filesystem, home and project directories that will work even when login nodes are unavailable. Please note that this does not affect any jobs running on the cluster, only interactive sessions on login01 and login02. 

MemPerCPU lowered on standard partition

Due to issues with low memory being reported by the SLURM node daemons after the SLES 12 upgrade to SP2, MemPerCPU has been reduced from 3584 MB to 3500 MB per core to leave more RAM for the file-system daemons. This reduces the maximum available memory for a single-node job on the standard partition from 57.34 GB to 56 GB. The intention is that jobs that require more memory go to the new broadwell partition.
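For example, a job that needs more than 3500 MB per core could request (per-core value taken from the broadwell description above):

#SBATCH --partition=broadwell
#SBATCH --mem-per-cpu=3781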

ncview: error while loading shared libraries:

Users experiencing issues starting the special version of ncview need to load the following modules in exactly this order:

module load intel/2017.4 # intel/2017.1 can be used as well
module load hdf5/1.8.15p1/serial

Users who require parallel builds of hdf5 can use this hdf5 module instead of hdf5/1.8.15p1/serial (above):

module load hdf5/1.8.18/intel/parallel

OpenSSL version mismatch. Built against 100020af, you have 100010af

If you get this error message, it means that the program you are using expects to find the new OpenSSL 1.0.2 from the system libraries, but you have

module load openssl/1.0.1j

being loaded somewhere in your .profile, .bashrc or job script, or it is loaded non-transparently by another module. Unloading openssl/1.0.1j, and not loading it in the first place, resolves this.
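To check for and remove the old module in an interactive session (note that module list prints to stderr):

module list 2>&1 | grep -i openssl
module unload openssl/1.0.1j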

git/2.16.1 module available, git 2.12 is system default (no modules)

Since the upgrade, the default system git version is 2.12 (instead of 1.18). This is a comparatively recent release which works without any modules. 

The latest release from January 2018 can be invoked with:

module load git/2.16.1

This git module has been built with all the newest libraries, such as curl/7.58.0, zlib/1.2.11 and libiconv/1.15.

Users who load older git modules may run into the OpenSSL version mismatch error (above). To avoid it, make sure you do not “module load openssl/1.0.1j” when using the older module-based versions of git.

Updates to sload and sclass commands

To get the % utilization across all QOSes, you can use the sclass command, which now accounts for the new broadwell machine allocations.

You may also want to use the sload command to see if you have misconfigured your jobs in a way that attempts to use more CPUs than there are on the machine. The updated version of sload will list jobs running on machines whose CPU load is higher than the number of CPUs actually present, which may help in debugging job configuration problems. It is not a bad idea to type sload after your job starts if you change the number of tasks per node fairly often and run large numbers of jobs.
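Both commands are simply typed at the shell prompt:

sclass   # % utilization across all QOSes, including broadwell
sload    # per-node CPU load; flags jobs using more CPUs than a node has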