HPC compressed streaming IO libraries

“A supercomputer is a device for converting a CPU-bound problem into an I/O-bound problem.” Quote attributed to Ken Batcher.

The recent growth in distributed commercial applications has made IO a focus for many developers and large software companies. Several new compressors aimed principally at high-speed streaming IO have recently seen rapid and wide adoption, most notably zstd, brotli and lz4. While the most common use of compression software from the user's perspective is to archive, back up and transfer files, the tools described below are primarily aimed at developers working with large streams of data in distributed environments like our cluster.

Most modern high-speed compressors re-implement a small set of classic algorithms. Most commonly these are variations of the Lempel-Ziv (LZ) family (1977) or of the Burrows-Wheeler Transform (BWT, 1994), also known as block-sorting. Both have widely used implementations on most Unix-like systems, as gzip and bzip2 respectively. Both of these timeless utilities fall far short of the demands of HPC environments.

Many of the following tools provide re-engineered LZ algorithms with better performance, especially faster decompression. LZ methods are asymmetric: they decompress much faster than they compress. Block-sorting algorithms are more symmetric: they compress and decompress at similar speeds, and can offer much better compression than LZ implementations. Due to their “block-wise” nature, they also offer predictable scaling behaviour.
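
A quick way to see how the two families behave on your own data is to run the stock gzip and bzip2 tools side by side. The following is a minimal sketch, assuming both tools are on your PATH; the helper name compare_codecs is made up for this example and is not provided by any module.

```shell
# compare_codecs FILE
# Print the size in bytes of FILE and of its gzip (LZ-based) and
# bzip2 (block-sorting) compressed forms. Illustrative helper only.
compare_codecs() {
    file=$1
    echo "original $(wc -c < "$file")"
    for tool in gzip bzip2; do
        # Skip a tool that is not installed instead of failing.
        command -v "$tool" >/dev/null 2>&1 || continue
        # -c writes the compressed stream to stdout and leaves FILE untouched.
        echo "$tool $("$tool" -c "$file" | wc -c)"
    done
}
```

Call it as compare_codecs mydata.bin, where mydata.bin is a placeholder for a representative sample of your data. To compare speeds rather than sizes, time each direction separately, for example with "time gzip -c mydata.bin > /dev/null" for compression and "time gzip -dc mydata.bin.gz > /dev/null" for decompression.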

Our cluster is now equipped with some of the most popular compression libraries currently available. If you build your own software and find that some of its components support any of these compressors, it will usually be in your interest to enable that functionality and load the corresponding modules as dependencies.

module load lizard/1.0.0

Lizard is the fastest streaming LZ implementation around. At compression level 1 you can easily double the IO throughput from our parallel filesystem using lizard, but the compression ratio at these speeds is low. Still, level 1 on our Broadwell machines can give you up to 6.5 GB/s when reading and ~750 MB/s when writing, compressing data to ~58% of its original size in a synthetic benchmark. For reference, 7 GB/s is the theoretical per-node upper limit of our filesystem; in practice it usually delivers 1-3 GB/s without any compression. Lizard is the newest evolution of LZ4 and Zstd, which have been around for a few years now. If you have reasonably redundant data and raw speed is what you want, Lizard has it.

module load utilities/zstd/1.3.4

Zstandard (zstd) provides much better compression (to about 33% of the original size in the same benchmark as lizard), but decompresses at about 1.2 GB/s on our Broadwell machines. It was developed at Facebook and is now a common dependency of many applications, including HDF5, the high-speed fst file format for R and many others. Zstd is fully open-source and, like lizard, is based on the LZ family of algorithms.
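
Because zstd reads from stdin and writes to stdout, it slots into a pipe without intermediate files. The following is a minimal sketch, assuming the zstd module above is loaded and tar is available; pack_dir and unpack_dir are hypothetical helper names, and level 1 is picked only to illustrate the fast end of the range.

```shell
# pack_dir DIR ARCHIVE: stream DIR through tar and zstd into ARCHIVE.
pack_dir() {
    # -C changes into the parent directory so the archive stores a relative
    # path; zstd -1 is the fastest standard level, -c writes to stdout.
    tar -C "$(dirname "$1")" -cf - "$(basename "$1")" | zstd -1 -c > "$2"
}

# unpack_dir ARCHIVE DEST: decompress ARCHIVE and extract it under DEST.
unpack_dir() {
    # -d decompresses, -c writes to stdout so tar can read it from stdin.
    zstd -d -c "$1" | tar -C "$2" -xf -
}
```

For example, pack_dir results results.tar.zst followed by unpack_dir results.tar.zst /some/destination (both paths are placeholders). Higher levels trade compression speed for a smaller archive.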

module load lz4/

LZ4 is widely used in the open-source community and has been around for a while: it is one of the methods supported for decompressing Linux kernel boot images to speed up the boot process. It is a well-established competitor among high-speed compressors and is much more common than lizard. Level 1 of lz4 performs close to the speeds attained by lizard, and its other compression levels provide useful trade-offs between compression speed, decompression speed and compression ratio.

module load brotli/1.0.2

Brotli is a lightweight high-speed library developed by Google that is often used for compressing web content and is a build option for modern versions of curl.

module load lbzip2/2.5

BWT algorithms such as the one implemented in bzip2 provide a very good balance between scalability and compression ratio. Lbzip2 is the most efficient implementation compatible with the bzip2 data format: it is much faster than bzip2 and scales to as many CPUs as you assign to it. If you require bz2 compatibility, lbzip2 is the one to use.

module load pv

Pipe Viewer (pv) is a utility that lets you measure data flow through Unix pipes. It is useful for testing the IO stream between compressor and disk, for example:

pv /p/tmp/jatnieks/rand.bin > /dev/null

This example reads 10 GB of random data saved in my home directory and shows the typical per-node read speed of the parallel filesystem, ~1-3 GB/s. You can use pv to estimate how well any of the compressors performs on your data.
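
To estimate how a compressor performs on your data, pv can sit on both sides of it in a pipe. The following is a sketch, assuming pv and the compressor of your choice are loaded; stream_test is a made-up helper name. Note that random data such as rand.bin is essentially incompressible, so use a representative sample of your own files instead.

```shell
# stream_test FILE CMD [ARGS...]
# Read FILE through pv, pipe it into the compressor command CMD, and pass
# the compressed stream through a second pv on its way to stdout.
stream_test() {
    file=$1
    shift
    # -c stops the two pv displays from overwriting each other,
    # -N gives each one a label. Both displays are written to stderr.
    pv -cN raw "$file" | "$@" | pv -cN compressed
}
```

For example, stream_test mydata.bin gzip -1 -c > /dev/null, with mydata.bin as a placeholder. The line labelled raw shows how fast the compressor consumes input, and the line labelled compressed shows how fast it produces output; the ratio of the two byte counts gives the compression ratio.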