Compressor Microbenchmark

Symas Corp., February 2015

Using the same DB engines as in our previous On-Disk and In-Memory microbenchmarks, we investigate the effect of various compression libraries on database size and performance. The libraries are those provided in Ubuntu trusty (14.04): libbz2 1.0.6-5, liblz4 0.0-r114, liblzma 5.1.1alpha, liblzo2 2.06-1.2, libsnappy 1.1.0-1, and zlib1g 1:1.2.8.dfsg. Not all of the DB engines have native support for all of the compression libraries, so the PolyZ wrapper was used where needed.

Some of the DB engines also have particular compression algorithms statically built into their code bases. For example, TokuDB has its own QuickLZ and LZMA implementations. These are tested as well, for comparison purposes.

Because of the large volume of test data, each DB engine's results are presented on a separate page:

LevelDB

Basho LevelDB

HyperLevelDB

LMDB

RocksDB

TokuDB

WiredTiger Btree

WiredTiger LSM

Conclusion

Adding compression to a database engine throws another wrench into the space/time tradeoff considerations. For pure in-memory workloads it generally won't make sense: these workloads demand maximum speed above all else, and a compressor will usually just add complexity and slow down DB accesses. There may be cases where compression shrinks the data enough to yield better cache utilization, but these are rare.

Compression makes the most sense for larger-than-memory workloads: it allows more records to be held in memory at once, and the reduced size can speed up I/O operations. (In aggregate, the transfer time per original uncompressed byte decreases.)
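The transfer-time argument can be made concrete with a back-of-the-envelope model. The sketch below is not part of the benchmark: the function name, the assumption that reading and decompressing happen serially, and every number in it are illustrative assumptions only.

```python
def seconds_per_original_byte(disk_bw, ratio=1.0, decompress_bw=None):
    """Time to deliver one original (uncompressed) byte to the application.

    disk_bw       -- device read bandwidth, in bytes/second
    ratio         -- compressed size / original size (1.0 means uncompressed)
    decompress_bw -- decompressor output rate in original bytes/second,
                     or None when no decompression is needed
    Assumes reads and decompression happen serially (no overlap).
    """
    transfer = ratio / disk_bw
    cpu = 0.0 if decompress_bw is None else 1.0 / decompress_bw
    return transfer + cpu


# Hypothetical figures, chosen only for illustration.
DISK_BW = 150e6        # 150 MB/s device
RATIO = 0.5            # data compresses to half its size
DECOMPRESS_BW = 1e9    # 1 GB/s decompressor

plain = seconds_per_original_byte(DISK_BW)
packed = seconds_per_original_byte(DISK_BW, RATIO, DECOMPRESS_BW)
print(f"uncompressed: {plain * 1e9:.2f} ns/byte")
print(f"compressed:   {packed * 1e9:.2f} ns/byte")
```

Under this model compression pays off whenever ratio/disk_bw + 1/decompress_bw is smaller than 1/disk_bw, so a slow decompressor or poorly compressible data can erase the gain.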

Back in the 1980s, when Howard was developing file compression algorithms, compression was all about saving storage space - cramming as much as possible into archives stored on floppy disks. These days, with hard drives going for around $30/TB, space isn't much of a concern any more; it's all about transfer speed - from memory to disk, or across a network link.

The small in-memory tests conducted here give an idea of the absolute maximum speed of the respective compressors, but they don't really reflect how disk-bound workloads will be affected. That will come in a future benchmarking effort.
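For readers who want a rough feel for what "maximum speed of a compressor" means, the following is a minimal sketch of an in-memory measurement. It is not the benchmark driver used for these tests: it only covers the three codecs that ship in the Python standard library (zlib, bz2, lzma), the `measure` helper is a hypothetical name, and the sample data is synthetic and far more repetitive than real records are likely to be.

```python
import bz2
import lzma
import time
import zlib

# Only compressors shipped in the Python standard library are used here.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def measure(name, data):
    """Return (ratio, compress MB/s, decompress MB/s) for one codec."""
    compress, decompress = CODECS[name]
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    restored = decompress(packed)
    t2 = time.perf_counter()
    assert restored == data, "round trip must be lossless"
    mb = len(data) / 1e6
    # Guard against a zero interval on very coarse timers.
    c_rate = mb / max(t1 - t0, 1e-9)
    d_rate = mb / max(t2 - t1, 1e-9)
    return len(packed) / len(data), c_rate, d_rate


# Hypothetical sample: highly repetitive, so ratios here are optimistic.
sample = b"key0001 value-payload-abcdefghijklmnopqrstuvwxyz\n" * 20000

for codec in CODECS:
    ratio, c_rate, d_rate = measure(codec, sample)
    print(f"{codec:5s} ratio={ratio:.3f} "
          f"comp={c_rate:8.1f} MB/s decomp={d_rate:8.1f} MB/s")
```

A single timed pass like this is noisy; repeating each measurement and using realistic record data would be needed before drawing any conclusions from the numbers.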

Files

The files used to perform these tests are all available for download.

The source code for the benchmark drivers is all on GitHub. We invite you to run these tests yourself and report your results back to us.

The software versions we used:

violino:/home/software/leveldb> g++ --version
g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

violino:/home/software/leveldb> git log -1 --pretty=format:"%H %ci" master
e353fbc7ea81f12a5694991b708f8f45343594b1 2014-05-01 13:44:03 -0700

violino:/home/software/basho_leveldb> git log -1 --pretty=format:"%H %ci" develop
d1a95db0418d4e17223504849b9823bba160dfaa 2014-08-21 15:41:50 -0400

violino:/home/software/db-5.3.21> ls -l README 
-rw-r--r-- 1 hyc hyc 234 May 11  2012 README

violino:/home/software/HyperLevelDB> git log -1 --pretty=format:"%H %ci" master
02ad33ccecc762fc611cc47b26a51bf8e023b92e 2014-08-20 16:44:03 -0400

violino:~/OD/mdb> git log -1 --pretty=format:"%H %ci"
a054a194e8a0aadfac138fa441c8f67f5d7caa35 2014-08-24 21:18:03 +0100

violino:/home/software/rocksdb> git log -1 --pretty=format:"%H %ci"
7e9f28cb232248b58f22545733169137a907a97f 2014-08-29 21:21:49 -0700

violino:/home/software/ft-index> git log -1 --pretty=format:"%H %ci" master
f17aaee73d14948962cc5dea7713d95800399e65 2014-08-30 06:35:59 -0400

violino:/home/software/wiredtiger> git log -1 --pretty=format:"%H %ci"
1831ce607baf61939ddede382ee27e193fa1bbef 2014-08-14 12:31:38 +1000