Symas Corp., June 2014
This selection of engines is interesting because they're all closely related in one way or another: obviously the LevelDB databases are all of a breed, all using Log Structured Merge (LSM) trees as their main data store. LMDB and BerkeleyDB are both Btree engines. WiredTiger comes from the people who created BerkeleyDB, and it offers both LSM and Btree engines. Both LMDB and TokuDB's APIs are based on the BerkeleyDB API. TokuDB is unique here, being the only engine implementing Fractal Trees. (And since Fractal Trees are patented, they will probably remain the sole implementation for some time to come...)
The RocksDB test focuses on multithreaded read performance for a purely in-memory database. As such, none of the tests shown here involve any disk I/O and the data sets are chosen to ensure they fit entirely in the test machine's RAM.
Tests were conducted on two different test environments: one an Asus NP56D laptop with 16GB of RAM and a quad-core AMD A10-4600M APU, the other an HP DL585 G5 server with 128GB of RAM and four quad-core AMD Opteron 8354 CPUs.
Note: as with the microbench reports we published earlier, we did not design the test scenarios. The LevelDB authors designed the original microbench scenario, and the RocksDB authors designed this one. We have received criticism for running tests that are artificially biased to show LMDB in its best light. Such statements are nonsense; we didn't design the tests. If you have issues with how the tests were designed, take it up with the LevelDB or RocksDB authors, respectively.
1. Footprint
One of the primary reasons to use an embedded database is because one needs
something lightweight with a small application footprint. Here's how the programs
in this test stack up, using identical driver code and with their DB libraries
statically linked into the binaries to show the full DB code size. (Note that this
still fails to take into account other system libraries that some engines use
that others don't. E.g., Basho and RocksDB also require librt, the realtime
support library.) We've also listed each projects' size in lines of source code,
as reported by Ohloh.net.
Output of size db_bench* (sizes in bytes), with each project's lines of code in the last column:

text | data | bss | dec | hex | filename | Lines of Code
---|---|---|---|---|---|---
285306 | 1516 | 352 | 287174 | 461c6 | db_bench | 39758
384206 | 9304 | 3488 | 396998 | 60ec6 | db_bench_basho | 26577
1688853 | 2416 | 312 | 1691581 | 19cfbd | db_bench_bdb | 1746106
315491 | 1596 | 360 | 317447 | 4d807 | db_bench_hyper | 21498
121412 | 1644 | 320 | 123376 | 1e1f0 | db_bench_mdb | 7955
1014534 | 2912 | 6688 | 1024134 | fa086 | db_bench_rocksdb | 81169
992334 | 3720 | 30352 | 1026406 | fa966 | db_bench_tokudb | 227698
853216 | 2100 | 1920 | 857236 | d1494 | db_bench_wiredtiger | 91410
LMDB is still the smallest by far.
2. Small Data Set
Using the laptop we generate a database with 20 million records. The records have
16 byte keys and 100 byte values so the resulting database should be about 2.2GB
in size. After the data is loaded a "readwhilewriting" test is run using 4 reader
threads and one writer. All of the threads operate on randomly selected records
in the database. The writer performs updates to existing records; no records are
added or deleted so the DB size should not change much during the test.
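To make the access pattern concrete, here is a minimal sketch of a readwhilewriting-style workload written against LMDB's C API. This is not the actual db_bench driver: key formatting details, rate limiting, timing, and statistics are omitted, and the record count, map size, and database path are illustrative assumptions. It only shows the shape of the read and write paths being measured.

```c
/* Hypothetical sketch of the "readwhilewriting" pattern: 4 readers + 1 writer
 * doing random point operations on pre-loaded records. Error checks omitted. */
#include <lmdb.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NUM_RECORDS    20000000L  /* 20M records, as in this data set */
#define VALUE_SIZE     100        /* 100-byte values */
#define OPS_PER_THREAD 1000000L   /* arbitrary; the real test runs for a fixed time */

static MDB_env *env;
static MDB_dbi dbi;

static void make_key(char *buf, long n) {
    snprintf(buf, 17, "%016ld", n);              /* 16-byte keys */
}

static void *reader(void *arg) {
    unsigned seed = (unsigned)(size_t)arg;
    MDB_txn *txn;
    MDB_val key, data;
    char kbuf[17];
    for (long i = 0; i < OPS_PER_THREAD; i++) {
        make_key(kbuf, rand_r(&seed) % NUM_RECORDS);
        key.mv_size = 16; key.mv_data = kbuf;
        mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);  /* read-only snapshot */
        mdb_get(txn, dbi, &key, &data);              /* random point read */
        mdb_txn_abort(txn);                          /* readers never write */
    }
    return NULL;
}

static void *writer(void *arg) {
    unsigned seed = 12345;
    MDB_txn *txn;
    MDB_val key, data;
    char kbuf[17], vbuf[VALUE_SIZE];
    memset(vbuf, 'x', sizeof(vbuf));
    for (long i = 0; i < OPS_PER_THREAD; i++) {
        make_key(kbuf, rand_r(&seed) % NUM_RECORDS);
        key.mv_size = 16;          key.mv_data = kbuf;
        data.mv_size = VALUE_SIZE; data.mv_data = vbuf;
        mdb_txn_begin(env, NULL, 0, &txn);
        mdb_put(txn, dbi, &key, &data, 0);  /* overwrite an existing record, */
        mdb_txn_commit(txn);                /* so the DB size stays roughly flat */
    }
    return NULL;
}

int main(void) {
    pthread_t t[5];
    MDB_txn *txn;
    mdb_env_create(&env);
    mdb_env_set_mapsize(env, 4UL * 1024 * 1024 * 1024);  /* room for ~2.2GB of data */
    mdb_env_open(env, "./testdb", 0, 0664);  /* assumes ./testdb exists, pre-loaded */
    mdb_txn_begin(env, NULL, 0, &txn);
    mdb_dbi_open(txn, NULL, 0, &dbi);
    mdb_txn_commit(txn);
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, reader, (void *)(size_t)(i + 1));
    pthread_create(&t[4], NULL, writer, NULL);
    for (int i = 0; i < 5; i++)
        pthread_join(t[i], NULL);
    mdb_env_close(env);
    return 0;
}
```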
The tests in this section and in Section 3 are all run on a tmpfs, just like the RocksDB report. I.e., all of the data is stored only in RAM. Additional tests using an SSD follow in Section 4.
The pertinent results are tabulated here and expanded on in the following sections.
Engine | Load Wall | Load User | Load Sys | Overhead | Load Size (KB) | Writes/Sec | Reads/Sec | Run Wall | Run User | Run Sys | Final Size (KB) | CPU% | Process Size (KB)
---|---|---|---|---|---|---|---|---|---|---|---|---|---
LevelDB | 00:34.70 | 00:44.72 | 00:06.70 | 1.4818443804 | 2246004 | 10232 | 26678 | 00:49:58.73 | 01:31:48.62 | 00:52:50.95 | 3452388 | 289% | 2138508 |
Basho | 00:40.41 | 01:24.39 | 00:17.82 | 2.5293244246 | 2368768 | 10232 | 68418 | 00:19:32.94 | 01:14:10.04 | 00:01:19.19 | 2612436 | 386% | 6775376 |
BerkeleyDB | 02:12.61 | 01:58.92 | 00:13.57 | 0.9990950909 | 5844376 | 9565 | 86202 | 00:15:28.44 | 00:42:07.97 | 00:17:27.49 | 5839912 | 385% | 3040716 |
Hyper | 00:38.78 | 00:49.88 | 00:06.43 | 1.4520371325 | 2246448 | 10208 | 138393 | 00:09:38.39 | 00:35:06.12 | 00:02:06.18 | 2292632 | 385% | 2700088 |
LMDB | 00:10.55 | 00:08.15 | 00:02.37 | 0.9971563981 | 2516192 | 10224 | 1449709 | 00:00:55.46 | 00:03:37.63 | 00:00:01.67 | 2547968 | 395% | 2550408 |
RocksDB | 00:21.54 | 00:34.70 | 00:05.99 | 1.8890436397 | 2256032 | 10233 | 91544 | 00:14:37.74 | 00:54:06.84 | 00:02:38.04 | 3181764 | 387% | 6713852 |
TokuDB | 01:45.12 | 01:41.58 | 00:47.37 | 1.4169520548 | 2726168 | 9881 | 109682 | 00:12:12.91 | 00:37:41.45 | 00:07:10.03 | 3920784 | 367% | 5429056 |
WiredLSM | 01:10.93 | 02:35.55 | 00:18.62 | 2.4555195263 | 2492440 | 10230 | 179617 | 00:07:26.24 | 00:28:55.85 | 00:00:07.76 | 2948988 | 390% | 3205396 |
WiredBtree | 00:17.79 | 00:15.68 | 00:02.09 | 0.9988757729 | 2381876 | 10021 | 752078 | 00:01:53.46 | 00:06:36.98 | 00:00:14.78 | 4752568 | 362% | 3415468 |
The "Overhead" column is the ratio of adding the User and System time together, then dividing by the Wall time. It is measured against the right-side Y-axis on this graph. This shows how much work of the DB load occurred in background threads. Ideally this value should be 1, all foreground and no background work. When a DB engine relies heavily on background processing to achieve its throughput, it will bog down more noticeably when the system gets busy. I.e., if the system is already busy doing work on behalf of users, there will not be any idle system resources available for background processing.
Here the 3 Btree engines all have an Overhead of 1.0 - they require no background processing to perform the data load. In contrast, all of the LSM engines require significant amounts of processing to perform ongoing compaction of their data.
This graph shows the load performance as throughput over time:
It makes the difference in performance between the DB engines much more obvious.
BerkeleyDB clearly adheres to the "slow and steady" principle; its throughput
is basically constant. Basho shows wildly erratic throughput. The others are
all fairly consistent at this small data volume.
The WiredTiger Btree delivers impressive read throughput, earning it a solid second place in the results. None of the other engines are even within an order of magnitude of LMDB's read performance. Graphs with a detailed breakdown of the per-thread throughput are available on the Details page.
Most of the DB engines (except LMDB) have significant space overhead, and unfortunately this graph doesn't even capture the full extent of it: during the run, various log files grow and are truncated, so the actual space used may be even larger than shown here.
LMDB uses no log files - the space reported here is all the space that it uses.
The Process Size column shows the maximum size the test program grew to while running the test. This is another major concern when trying to determine how much system capacity is needed to support a given workload. In this case, all of the engines (except LMDB) require memory for application caching. They were all set to run with 6GB of cache. Both Basho and RocksDB would have used more memory if available; in an earlier run with the cache set to 8GB they both grew past 9GB and caused the machine to start swapping. While most people believe "more cache is better," the fact is that trying to use too much will hurt performance. As always, it takes careful testing and observation to choose a workable cache size for a given workload.
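For context, the 6GB cache mentioned above is the kind of setting the LevelDB-family engines expose through their options API (in the benchmark it is set through the driver's cache-size option). The following is a rough illustration using LevelDB's public C API, with an assumed database path; it is not the test's driver code, and each of the other engines has its own equivalent knob:

```c
#include <leveldb/c.h>
#include <stdio.h>

int main(void) {
    char *err = NULL;
    leveldb_options_t *opts = leveldb_options_create();
    /* Application-level block cache: 6GB, as used in this test run. */
    leveldb_cache_t *cache = leveldb_cache_create_lru(6UL * 1024 * 1024 * 1024);
    leveldb_options_set_cache(opts, cache);
    leveldb_options_set_create_if_missing(opts, 1);

    leveldb_t *db = leveldb_open(opts, "./testdb-leveldb", &err);  /* path is an example */
    if (err != NULL) {
        fprintf(stderr, "open failed: %s\n", err);
        leveldb_free(err);
        return 1;
    }

    /* ... run the workload ... */

    leveldb_close(db);
    leveldb_cache_destroy(cache);
    leveldb_options_destroy(opts);
    return 0;
}
```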
It's not clear to me why any DB engine would need more than 6GB of memory to
manage 2GB of actual data. With LMDB's Single-Level-Store design, cache size is a
non-issue and the engine can never drive a system into swapping. There's
no wasted overhead - all of the memory in the system gets applied to your
actual application, so you can get more work done on a given hardware
configuration than with any other database.
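By contrast with the cache-configured engines above, here is a minimal sketch of the corresponding LMDB setup (the path and size are arbitrary examples): there is no cache parameter to size at all. The only sizing decision is the map size, which is a capacity limit on the database, not a memory reservation.

```c
#include <lmdb.h>

/* LMDB setup sketch: no application cache to configure. The map size only
 * caps how large the database may grow; caching is left to the OS page cache. */
int open_lmdb(MDB_env **env, const char *path) {
    int rc = mdb_env_create(env);
    if (rc) return rc;
    rc = mdb_env_set_mapsize(*env, 16UL * 1024 * 1024 * 1024);  /* 16GB cap, arbitrary */
    if (rc) return rc;
    return mdb_env_open(*env, path, 0, 0664);  /* path must be an existing directory */
}

int main(void) {
    MDB_env *env;
    if (open_lmdb(&env, "./testdb-lmdb") != 0) return 1;
    /* ... run the workload: no further memory tuning needed ... */
    mdb_env_close(env);
    return 0;
}
```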
3. Larger Data Set
These tests use 100 million records and are run on the 16-core server.
Aside from the data set size, things are much the same. Here are the tabular results:
Engine | Load Wall | Load User | Load Sys | Overhead | Load Size (KB) | Writes/Sec | Reads/Sec | Run Wall | Run User | Run Sys | Final Size (KB) | CPU% | Process Size (KB)
---|---|---|---|---|---|---|---|---|---|---|---|---|---
LevelDB | 03:06.75 | 04:41.26 | 00:42.87 | 1.7356358768 | 11273396 | 9184 | 7594 | 01:00:02.00 | 01:22:11.46 | 01:52:10.46 | 13734168 | 323% | 3284192 |
Basho | 04:22.96 | 11:09.24 | 02:18.93 | 3.0733571646 | 11449492 | 10211 | 80135 | 01:00:23.00 | 14:32:23.67 | 00:11:49.40 | 13841220 | 1464% | 19257796 |
BerkeleyDB | 14:59.45 | 13:34.30 | 01:25.15 | 1 | 28381956 | 3378 | 55066 | 01:00:02.00 | 03:02:00.69 | 12:42:39.63 | 28387880 | 1573% | 14756768 |
Hyper | 03:43.61 | 05:41.14 | 00:39.02 | 1.7001028577 | 11280092 | 10231 | 11673 | 01:00:04.00 | 01:59:42.09 | 01:53:24.27 | 15149416 | 387% | 6332460 |
LMDB | 01:04.15 | 00:52.31 | 00:11.82 | 0.9996882307 | 12605332 | 10230 | 2486800 | 00:11:14.14 | 02:47:58.57 | 00:00:10.06 | 12627692 | 1598% | 12605788 |
RocksDB | 02:28.66 | 03:59.92 | 00:30.97 | 1.8222117584 | 11289688 | 10232 | 129397 | 01:00:22.00 | 12:08:05.94 | 02:51:58.54 | 12777708 | 1490% | 18599544 |
TokuDB | 07:44.10 | 09:17.31 | 02:54.82 | 1.5775263952 | 12665136 | 4601 | 70208 | 01:00:15.00 | 03:02:37.44 | 11:21:45.00 | 15328956 | 1434% | 23315964 |
WiredLSM | 07:10.50 | 19:25.80 | 02:31.10 | 3.0590011614 | 12254620 | 10194 | 278415 | 01:00:05.00 | 15:51:04.17 | 00:02:09.76 | 16016296 | 1586% | 17723992 |
WiredBtree | 02:07.49 | 01:49.52 | 00:17.97 | 1 | 11932620 | 10145 | 1320939 | 00:20:58.10 | 05:06:13.60 | 00:05:14.87 | 23865368 | 1560% | 20743232 |
This graph shows the load performance as throughput over time:
It's a bit more revealing than the 20M test. BerkeleyDB continues to deliver
its rock-steady throughput. LevelDB and Basho show the infamous negative spikes
in throughput caused by periodic compaction, although Basho's later performance
seems even more pathological than usual. RocksDB shows linearly decaying
throughput with data volume. LevelDB and HyperLevelDB show asymptotically
decaying throughput.
(Note: in the raw output you'll see that we also ran LMDB and WiredTiger Btree for an hour, like all the others. Just to show that nothing self-destructs over time. LMDB can handle whatever workload you throw at it, non-stop.)
The significance of a Single-Level-Store design cannot be overstated - when working with an in-memory workload, every redundant byte means one less byte for useful data. One may easily find that workloads that are too large to operate in-memory with other DB engines fit smoothly into RAM using LMDB. Moreover, LMDB delivers in-memory performance without requiring the use of volatile storage (like tmpfs), so there's no need to worry about migrating to a different storage engine as the data sizes grow.
The space used is a major concern even in this testing environment. The RocksDB test used a server with 144GB of RAM for 500 million records, which should have consumed about 60GB. With runtime overheads, they ended up at around 75GB total. On our server with 128GB of RAM, the tmpfs will only hold 64GB. It's just barely large enough for LMDB to duplicate the 500M record test, but none of the other DB engines will fit. This is unfortunate, because we can't get a directly comparable (500 million record) result on the current setup. But our 64 core server with 512GB of RAM should be getting freed of VM hosting duty soon, so we'll be running some additional tests on that box in the near future.
In the original RocksDB test, they ran RocksDB with its write-ahead log enabled, to persist the data onto a disk while operating on tmpfs. We were unable to run the test this way because of a bug in the current RocksDB code. With LMDB such measures are unnecessary anyway, since LMDB can operate on top of a regular filesystem instead of requiring tmpfs.
4. Small Set on Disk
Since using the tmpfs basically eats 50% of the RAM in the system, it puts a severe
constraint on how large a data set can be managed. Also, this is an unrealistic way
to use a database since none of the data will actually be persisted to real storage.
Given how much time it takes some of these engines to load the databases, one would
not want to have to reload the full contents before every run.
As such, we also decided to test on a regular filesystem. We are still using a data
set smaller than the size of RAM,
but since we're not sacrificing 50% of RAM for tmpfs, the data set can be larger
than in the prior test. On the laptop we use 50M records, so around 6GB of data.
The test uses a 512GB Samsung 830 SSD with an ext4 partition.
The actual drive characteristics should not matter because the test datasets still fit entirely in RAM and are all using asynchronous writes. I.e., there should still not be any I/O occurring, and no need for the test programs to wait for any writes to complete. Here are the tabular results:
Engine | Load Wall | Load User | Load Sys | Overhead | Load Size (KB) | Writes/Sec | Reads/Sec | Run Wall | Run User | Run Sys | Final Size (KB) | CPU% | Process Size (KB)
---|---|---|---|---|---|---|---|---|---|---|---|---|---
LevelDB | 00:01:57.22 | 00:02:17.64 | 00:00:35.51 | 1.4771370073 | 5974496 | 10228 | 151800 | 00:23:11.69 | 01:20:31.83 | 00:07:01.87 | 6226672 | 377% | 9213436 |
Basho | 00:02:38.02 | 00:04:15.20 | 00:01:25.90 | 2.1585875206 | 6017488 | 10229 | 8773 | 02:02:03.00 | 05:12:44.05 | 02:12:08.07 | 8465664 | 364% | 8234040 |
BerkeleyDB | 00:08:46.95 | 00:05:26.20 | 00:01:10.82 | 0.7534301167 | 13700356 | 7479 | 70443 | 00:49:44.61 | 01:57:17.31 | 00:51:02.52 | 13732924 | 338% | 6626740 |
Hyper | 00:02:07.81 | 00:02:32.24 | 00:00:28.51 | 1.4142085909 | 5966828 | 10225 | 194245 | 00:18:04.71 | 01:08:37.70 | 00:02:12.70 | 7997988 | 391% | 8269804 |
LMDB | 00:00:31.79 | 00:00:22.14 | 00:00:09.59 | 0.998112614 | 6595848 | 10234 | 1374886 | 00:02:33.24 | 00:09:59.38 | 00:00:07.25 | 6627556 | 395% | 6630132 |
RocksDB | 00:00:38.44 | 00:00:41.59 | 00:00:24.19 | 1.7112382934 | 6395296 | 10230 | 127147 | 00:27:30.99 | 01:45:55.57 | 00:03:01.70 | 7076928 | 395% | 10040740 |
RocksDBpfx | 00:05:24.55 | 00:05:50.76 | 00:00:24.54 | 1.1563703594 | 6398448 | 10233 | 424119 | 00:08:17.61 | 00:26:57.26 | 00:05:14.13 | 6875504 | 388% | 9426456 |
TokuDB | 00:04:35.62 | 00:04:19.69 | 00:01:59.10 | 1.3743197156 | 8051428 | 174 | 43529 | 01:20:20.00 | 04:17:36.94 | 00:46:30.35 | 7016752 | 378% | 8120724 |
WiredLSM | 00:03:11.62 | 00:07:15.99 | 00:01:28.38 | 2.7365097589 | 6337132 | 10219 | 135590 | 00:25:48.88 | 01:39:58.05 | 00:00:58.35 | 8184716 | 391% | 8948796 |
WiredBtree | 00:01:15.99 | 00:00:42.43 | 00:00:12.27 | 0.7198315568 | 6243828 | 9948 | 238957 | 00:15:05.25 | 00:39:29.28 | 00:08:36.82 | 12487884 | 318% | 9396320 |
There are other significant differences to point out in this test run. In the prior tests, each engine was basically run with default settings; the only non-default was the cache size. For this test we adopted the tuning options that RocksDB used in their report for the LevelDB-related engines. We also set LMDB to use its writable mmap option instead of the default read-only mmap.
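For reference, LMDB's writable mmap option mentioned above is the MDB_WRITEMAP environment flag. The sketch below shows how it is passed at environment-open time; pairing it with MDB_NOSYNC is one way to get asynchronous (non-durable) writes, though the exact flag combination the benchmark driver uses should be taken from the published command scripts, not from this example:

```c
#include <lmdb.h>
#include <stddef.h>

/* Open an LMDB environment with a writable memory map. MDB_WRITEMAP writes
 * updates directly into the map; MDB_NOSYNC (assumed here) skips the flush on
 * commit, matching an "asynchronous writes" style of operation. */
int open_writemap_env(MDB_env **env, const char *path, size_t mapsize) {
    int rc = mdb_env_create(env);
    if (rc) return rc;
    rc = mdb_env_set_mapsize(*env, mapsize);
    if (rc) return rc;
    return mdb_env_open(*env, path, MDB_WRITEMAP | MDB_NOSYNC, 0664);
}
```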
For TokuDB, the test always crashed from running out of memory when configured with a 6GB cache (even though there were still a couple of GB of RAM free on the machine), so we had to pare its cache back to 4GB to get a complete run.
The Basho test was manually terminated at the 2 hour mark. It would have taken at least 6 or 7 more hours to complete on its own.
Update - 2014-06-16: Due to a transcription error when copying the RocksDB parameters from their site, we were still using the default memtable representation in the previous runs. We have re-run this particular test using "--key_size=16 --prefix_size=16 --keys_per_prefix=0 --memtablerep=prefix_hash" and added the result as "RocksDBpfx" to the table above and in the new charts below. Sorry for the mistake.
This graph shows the load performance as throughput over time:
With the large Write Buffer Size set, all of the LevelDB-based engines turn in
closer-to-linear throughput, but Basho still shows a steady decline, and all of
them are still quite erratic.
This test highlights another major problem with so many of these engines - they all require complex tuning to get decent performance out of them. The tuning complexity of BerkeleyDB was one of the main issues that prompted us to write LMDB in the first place. Tuning RocksDB for this test requires explicit setting of 40-some parameters, as seen in the command scripts. This is an unreasonable demand on end-users, and indeed there's an open bug report for RocksDB on this very issue.
One of the many lessons we learned from 15+ years of working with BerkeleyDB is that adding code to address performance issues only makes things slower overall, harder to use, and harder to maintain. The key to good performance is writing less code, not more. Quality is always more important and more effective than quantity. This is why we put the Footprint overview right up front, in Section 1 - if you scan back through this report you can easily see how code size correlates with performance.
TokuDB's performance seems to really suffer from the 2GB reduction in its cache size. Unfortunately, there was no way to give it any more memory.
None of the other engines are anywhere close to LMDB's read rate. This result demonstrates that, as we've said before, LMDB delivers the read performance of a pure-memory database, while still operating as a persistent data store.
See the Details page for a detailed analysis of each engine's performance in this test.
All of the DB engines besides LMDB bumped into the limits of the memory on the machine. In contrast, LMDB would easily handle twice as much data and still leave a few GB of RAM to spare, and continue to perform at top speed.
In a private conversation, a Tokutek engineer admonished me "you have to give TokuDB at least 50% of RAM, otherwise it's not fair to compare it to an mmap'd database that can use as much RAM as it wants."
Here's the reality - LMDB uses less RAM than every other DB engine to get its work done. If your engine needs 4x as much RAM to do its work, then your engine is inherently limited to doing 4x less useful work on any given machine. Why should anyone waste their time and money on a system like that?
With LMDB you get your work done with no added overhead. LMDB stores just the data you asked it to store, with no logging or other cruft, so you get the most use out of your available RAM and disk space. LMDB uses the minimum amount of CPU to store and retrieve your data, leaving the rest for your applications to actually get work done. (And leaving more power in your battery, on mobile devices.) No other DB engine comes anywhere close.
5. Further Testing
Check out our even larger test which really drives the
point home - results for one billion records.
A new test has also been added to show scaling with the number of reader
threads.
The software versions we used:
```
violino:/home/software/leveldb> g++ --version
g++ (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

violino:/home/software/leveldb> git log -1 --pretty=format:"%H %ci" master
e353fbc7ea81f12a5694991b708f8f45343594b1 2014-05-01 13:44:03 -0700

violino:/home/software/basho_leveldb> git log -1 --pretty=format:"%H %ci" develop
16b22c8198975b62a938dff9910f4432772d253a 2014-06-06 12:25:40 -0400

violino:/home/software/db-5.3.21> ls -l README
-rw-r--r-- 1 hyc hyc 234 May 11 2012 README

violino:/home/software/HyperLevelDB> git log -1 --pretty=format:"%H %ci" releases/1.0
a7a707e303ec1953d08cbc586312ac7b2988eebb 2014-02-10 09:43:03 -0500

violino:~/OD/mdb> git log -1 --pretty=format:"%H %ci"
a93810cc3d1a062bf5edbe9c14795d0360cda8a4 2014-05-30 23:39:44 -0700

violino:/home/software/rocksdb> git log -1 --pretty=format:"%H %ci"
0365eaf12e9e896ea5902fb3bf3db5e6da275d2e 2014-06-06 18:27:44 -0700

violino:/home/software/ft-index> git log -1 --pretty=format:"%H %ci" master
f51c7180db1eafdd9e6efb915c396d894c2d0ab1 2014-05-30 12:58:28 -0400

violino:/home/software/wiredtiger> git log -1 --pretty=format:"%H %ci"
91da74e5946c409b8e05c53927a7b447129a6933 2014-05-21 17:05:08 +1000
```

All of the engines were built with compression disabled; compression was not used in the RocksDB test either. Some of these engines recommend or require a non-standard malloc library such as Google tcmalloc or jemalloc. To ensure as uniform a test as possible, all of the engines in this test were built to use the standard libc malloc.
Tests comparing tcmalloc and jemalloc are available in the malloc microbench report. Tests comparing different compression mechanisms are available in the compressor microbench report.