Intel Optane SSD Microbenchmark

Symas Corp., August 2018


Following on our previous LMDB benchmarks, we recently had an opportunity to test on some Intel Optane hardware. In this set of tests we're using LMDB and RocksDB. (Other variants of LevelDB don't support ACID transactions, so they're not in the same class of DB engine anyway.) Also, instead of reusing the LevelDB benchmark code that we ported to other DB engines, we're now using our own functionally equivalent rewrite of the benchmarks in plain C.

Since the point of these tests is to explore the performance of the Optane SSDs, the tests are configured much like the previous ondisk benchmark, using a database approximately 5x larger than RAM, to minimize the impact of caching in RAM and force the storage devices to be exercised. There is a twist, though: the Optane SSDs on NVMe can also be operated as if they were system RAM. Optane technology still has higher latency than DRAM, but as we'll see, there is a performance benefit to using this mode.

The hardware for these tests was graciously provided by our friends at Packet and system support was provided by Intel. The machine was based on an Intel S2600WFT motherboard with a pair of 16 core/32 thread Intel Xeon Gold 6142 processors and 192GB DDR4-2666 DRAM. Storage being tested included a 4 TB DC P4500 TLC NAND-Flash SSD and three 750GB DC P4800X Optane SSDs. The machine had Ubuntu 16.04 installed, with a 4.13.0-41-generic kernel. The software versions being used are LMDB 0.9.70 and RocksDB 5.7.3, both compiled from their respective git repos. (Note that LMDB 0.9.70 is the revision in the mdb.master branch, not an officially released version. The main difference is the addition of support for raw devices.)

Test Overview

Prior tests have already illustrated how performance varies with record sizes. In these tests we're strictly interested in the relative performance across the different storage types so we're only testing with a single record size. We're using the ext4 filesystem in these tests, configured once with journaling enabled and once with journaling disabled. Each test begins by loading the data onto a freshly formatted filesystem. We use a 750GB partition on the 4TB Flash SSD, to ensure that the filesystem metadata overhead is identical on the Flash and Optane filesystems. Additionally, we test LMDB on raw block devices, with no filesystem at all, to explore how much overhead the filesystems impose. RocksDB doesn't support running on raw block devices, so it is omitted from those tests.
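
For illustration, opening an LMDB environment on a raw block device looks essentially the same as opening one on a filesystem. The sketch below assumes the raw-device support in mdb.master accepts a block device path like an ordinary single-file (MDB_NOSUBDIR) environment; the device path and map size are placeholders, and the exact flags required by the raw-device code may differ, so treat this as illustrative rather than as the actual test driver.

    #include <stdio.h>
    #include <lmdb.h>

    int main(void)
    {
        MDB_env *env;
        int rc;

        rc = mdb_env_create(&env);
        if (rc) { fprintf(stderr, "mdb_env_create: %s\n", mdb_strerror(rc)); return 1; }

        /* The map must be large enough for the whole DB; ~750GB here to
         * match the partition/device size used in these tests. */
        rc = mdb_env_set_mapsize(env, 750ULL * 1024 * 1024 * 1024);
        if (rc) { fprintf(stderr, "mdb_env_set_mapsize: %s\n", mdb_strerror(rc)); return 1; }

        /* With a filesystem this would be a path like "/mnt/test/data.mdb";
         * with raw-device support it can be the block device itself
         * (the device name here is hypothetical). */
        rc = mdb_env_open(env, "/dev/nvme1n1", MDB_NOSUBDIR, 0664);
        if (rc) { fprintf(stderr, "mdb_env_open: %s\n", mdb_strerror(rc)); return 1; }

        /* ... load the DB and run the workload ... */

        mdb_env_close(env);
        return 0;
    }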

The test is run using 80 million records with 16 byte keys and 4000 byte values, for a target DB size of around 300GB. The system is configured so that only 64GB of RAM is available during the test run. After the data is loaded, a readwhilewriting test is run multiple times in succession, with the number of reader threads set to 1, 2, 4, 8, 16, 32, and 64 for successive runs. (There is always only a single writer.) All of the threads operate on randomly selected records in the database. The writer performs updates to existing records; no records are added or deleted, so the DB size should not change much during the test. The results are detailed in the following sections.
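
To make the workload concrete, here is a minimal sketch of the writer side of the update loop, written against the LMDB C API. It assumes the zero-padded decimal key format used by the original LevelDB benchmarks and a constant fill value; the actual driver code is on GitHub and differs in its details, and error checks are omitted here for brevity.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <lmdb.h>

    #define NUM_RECORDS 80000000UL   /* 80 million records */
    #define KEY_SIZE    16           /* 16 byte keys */
    #define VALUE_SIZE  4000         /* 4000 byte values */

    /* Overwrite randomly chosen existing records, one per write transaction. */
    static void writer_loop(MDB_env *env, MDB_dbi dbi, long num_updates)
    {
        char keybuf[KEY_SIZE + 1];
        char *valbuf = malloc(VALUE_SIZE);
        MDB_txn *txn;
        MDB_val key, val;
        long i;

        memset(valbuf, 'x', VALUE_SIZE);  /* stand-in for generated value data */

        for (i = 0; i < num_updates; i++) {
            unsigned long recno = random() % NUM_RECORDS;
            snprintf(keybuf, sizeof(keybuf), "%016lu", recno);

            key.mv_size = KEY_SIZE;   key.mv_data = keybuf;
            val.mv_size = VALUE_SIZE; val.mv_data = valbuf;

            mdb_txn_begin(env, NULL, 0, &txn);
            mdb_put(txn, dbi, &key, &val, 0);
            mdb_txn_commit(txn);      /* each commit issues write() calls */
        }
        free(valbuf);
    }

The env and dbi handles come from the usual mdb_env_create / mdb_env_open / mdb_dbi_open setup; the reader threads generate keys the same way and fetch them with mdb_get inside read-only transactions.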

Loading the DB

Here are the stats collected from initially loading the DB for the various storage configurations.

LMDB

Storage       Journal  Wall         User      Sys       CPU %  DB Size (KB)  Vol CtxSw  Invol CtxSw  FS In      FS Out      Write Amp
Flash/Ext4    Y        11:50.91     01:15.70  09:40.36  92     322683976     5910595    1303         2640       840839736   10.5104967
Flash/Ext4    N        13:21.04     01:16.69  11:01.86  92     322683976     8086767    1241         3696       946659568   11.8332446
Flash (raw)   N        17:25.23     03:29.26  04:11.36  44     -             80669411   1346         645369800  645487344   8.0685918
Optane/Ext4   Y        14:20.99     01:12.78  12:09.88  93     322683976     9991458    1170         552        928896808   11.6112101
Optane/Ext4   N        15:11.10     01:16.72  12:49.09  92     322683976     10487638   1377         1080       1029364408  12.8670551
Optane (raw)  N        20:26.19     03:30.62  03:55.97  36     -             80670953   1305         645367344  645547472   8.0693434

RocksDB

Storage       Journal  Wall         User      Sys       CPU %  DB Size (KB)  Vol CtxSw  Invol CtxSw  FS In      FS Out      Write Amp
Flash/Ext4    Y        15:00.44     13:01.27  11:45.63  165    318790584     231768     3184         11400      1265319232  15.8164904
Flash/Ext4    N        14:30.45     12:53.43  10:46.62  163    318790584     215318     2786         11016      1265362424  15.8170303
Optane/Ext4   Y        02:13:40.00  13:51.74  11:14.07  18     318790328     339737     7549         11088      1265319000  15.8164875
Optane/Ext4   N        02:13:40.00  13:47.29  10:49.81  18     318790328     337922     7598         11256      1265364360  15.8170545

The "(raw)" rows are the loads run directly on the raw block device; with no filesystem there is no measured DB size to report.

The "Wall" time is the total wall-clock time taken to run the loading process. Obviously shorter times are faster/better. The actual CPU time used is shown for both User mode and System mode. User mode represents time spent in actual application code; time spent in System mode shows operating system overhead where the OS must do something on behalf of the application, but not actual application work. In a pure RAM workload where no I/O occurs, ideally the computer should be spending 100% of its time in User mode, processing the actual work of the application. Since this workload is 5x larger than RAM, it's expected that a significant amount of time is spent in System mode performing actual I/O.

The "CPU" column is the User and System time added together and divided by the Wall time, expressed as a percentage. This shows how much of the DB load's work occurred in background threads. Ideally this value should be 100: all foreground and no background work. If the value is greater than 100, a significant portion of the work was done in the background. If the value is less than 100, a significant portion of the time was spent waiting for I/O. When a DB engine relies heavily on background processing to achieve its throughput, it will bog down more noticeably when the system gets busy; if the system is already busy doing work on behalf of users, there will be no idle resources left over for background processing.
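
As a worked example, take the journaled Flash/Ext4 LMDB load in the table above: User 01:15.70 (75.70s) plus Sys 09:40.36 (580.36s) is 656.06s, and dividing by the Wall time of 11:50.91 (710.91s) gives 0.92, i.e. the 92 shown in the CPU column.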

The "Context Switches" columns show the number of Voluntary and Involuntary context switches that occurred during the load. Voluntary context switches are those which occur when a program calls a function that can block - system calls, mutexes and other synchronization primitives, etc. Involuntary context switches occur e.g. when a CPU must handle an interrupt, or when the running thread's time slice has been fully consumed. LMDB issues write() system calls whenever it commits a transaction, so there are a lot of voluntary context switches here. However, not every write() results in a context switch - this depends largely on the behavior of the OS filesystem cache. RocksDB is configured with a large cache (32GB, one half of available RAM) as well as a large write buffer (256MB) so it has far fewer voluntary context switches. But since this workload is dominated by I/O, the CPU overhead of LMDB's context switches has little impact on the overall runtime.
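
For reference, the 32GB block cache and 256MB write buffer mentioned above look roughly like the following when set through RocksDB's C API. This is a sketch, not the actual driver configuration (which is on GitHub and may set other options as well), and the database path is a placeholder.

    #include <stdio.h>
    #include <rocksdb/c.h>

    int main(void)
    {
        char *err = NULL;

        /* 32GB block cache: half of the 64GB RAM available to the test. */
        rocksdb_cache_t *cache = rocksdb_cache_create_lru(32UL * 1024 * 1024 * 1024);
        rocksdb_block_based_table_options_t *table_opts = rocksdb_block_based_options_create();
        rocksdb_block_based_options_set_block_cache(table_opts, cache);

        rocksdb_options_t *opts = rocksdb_options_create();
        rocksdb_options_set_create_if_missing(opts, 1);
        rocksdb_options_set_write_buffer_size(opts, 256 * 1024 * 1024);  /* 256MB memtable */
        rocksdb_options_set_block_based_table_factory(opts, table_opts);

        /* The mount point is a placeholder for the freshly formatted test filesystem. */
        rocksdb_t *db = rocksdb_open(opts, "/mnt/test/rocksdb", &err);
        if (err) { fprintf(stderr, "rocksdb_open: %s\n", err); return 1; }

        /* ... load the DB and run the workload ... */

        rocksdb_close(db);
        rocksdb_options_destroy(opts);
        rocksdb_block_based_options_destroy(table_opts);
        rocksdb_cache_destroy(cache);
        return 0;
    }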

The "FS Ops" columns show the number of actual I/O operations performed, which is usually different from the number of DB operations performed. Since the loading task is write-only, we would expect few, if any, input operations. However, since the DB is much larger than RAM, it's normal for some metadata to be re-read during the course of the run, as the written data pushes other information out of the filesystem cache. The number of outputs is more revealing, as it directly shows the degree of write amplification occurring. There are only 80 million DB writes being performed, but far more than 80 million actual writes occur in each run; for example, the journaled Flash/Ext4 LMDB load performs 840,839,736 filesystem outputs, or roughly 10.5 filesystem writes per DB write, which is the Write Amp column. Comparing the ext4 runs against the raw block device runs shows that the filesystem adds at least 25% more writes than the DB itself issues.
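
The numbers in these columns were captured with standard tooling (the command scripts and atop records are in the download at the end of this page), but equivalent per-process counters are available to a C driver directly through getrusage(2). The sketch below assumes the FS Ops columns correspond to the block input/output counters that /usr/bin/time reports (ru_inblock/ru_oublock); it is illustrative only.

    #include <stdio.h>
    #include <time.h>
    #include <sys/resource.h>

    /* Print the counters behind the table columns: Wall, User, Sys, CPU %,
     * voluntary/involuntary context switches, and filesystem inputs/outputs. */
    static void report(double wall)
    {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);

        double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
        double sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;

        printf("Wall %.2fs  User %.2fs  Sys %.2fs  CPU %.0f%%\n",
               wall, user, sys, wall > 0 ? 100.0 * (user + sys) / wall : 0.0);
        printf("Vol ctxsw %ld  Invol ctxsw %ld  FS in %ld  FS out %ld\n",
               ru.ru_nvcsw, ru.ru_nivcsw, ru.ru_inblock, ru.ru_oublock);
    }

    int main(void)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        /* ... load the DB or run the readwhilewriting phase here ... */

        clock_gettime(CLOCK_MONOTONIC, &t1);
        report((t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        return 0;
    }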

There are a few unexpected results here. The LMDB loads actually ran slower with the filesystem journal turned off, and the LMDB loads on the raw block device ran slower than those with a filesystem. The I/O statistics imply that the block device wasn't caching any of the device reads. RocksDB has a serious performance issue on the Optane filesystems, taking over 2 hours to load the data. There's no explanation for that yet.

Here are the load times plotted again, without the 2 hour outliers.

With LMDB on the raw block device, each write of a record results in an immediate write to the device, which always causes a context switch, so for 80 million records there are at least 80 million voluntary context switches. In general, even though this is a purely sequential workload, RocksDB performs more filesystem writes per database write than LMDB, and usually more filesystem reads. The latter is somewhat surprising, because LSM-based designs are supposed to support "blind writes" - i.e., writing a new record shouldn't require reading any existing data - which is supposed to be one of the features that makes them "write-optimized." That LSM advantage is not in evidence here.

Overall, the specs for the Optane P4800X show 11x more random write IOPS and lower latency than the Flash P4500 SSD, but all of the load results here are slower for the P4800X than for the Flash SSD. Again, we have no explanation for why the results aren't more reflective of the drive specs. At a guess, it may be due to wear on the SSDs from previous users. We had hoped that doing a fresh mkfs before each run, which also explicitly performs a Discard Blocks step on the device, would avoid wear-related performance issues, but that seems to have had no effect.

Throughput

The results for running the actual readwhilewriting test with varying numbers of readers are shown here.

Write throughput for RocksDB is uniformly slow, regardless of whether it's using the Flash or Optane SSD. In contrast, LMDB shows quite dramatically the performance difference that Optane offers, with peak random write throughput up to 3.5x higher on Optane than on Flash. Using the raw block device also yields slightly higher write throughput than using the ext4 filesystem.

The difference in read throughput between Flash and Optane isn't so great at the peak workload of 64 reader threads, but the differences are more obvious at the thread counts leading up to it. With LMDB on Flash, doubling the number of reader threads essentially doubles throughput, except at 64 readers, where the increase is much smaller. The way the results bunch up at thread counts of 8 or more for LMDB on Optane implies that the I/O subsystem is the bottleneck there, with no headroom for further doubling. RocksDB's peak is still about the same (or slightly slower) on Optane as on Flash, and still slower than LMDB.

Conclusion

The LMDB engine itself will never be the bottleneck in your workloads. When you move to faster storage technologies, LMDB lets you utilize the full potential of that hardware. Inferior technologies like LSM designs won't.

Test results using the Optane SSD as RAM are on the next page.

Files

The files used to perform these tests are all available for download.

data.tgz (90,318,154 bytes, Jul 27 01:28) - Command scripts, output, atop record
A LibreOffice spreadsheet with the tabulated results is also available for download.

The source code for the benchmark drivers is all on GitHub. We invite you to run these tests yourself and report your results back to us.