Symas Corp., July 2012
This page follows on from Google's LevelDB benchmarks published in July 2011 at LevelDB. (A snapshot of that document is available here for reference.) In addition to the systems tested there, we add the venerable BerkeleyDB as well as the OpenLDAP MDB database. For this test, we compare LevelDB version 1.5 (git rev dd0d562b4d4fbd07db6a44f9e221f8d368fee8e4), SQLite3 (version 3.7.7.1), Kyoto Cabinet's TreeDB (version 1.2.76; a B+tree-based key-value store), Berkeley DB 5.3.21, and OpenLDAP MDB (git rev a0993354a603a970889ad5c160c289ecca316f81). We would like to acknowledge the LevelDB project for the original benchmark code.
Benchmarks were all performed on a Dell Precision M4400 laptop with a quad-core Intel(R) Core(TM)2 Extreme CPU Q9300 @ 2.53GHz, with 6144 KB of total L3 cache and 8 GB of DDR2 RAM at 800 MHz. (Note that LevelDB uses at most two CPUs since the benchmarks are single-threaded: one to run the benchmark, and one for background compactions.) The benchmarks were run on two different filesystems: tmpfs, and reiserfs on an SSD. The SSD is a relatively old model, a Samsung PM800 Series 256GB. The system had Ubuntu 12.04 installed, with kernel 3.2.0-26. Tests were all run in single-user mode to prevent variations due to other system activity. CPU frequency scaling was disabled (scaling_governor = performance) to ensure a consistent CPU clock speed for all tests. The numbers reported below are the median of three measurements. The databases are completely deleted between each of the three measurements.
Update: Additional tests were run on a Western Digital WD20EARX 2TB SATA hard drive. The HDD results start in Section 8. The results across multiple filesystems are in Section 11.
We wrote benchmark tools for SQLite, BerkeleyDB, MDB, and Kyoto TreeDB based on LevelDB's db_bench. The LevelDB, SQLite3, and TreeDB benchmark programs were originally provided in the LevelDB source distribution but we've made additional fixes to the versions used here. The code for each of the benchmarks resides here:
Most database vendors claim their product is fast and lightweight. Looking at the total size of each application gives some insight into the footprint of each database implementation.
size db_bench*
   text     data   bss      dec     hex  filename
 271991     1456   320   273767   42d67  db_bench
1682579     2288   296  1685163  19b6ab  db_bench_bdb
  96879     1500   296    98675   18173  db_bench_mdb
 655988     7768  1688   665444   a2764  db_bench_sqlite3
 296244     4808  1080   302132   49c34  db_bench_tree_db

The core of the MDB code is barely 32K of x86-64 object code. It fits entirely within most modern CPUs' on-chip caches. All of the other libraries are several times larger.
This section gives the baseline performance of all the databases. Following sections show how performance changes as various parameters are varied. For the baseline:
LevelDB has the fastest write operations. MDB has the fastest read operations by a huge margin, due to its single-level-store architecture. MDB was written for OpenLDAP; LDAP directory workloads tend to involve many reads and few writes, so read optimization is more critical there than write optimization. LevelDB is oriented toward many writes and few reads, so write optimization is emphasized there.
A batch write is a set of writes that are applied atomically to the underlying database. A single batch of N writes may be significantly faster than N individual writes. The following benchmark writes one thousand batches, each containing one thousand 100-byte values. TreeDB does not support batch writes, so its baseline numbers are repeated here for reference.
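For readers unfamiliar with the mechanism, a batched write in LevelDB's C++ API looks roughly like the following minimal sketch; the database path, key format, and value contents are placeholders rather than the benchmark's actual parameters:

    #include <cassert>
    #include <cstdio>
    #include <cstring>
    #include "leveldb/db.h"
    #include "leveldb/write_batch.h"

    int main() {
      leveldb::DB *db;
      leveldb::Options options;
      options.create_if_missing = true;
      leveldb::Status s = leveldb::DB::Open(options, "/tmp/batchdb", &db);
      assert(s.ok());

      // Accumulate 1000 Put operations in a single batch...
      leveldb::WriteBatch batch;
      char key[16], val[100];
      memset(val, 'x', sizeof(val));
      for (int i = 0; i < 1000; ++i) {
        snprintf(key, sizeof(key), "%015d", i);
        batch.Put(key, leveldb::Slice(val, sizeof(val)));
      }
      // ...then apply them all as one atomic write.
      s = db->Write(leveldb::WriteOptions(), &batch);
      assert(s.ok());
      delete db;
      return 0;
    }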
Because of the way LevelDB persistent storage is organized, batches of random writes are not much slower (only a factor of 1.6x) than batches of sequential writes. MDB has a special optimization for sequential writes, which is most effective in batched operation.
In the following benchmark, we enable the synchronous writing modes of all of the databases. Since this change significantly slows down the benchmark, we stop after 10,000 writes. Unfortunately the resulting numbers are not directly comparable to the async numbers, since overall database size is also a factor in write performance and the resulting databases here are much smaller than the baseline.
For both LevelDB and TreeDB, the cost of synchronous operation outweighs the benefit of the much smaller database; TreeDB in particular performs extremely poorly in synchronous mode. On random writes, for SQLite3, MDB, and BerkeleyDB, the smaller database size completely negates the cost of the synchronous writes.
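For illustration, synchronous mode in LevelDB is just a flag on each write request; the other libraries have analogous controls (SQLite3's PRAGMA synchronous, BerkeleyDB's DB_TXN_NOSYNC flag, and MDB's MDB_NOSYNC environment flag). A minimal LevelDB sketch, with a placeholder path and key/value:

    #include <cassert>
    #include "leveldb/db.h"

    int main() {
      leveldb::DB *db;
      leveldb::Options options;
      options.create_if_missing = true;
      leveldb::Status s = leveldb::DB::Open(options, "/tmp/syncdb", &db);
      assert(s.ok());

      leveldb::WriteOptions wo;
      wo.sync = true;  // flush from the OS buffer cache before the write is considered done
      s = db->Put(wo, "key", "value");
      assert(s.ok());
      delete db;
      return 0;
    }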
We increased the overall cache size for each database to 128 MB. For SQLite3, we kept the page size at 1024 bytes, but increased the number of pages to 131,072 (up from 4096). For TreeDB, we also kept the page size at 1024 bytes, but increased the cache size to 128 MB (up from 4 MB). MDB has no application-level cache of its own (it relies entirely on the filesystem cache), so its numbers are simply copied from the baseline. Both MDB and BerkeleyDB use the default system page size (4096 bytes).
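For reference, the cache settings for LevelDB and SQLite3 are configured roughly as follows; this is a sketch with a placeholder database path, not the exact option plumbing used in db_bench:

    #include <sqlite3.h>
    #include "leveldb/cache.h"
    #include "leveldb/db.h"

    int main() {
      // LevelDB: supply a 128 MB LRU block cache in the options used
      // to open the database (replacing the small default cache).
      leveldb::Options options;
      options.block_cache = leveldb::NewLRUCache(128 * 1048576);
      // ... open and use the DB with these options; the cache must be
      // deleted only after the DB itself has been deleted.

      // SQLite3: 131,072 pages at the 1024-byte page size = 128 MB.
      sqlite3 *sdb;
      sqlite3_open("/tmp/test.db", &sdb);
      sqlite3_exec(sdb, "PRAGMA page_size=1024", NULL, NULL, NULL);
      sqlite3_exec(sdb, "PRAGMA cache_size=131072", NULL, NULL, NULL);
      sqlite3_close(sdb);

      delete options.block_cache;
      return 0;
    }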
For this benchmark, we use 100,000 byte values. To keep the benchmark running time reasonable, we stop after writing 1000 values. Otherwise, all of the same tests as for the Baseline are run.
MDB's single-level-store architecture clearly outclasses all of the other designs; the others barely even register on the results. MDB's zero-memcpy reads mean its read rate is essentially independent of the size of the data items being fetched; it is only affected by the total number of keys in the database.
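To make the zero-memcpy point concrete, here is a minimal sketch of a read using MDB's C API (assuming the lmdb.h header of current releases; the database path and key are placeholders, and error checking is omitted). mdb_get() hands back a pointer directly into the read-only memory map, so no buffer is allocated and nothing is copied, regardless of the value's size:

    #include <string.h>
    #include <lmdb.h>

    int main() {
      MDB_env *env;
      MDB_txn *txn;
      MDB_dbi dbi;
      MDB_val key, data;

      mdb_env_create(&env);
      mdb_env_open(env, "./testdb", 0, 0664);
      mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
      mdb_dbi_open(txn, NULL, 0, &dbi);

      key.mv_size = strlen("somekey");
      key.mv_data = (void *)"somekey";
      if (mdb_get(txn, dbi, &key, &data) == 0) {
        // data.mv_data points directly into the memory map;
        // the value was never copied anywhere.
      }
      mdb_txn_abort(txn);  // releasing a read-only txn is cheap
      mdb_env_close(env);
      return 0;
    }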
TreeDB has very good performance with large values using asynchronous writes, but much worse performance in synchronous mode. Batch mode appears to have no benefit with large values; the work of writing the values cancels out the efficiency gained from batching. MDB has additional features for handling large values, but the current benchmark code doesn't exercise them.
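One such feature, and this is our assumption as to what is meant, is the MDB_RESERVE put flag: instead of copying a caller-supplied buffer, MDB reserves space for the value in the map and returns a pointer, letting the application generate a large value in place. A sketch, reusing the env/txn setup from the read example above:

    #include <string.h>
    #include <lmdb.h>

    // Hypothetical helper: write a 100 KB value without an extra copy.
    // Assumes txn is a write transaction and dbi is an open database.
    int put_reserved(MDB_txn *txn, MDB_dbi dbi) {
      MDB_val key, data;
      key.mv_size = strlen("bigkey");
      key.mv_data = (void *)"bigkey";
      data.mv_size = 100000;  // only the size is supplied, no buffer
      int rc = mdb_put(txn, dbi, &key, &data, MDB_RESERVE);
      if (rc == 0) {
        // MDB has reserved space and set data.mv_data to point at it;
        // fill the value directly, avoiding a 100 KB memcpy.
        memset(data.mv_data, 'x', data.mv_size);
      }
      return rc;
    }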
The same tests as in Section 2 are performed again, this time using the Samsung SSD with reiserfs. This drive has been in regular use over the past several years and was not reformatted for the tests. It has very poor random write speed as a result.
Read performance is essentially the same as for tmpfs since all of the data is present in the filesystem cache.
Most of the databases perform at close to their tmpfs speeds, which is expected since these are asynchronous writes. However, BerkeleyDB shows a large reduction in throughput.
Here the difference between SSD and tmpfs is made obvious.
The slowness of the SSD overshadows any difference between sequential and random write performance here.
We increased the overall cache size for each database to 128 MB, as in Section 3. The "baseline" in these tests refers to the values from Section 5.
This is the same as the test in Section 4, using the SSD.
The read results are about the same as for tmpfs.
As before, TreeDB's write performance is good on asynchronous writes. BerkeleyDB's performance degrades the least in synchronous mode.
The same tests as in Section 2 are performed again, this time using the Western Digital WD20EARX HDD with an EXT3 filesystem. The drive was attached to the laptop's eSATA port, so interface bottlenecks are not an issue. The MDB library used here is a little newer than in the previous tests (revision 5da67968afb599697d7557c13b65fb961ec408dd), which results in faster sequential write rates, so those numbers are not directly comparable to the earlier ones.
Note that this data does not represent the maximum performance that the drive is capable of. For completeness, the tests were repeated on multiple other filesystems including EXT2, EXT3, EXT4, JFS, XFS, NTFS, ReiserFS, BTRFS, and ZFS. Those results will be uploaded later.
This drive uses 4KB physical sectors. It was partitioned into two 1TB partitions, 4KB-aligned. The first partition was formatted with NTFS; the second partition was reused for each of the other filesystems.
Read performance is essentially the same as the previous tests since all of the data is present in the filesystem cache. LevelDB and BerkeleyDB are slightly slower than before.
Kyoto Cabinet performs close to its tmpfs speed, while the other databases show more of a reduction in throughput. BerkeleyDB slows down the most.
As slow as the SSD was, the HDD results are even slower.
Note, however, that further investigation shows these results are nowhere near the maximum performance of the HDD. More details on this are in Section 11.
The slowness of the HDD overshadows any difference between sequential and random write performance here. None of these systems are suitable for real-world use in this configuration, but Kyoto Cabinet is by far the worst. If an application demands full ACID transactions, Kyoto Cabinet should definitely be avoided.
We increased the overall cache size for each database to 128 MB, as in Section 3. The "baseline" in these tests refers to the values from Section 8.
This is the same as the test in Section 4, using the HDD.
Again, the read results are about the same as for tmpfs.
The slowness of the HDD makes most of the database implementations perform about the same. As before, Kyoto Cabinet is much slower than the rest.
The baseline test was repeated on the same HDD, but using a different filesystem each time. The filesystems tested are btrfs, ext2, ext3, ext4, jfs, ntfs, reiserfs, xfs, and zfs. In addition, the journaling filesystems that support using an external journal were retested with their journal stored on a tmpfs file. These were ext3, ext4, jfs, reiserfs, and xfs. Testing in this second configuration shows how much overhead the filesystem's journaling mechanism imposes, and how much performance is lost by using the default internal journal configuration.
Note: storing the journal on tmpfs was done only for the purposes of this test. In a real deployment the journal would need to be stored on an actual storage device, such as a separate disk; otherwise the filesystem would be lost after a reboot.
The filesystems are created fresh for each test. The tests are only run once each due to the great length of time needed to collect all of the data. (It takes several minutes just to run mkfs for some of these filesystems.) The full results are not presented in HTML here; you will have to download the Spreadsheet to view the results.
You can display the results for a specific benchmark operation across all the filesystem types using the selector in cell B23 of the sheet. Likewise, you can display the results for a specific filesystem across all the benchmark operations using the selector in cell B1, but because the results are so totally dominated by MDB read performance, this view isn't quite as informative.
To summarize: jfs with an external journal is the fastest for synchronous writes. If your workload demands fully synchronous transactions, it is clearly the best choice. Otherwise, the original ext2 filesystem is fastest for asynchronous writes.
The raw data for all of these tests is also available: tmpfs, SSD, and HDD. The results are also tabulated in an OpenOffice spreadsheet for further analysis here. The raw filesystem test results are in out.hdd.tar.gz.