How TokuMX Gets Great Compression for MongoDB
In my last post, I showed what a Fractal Tree® index is at a high level. Once again, the Fractal Tree index is the data structure inside TokuMX and TokuDB, our MongoDB and MySQL products. One of its strengths is the ability to get high levels of compression on the stored data. In this post, I’ll explain why that is.
At a high level, one can argue that there isn’t anything special about our compression algorithms. We basically do this: we take large chunks of data, use known compression methods (e.g. zlib, lzma, quicklz), and compress them. That’s it!
As my colleague Martin has said, “Compression can be like a freight train. It’s slow to get started, but given enough track it can pick up steam.” These algorithms work well when compressing large chunks of data, not small chunks. We compress large chunks of data, so we get great compression.
Customers have seen tremendous compression with TokuDB, our MySQL storage engine. For example, one customer saw 9x compression, while another saw roughly 85% savings. How much compression one sees depends on how well the compression algorithms can compress the data. If you have some data in MongoDB or MySQL, you can get an idea for how TokuMX and TokuDB will compress it by dumping or exporting it to flat files, and running gzip on them. We usually see better compression from equivalent MongoDB data because we’re compressing the field names, which are often repeated very frequently, and therefore compress very well.
So the real question is, how do Fractal Tree indexes compress large chunks of data? To answer that let’s dive a little deeper than we did last time into what a Fractal Tree index looks like. Below is a picture of the data structure.
In a B-Tree and Fractal Tree index, most of the data structure (well over 90%) is comprised of leaf nodes. So, to understand why we compress well, we only need to understand how we store leaf nodes. Traditionally, one reason for B-Trees to keep leaf nodes small is to reduce the I/O cost of performing a write. It’s bad enough that we may do a disk seek per write to bring the leaf node into memory. If a B-Tree also needs to pay the cost of disk bandwidth by reading in a large leaf node, this hurts performance more. Fractal Tree indexes, on the other hand, batch many writes per I/O. So, Fractal Tree indexes should have large blocks, on the order of 4MB. The larger the blocks, the more writes we can batch together, and the less each individual write costs us overall. In fact, our leaf nodes, uncompressed, are 4MB. So, the very nature of Fractal Tree indexes favor big block sizes of leaf nodes, which in turn is good for compression.
That, however, is not the whole story. I’m sure some readers are thinking to themselves, “what about reads? Reads matter.” And they are right. Reads do matter. Having a 4MB leaf node is smart for writes, but may be painful for reads. If a query performs a point query (which queries on non-covering secondary indexes do), then with 4MB nodes, a user would scan into memory 4MB of data just to read one measly row. This is painful for the same reasons reading 4MB of data just to write one row is painful. (Note that one difference here is that a disk seek is required regardless of the node size, whereas with writes, the disk seek is not required. So it is not quite the same).
So what do we do? We still have 4MB leaf nodes, but we partition the leaf node into 64KB contiguous chunks called “basement nodes.” These 64KB chunks are individually compressed, concatenated, and written disk to represent a leaf node. When we flush data to a leaf for writes, we read in the full 4MB of a leaf node. The amount flushed is enough data to make reading 4MB of data worthwhile. When reading “one measly row”, we read only the appropriate 64KB chunk of data, which is large enough to still get good compression, but small enough to not incur a huge bandwidth cost when performing a query.
Note that this value of 64KB is configurable. We’ve noticed that 64KB is a nice sweet spot to get good point query performance and compression in typical MySQL and MongoDB scenarios, but for some data sets and workloads, 32KB or 128KB can perform or compress a bit better. In a future blog post will explain the tradeoffs of tweaking this and other settings.
Put these characteristics together, and we have ourselves a data structure that can:
- compress really well
- get wonderful write performance by reading and writing large blocks of data
- get great query performance by having the ability to read small(er) chunks of data
And when you package it all up, you get fantastic MongoDB compression and performance.