Basement Nodes: Turning Big Writes into Small Reads
Fast indexing requires the leaves of a Fractal Tree® Index to be big. But some queries require the leaves to be small in order to get any reasonable performance. Basement nodes are our way to achieve these conflicting goals, and here I’ll explain how.
On many occasions, we at Tokutek have pointed out that TokuDB is write optimized, which means TokuDB indexes data much faster than a B-tree solution such as InnoDB. As with any write-optimized data structure, Fractal Tree indexes need to bundle up lots of small writes into a few big writes. Otherwise, there’d be no way to beat a B-tree. So the question is, how big do the writes have to be?
Consider how long it takes to write k bytes to a disk. First, there is the seek time s, which we can assume to be independent of k. Next, once we’ve moved the disk head somewhere, we need to write the bytes, which takes time k/b, if the bandwidth of the disk is b bytes per second. The total time is s + k/b. For very big writes, where k/b > s, this is fine, because the time to write to disk is dominated by the bandwidth cost, which is a natural lower bound for writing. But for k < sb, the seek time dominates, and this is the root of the performance problem for B-tree indexing.
We can conclude that Fractal Tree leaves need to be at least sb bytes in size, which for a typical disk might be 10ms × 100MB/s = 1MB.
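This back-of-the-envelope model is easy to sketch in a few lines of Python, using the example values from the text (s = 10ms, b = 100MB/s):

```python
# Sketch of the disk cost model described above (not TokuDB code):
# writing k bytes costs s + k/b, where s is seek time and b is bandwidth.

s = 0.010            # seek time: 10 ms
b = 100 * 10**6      # bandwidth: 100 MB/s, in bytes per second

def write_time(k):
    """Time to write k bytes: one seek plus the bandwidth cost."""
    return s + k / b

# Bandwidth cost equals seek cost when k/b = s, i.e. k = s * b.
break_even = s * b
print(break_even)            # 1000000.0 bytes, i.e. about 1MB

print(write_time(4096))      # ~0.010 s: a small write is almost all seek
print(write_time(10**7))     # 0.11 s: a 10MB write is dominated by bandwidth
```

The break-even point k = sb is exactly the ~1MB leaf size claimed above: below it, you pay mostly for seeks; above it, you pay mostly for bytes, which no data structure can avoid.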
One of the nice side effects of storing data in large leaves is that we get great compression. Compression can be like a freight train. It’s slow to get started, but given enough track it can pick up steam. So if you want to compress a few kilobytes, you’ll have a hard time, but compressing larger chunks of data — where the compressor has had enough time to “learn” the distribution of the data — works great. And our customers report compression rates from 5x – 25x. That’s largely a consequence of having large leaves. Compare that to InnoDB, which peaks at about 2.5x.
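You can see this “freight train” effect with any general-purpose compressor. Here is a rough illustration using Python’s zlib (not TokuDB’s actual compressor) on some made-up repetitive row data, comparing many independently compressed small chunks against one large chunk of the same bytes:

```python
import zlib

# Made-up repetitive "row" data, roughly 420KB in total.
data = b"".join(b"user:%05d,city:Springfield,status:active\n" % i
                for i in range(10000))

# Compress in tiny 1KB pieces, each starting from scratch.
small = sum(len(zlib.compress(data[i:i + 1024]))
            for i in range(0, len(data), 1024))

# Compress the whole thing in one go.
large = len(zlib.compress(data))

print(len(data) / small)  # modest compression ratio
print(len(data) / large)  # noticeably better ratio
```

Each small chunk pays header overhead and forces the compressor to re-learn the data’s patterns from nothing, so the one big chunk always wins.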
What about reads?
The nice thing about writes is that we get to bundle them up, buffer them, and slice and dice them however we want, as long as the queries are correctly answered in the end. But queries are different. If a query only needs to look at one row, it only needs to look at one row, and no amount of algorithmic pyrotechnics is going to change that.
A read for an uncached (more on that later) query is going to incur the seek time, no matter what data structure is used. But what about the bandwidth time? A 1MB leaf might be great for writes, but it takes 10ms to read the leaf off disk, once the disk head has moved. That means that you end up paying a 2x penalty for having large leaves. Instead of paying 10ms to read just the row of interest (with a trivially small bandwidth cost), you pay 10ms to get to the leaf and another 10ms to read the leaf.
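Plugging the same assumed numbers (s = 10ms, b = 100MB/s) into the cost model makes the 2x penalty concrete:

```python
# Same cost model as for writes, applied to an uncached point read.
s, b = 0.010, 100 * 10**6   # assumed: 10 ms seek, 100 MB/s bandwidth

def read_time(k):
    """Time to read k bytes: one seek plus the bandwidth cost."""
    return s + k / b

leaf = 10**6                # a 1MB leaf
row = 100                   # a single small row

print(read_time(leaf))      # 0.02 s: 10ms seek + 10ms bandwidth, the 2x penalty
print(read_time(row))       # ~0.01 s: essentially just the seek
```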
So maybe we don’t need to read the whole leaf. If the leaf weren’t compressed, we could, in theory, just seek to the actual location within the leaf of the row of interest and read it. Reading something out of a compressed string isn’t quite so simple, and I really don’t consider it to be feasible.
Basement nodes are how we achieve both design goals. Instead of compressing the whole leaf in one go, we compress it in several pieces. Each piece is called a basement node. Really, they are regions of a Fractal Tree leaf. They are big enough to compress well and small enough to have a relatively small bandwidth cost. Their default size is 128KB. Another great effect of smaller basement nodes is that fetching a leaf as part of a query has a smaller effect on memory.
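A toy sketch of the idea, with zlib standing in for the real compressor (this is an illustration of the concept, not TokuDB’s actual on-disk format):

```python
import zlib

# Basement-node idea: compress a leaf in fixed-size pieces so a point query
# only has to read and decompress one piece, not the whole leaf.
BASEMENT_SIZE = 128 * 1024   # default target uncompressed size from the text

def compress_leaf(leaf_bytes):
    """Split the leaf into basement nodes and compress each independently."""
    return [zlib.compress(leaf_bytes[i:i + BASEMENT_SIZE])
            for i in range(0, len(leaf_bytes), BASEMENT_SIZE)]

def read_at(basements, offset):
    """Fetch one byte at a leaf offset by decompressing a single basement node."""
    node = zlib.decompress(basements[offset // BASEMENT_SIZE])
    return node[offset % BASEMENT_SIZE]

leaf = bytes(range(256)) * 4096           # a 1MB leaf of sample data
basements = compress_leaf(leaf)
print(len(basements))                     # 8 basement nodes of 128KB each
print(read_at(basements, 500_000) == leaf[500_000])  # True
```

Each basement node is still large enough for compression to work well, but a point query only pays the bandwidth and decompression cost of one 128KB piece instead of the full 1MB leaf.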
Picking a basement-node size
We make it possible to set the basement-node size when a table is created:
Fractal tree leaves are subdivided into read blocks, in order to speed up point queries. The session variable tokudb_read_block_size controls the target uncompressed size of the read blocks. The units are bytes. The default is 131,072 (128KB). A smaller value favors read performance over write performance. Conversely, a larger value favors write performance. In our benchmarking, we have found that there is little advantage to setting the value below 65,536 (64KB) or above 131,072. The minimum value of this variable is 4096.
For now, our recommendation is to set the tokudb_read_block_size to 128K if you have archival tables or tables where you plan on doing lots of large range queries, and 64K for tables you plan on using for lots of point queries.
A few caveats: all indexes for any given table will have the same basement-node size, and it’s not possible to change the basement-node size on the fly for a table that already exists. Changing it requires a dump and reload.
For those who are curious about performance numbers, stay tuned. We’ll be posting benchmarks soon.