Avoiding Fragmentation with Fractal Trees

Posted On November 17, 2010 | By Martin Farach-Colton | 9 comments

Summary

B-trees suffer from fragmentation. Fragmentation causes headaches — in query performance and space used. Solutions, like dump and reload or OPTIMIZE TABLE are a pain and not always effective. Fractal trees don’t fragment. So if fragmentation is a problem, check out Tokutek

What is fragmentation?

What do I mean when I say “fragmentation”? People complain about two things when they talk about index fragmentation. They either find that the disk spaced used is much larger than the data, or they complain about query performance. In particular, they complain of range query performance, since point queries aren’t really affected by fragmentation. I’m going to focus on this second symptom, which is due to lack of locality of reference amongst the rows.

B-trees fragment

In the following, I’ll be talking about the limits of B-trees. Not InnoDB in particular, but B-trees in general. InnoDB’s B-tree, and MyISAM’s, and SQL Server’s, and Oracle’s, and …

Consider two types of B-trees: clustered and non-clustered. In the MySQL world, we can think of these and InnoDB and MyISAM. Since I’m talking about locality of reference amongst rows, I’ll treat these separately.

A fragmented non-clustering B-trees is just about as fragmented as you can get. If you insert data in an arbitrary order, the logical order of the rows will be unrelated to the physical placement on disk. Scanning through a chunk of rows will cause the disk head to move around for each row (in the worst case). Non-clustering B-trees are generally not a good choice for range queries.

Clustering B-trees place groups of logically close rows into leaf blocks together. That means that when you fetch a leaf during a range query, you get a bunch of rows that satisfy your WHERE clause, not just one row. Thus, range queries on a clustering B-tree are faster. (Ok, it’s possible to concoct a case where MyISAM does range queries faster than InnoDB, so insert appropriate caveats here, but these are corner cases I’m going
to ignore.)

Inserting into a clustering B-tree is slower than inserting into a non-clustering B-tree. B-trees have a tradeoff between insertion and fragmentation. Important note: this tradeoff isn’t inherent in the indexing problem. It’s a tradeoff that B-trees have.

So can we do even better on range queries — and have even less fragmentation than a clustering B-tree? Yes. If you keep the leaves in order on disk and nicely packed, then range queries go even faster — up to two orders of magnitude faster. In the case of range query performance, fragmentation is synonymous with scattered leaves. And defragmentation is synonymous with getting the leaves back into order.

The most common forms of defragmentation are OPTIMIZE TABLE, though Baron Schwartz
points out that InnoDB defragments the primary key when doing an optimize table. Secondary tables do not get defragmented. In fact, they may end up more fragmented. Alternatively, you can regularly dump and reload. And finally, you could implement a B-tree with defragment-as-you-go heuristics, but these slow down insertions considerably (there’s that pesky tradeoff) and ultimately can’t keep up with a heavily loaded database.

Everything I’ve been saying is about B-trees in general, not just the ones available for MySQL. For example, in a sample Dell/SQL Server reference configuration, a disproportionate amount of hardware is spent on making index rebuilding fast. Why? Because B-trees fragment.

Fractal trees don’t fragment

So all I’ve said so far is true of B-trees. But that doesn’t mean it’s true of all indexing structures. Fractal trees don’t fragment. They can’t fragment. That means the primary table isn’t fragmented, and neither are the secondary indexes. That means they aren’t fragmented if you bulk load or if you trickle load. In short, if fragmentation is an issue for you, Fractal trees are worth looking at as a solution. Fractal trees are available in TokuDB , a MySQL storage engine.

The existence of fractal trees also illustrates that there is no inherent tradeoff between fragmentation and insertion speed. Fractal trees are much faster than B-trees at insertions — up to two orders of magnitude faster — and don’t fragment. So B-trees sit on a tradeoff curve, but not the best possible tradeoff curve.

And finally…

Fragmentation is a complicated matter, and I don’t claim to have said anything exhaustive about it. For example, fragmentation can cause B-tree tables to take up too much disk space, and defragmentation can cause B-tree tables to shrink, although this is a matter of some dispute for InnoDB. Also, it’s not actually that easy to measure fragmentation in a B-tree. My recommendation is to use an index that doesn’t fragment!

9 thoughts

  1. Edmar says:

    Awesome read about fragmentation, thanks!

    But if Fractal Trees are always better than B-Trees, why do B-Trees still exist?

    Is indexing through Fractal Trees proprietary to Tokutek in some way? Or is it just too new?

  2. Martin Farach-Colton says:

    Fractal trees are both relatively new and proprietary to Tokutek.

  3. I don’t think “proprietary” does justice to TokuDB. Lots of companies do proprietary software but it usually isn’t anything special in terms of new ideas. Tokutek was founded by the professors who invented fractal trees and then took the bigger step of trying to turn theory into working software.

    I am still trying to figure out how it works. My summary is that it makes writes fast without having a read penalty. Other systems that make writes fast (HBase, BigTable, LSM-trees) have a penalty as key lookups may require reads from multiple files.

  4. Martin Farach-Colton says:

    Thanks, Mark, for the kind words. Yes, Fractal trees break through the B-tree tradeoff between reads and writes. It is a new data structure, which we hold the IP for. The only place to get Fractal trees is from Tokutek.

  5. Josh Berkus says:

    How about an technical explanation of how fractal trees work? I’ve been hearing about how cool they are from the Tokutek folks, but haven’t really gotten a rundown on the implementation. I tried to read the academic papers involved but, well … a bit dense, to say the least.

  6. Martin Farach-Colton says:

    Yes, the academic papers aren’t exactly a fun read. We have some presentations and white papers on http://tokutek.com/technology that explain what we’re up to. You might check out “How TokuDB Fractal Tree Indexes Work” by Bradley Kuszmaul.

  7. [...] B-tree-based storage engines, like InnoDB, are slow to delete data, and deletions fragment the table, with consequences to performance. Dropping a partition makes deletions fast and does not fragment the table. On the other hand, for Fractal Tree storage engines, like TokuDB, deletions are fast — though not as fast as dropping a partition — and Fractal Tree indexes don’t fragment. [...]

  8. [...] is to not fragment your tables in the first place. So what’s the solution? You guessed it: TokuDB tables don’t fragment by design, so this particular chunk of maintenance is not [...]

  9. [...] tables don’t fragment, so OPTIMIZE TABLE doesn’t defragment [...]

Leave a Reply

Your email address will not be published. Required fields are marked *