Avoiding Fragmentation with Fractal Trees
B-trees suffer from fragmentation. Fragmentation causes headaches — in query performance and space used. Solutions, like dump and reload or OPTIMIZE TABLE are a pain and not always effective. Fractal trees don’t fragment. So if fragmentation is a problem, check out Tokutek
What is fragmentation?
What do I mean when I say “fragmentation”? People complain about two things when they talk about index fragmentation. They either find that the disk spaced used is much larger than the data, or they complain about query performance. In particular, they complain of range query performance, since point queries aren’t really affected by fragmentation. I’m going to focus on this second symptom, which is due to lack of locality of reference amongst the rows.
In the following, I’ll be talking about the limits of B-trees. Not InnoDB in particular, but B-trees in general. InnoDB’s B-tree, and MyISAM’s, and SQL Server’s, and Oracle’s, and …
Consider two types of B-trees: clustered and non-clustered. In the MySQL world, we can think of these and InnoDB and MyISAM. Since I’m talking about locality of reference amongst rows, I’ll treat these separately.
A fragmented non-clustering B-trees is just about as fragmented as you can get. If you insert data in an arbitrary order, the logical order of the rows will be unrelated to the physical placement on disk. Scanning through a chunk of rows will cause the disk head to move around for each row (in the worst case). Non-clustering B-trees are generally not a good choice for range queries.
Clustering B-trees place groups of logically close rows into leaf blocks together. That means that when you fetch a leaf during a range query, you get a bunch of rows that satisfy your WHERE clause, not just one row. Thus, range queries on a clustering B-tree are faster. (Ok, it’s possible to concoct a case where MyISAM does range queries faster than InnoDB, so insert appropriate caveats here, but these are corner cases I’m going
Inserting into a clustering B-tree is slower than inserting into a non-clustering B-tree. B-trees have a tradeoff between insertion and fragmentation. Important note: this tradeoff isn’t inherent in the indexing problem. It’s a tradeoff that B-trees have.
So can we do even better on range queries — and have even less fragmentation than a clustering B-tree? Yes. If you keep the leaves in order on disk and nicely packed, then range queries go even faster — up to two orders of magnitude faster. In the case of range query performance, fragmentation is synonymous with scattered leaves. And defragmentation is synonymous with getting the leaves back into order.
The most common forms of defragmentation are OPTIMIZE TABLE, though Baron Schwartz
points out that InnoDB defragments the primary key when doing an optimize table. Secondary tables do not get defragmented. In fact, they may end up more fragmented. Alternatively, you can regularly dump and reload. And finally, you could implement a B-tree with defragment-as-you-go heuristics, but these slow down insertions considerably (there’s that pesky tradeoff) and ultimately can’t keep up with a heavily loaded database.
Everything I’ve been saying is about B-trees in general, not just the ones available for MySQL. For example, in a sample Dell/SQL Server reference configuration, a disproportionate amount of hardware is spent on making index rebuilding fast. Why? Because B-trees fragment.
Fractal trees don’t fragment
So all I’ve said so far is true of B-trees. But that doesn’t mean it’s true of all indexing structures. Fractal trees don’t fragment. They can’t fragment. That means the primary table isn’t fragmented, and neither are the secondary indexes. That means they aren’t fragmented if you bulk load or if you trickle load. In short, if fragmentation is an issue for you, Fractal trees are worth looking at as a solution. Fractal trees are available in TokuDB , a MySQL storage engine.
The existence of fractal trees also illustrates that there is no inherent tradeoff between fragmentation and insertion speed. Fractal trees are much faster than B-trees at insertions — up to two orders of magnitude faster — and don’t fragment. So B-trees sit on a tradeoff curve, but not the best possible tradeoff curve.
Fragmentation is a complicated matter, and I don’t claim to have said anything exhaustive about it. For example, fragmentation can cause B-tree tables to take up too much disk space, and defragmentation can cause B-tree tables to shrink, although this is a matter of some dispute for InnoDB. Also, it’s not actually that easy to measure fragmentation in a B-tree. My recommendation is to use an index that doesn’t fragment!