Partitioning, Free Lunches, & Indexing, Part 2
In part one, I presented a very brief and particular view of partitioning. I covered what partitioning is, with hardly a mention of why one would use partitioning. In this post, I’ll talk about a few use cases often cited as justification for using partitions.
Lots of disks → Lots of partitioning of tables
One use case for justifying partitions is that each partition can be placed on a separate disk to avoid spindle contention. I have to say that on this one, I agree with Kevin Burton, who makes the point that if you want to distribute I/O load across several disks, you can use a RAID configuration on the disks. In this case, he says that partitioning is not worth the trouble. [NB. He makes the point that this is a problem with MySQL's implementation of partitions and its interaction with MySQL Cluster. He proposes some fixes. If those fixes ever come to pass, we'll have to see what the new performance characteristics look like.]
I spent some time trying to find comparisons between partitioning and RAIDing for load balancing in MySQL. If I missed something, I’d love to know about it, but the results aren’t jumping out.
In summary, if you’re going to claim that a partitioning scheme helps load balance across disks, it makes sense to compare this with a well-designed RAID system. Partitioning is high maintenance. RAID is low maintenance by comparison. I may be lazy, but I’d rather the system did the load balancing and not me.
Avoiding Table Scans
The most commonly claimed performance improvement for partitioning, from what I can tell, is their use in avoiding full table scans during queries. Even though the details vary from case to case, I’ve been able to distill one scenario that illustrates what’s going on most of the time:
- A query on data causes a table scan.
- Adding a secondary index doesn’t help, because the secondary index does not cover the query. The non-covering index would induce a large number of point queries into the primary table. This is slow enough that the query optimizer chooses a primary table scan instead. (Why not build the secondary index as covering? Probably because secondary indexes can be expensive to build in InnoDB. Perhaps the queries are date+user_id, then date+ip, then date+…, so you’d need a lot of secondary indexes; or even more directly, you know what secondary indexes you need, but InnoDB can’t keep up with the insertion load when you turn them all on. The first problem is solved by using TokuDB clustering indexes. The second is also solved by TokuDB, because it’s so fast at indexing.)
- The table is partitioned by date. Now, when a query induces a table scan, one or a few partitions are scanned, instead of the whole table.
The table didn’t get any smaller. We were just able to restrict a table scan to a smaller portion of the table. For me, this helps place partitions into a context of well-known DB techniques. Partitions are often a replacement for covering indexes!
After all, that’s what covering indexes are, right? They are arrangements of data that allow you to scan part of a table to answer a query, without resorting to point queries into a primary table.
If partitions are indexes, are they good ones?
Once again, I looked for comparisons of proposed partitioning schemes with indexing-based choices. In particular, I looked for comparisons of partitioning with covering schemes. No luck. And as before, if I missed something I’d love to hear about it.
If partitions are designed to avoid table scans, then they are a high-effort way to do so. I’m not suggesting that they should never be used for this, but unless the low maintenance option of covering and clustering indexing is considered, partitioning just feels like a lot of work, potentially for no benefit.
The common reasons for using partitioning have low-maintenance, high-performance alternatives. In the case of load balancing across disks, I’d rather use RAID than partition by hand. In the case of avoiding table scans, I’d rather design indexes that avoid table scans than partition by hand. If I can’t get the insertion performance I need once I’ve designed a sensible set of indexes for my query load, I’d rather use a storage engine that does fast indexing (like TokuDB) so I can maintain the indexes I need, rather than the ones I’m forced to by the storage engine.
And in any case, it’s important to compare partitioning to the most effective alternative solutions for the performance problem encountered.