Are You Forcing MySQL to Do Twice as Many JOINs as Necessary?
|This guest post is from our friends at Percona. They’re hosting Percona Live London from October 24-25, 2011. Percona Live is a two day summit with 100% technical sessions led by some of the most established speakers in the MySQL field.
In the London area and interested in attending? We are giving away two free passes in the next few days. Watch our @tokutek twitter feed for a chance to win.
Did you know that the following query actually performs a JOIN? You can’t see it, but it’s there:
SELECT the_day, COUNT(*), SUM(clicks), SUM(cost) FROM ad_clicks_by_day WHERE the_day >= '2005-07-01' AND the_day < '2005-07-07' GROUP BY the_day;
Let me explain.
Suppose you define the table as follows:
CREATE TABLE ad_clicks_by_day ( customer INT NOT NULL, the_day DATE NOT NULL, clicks INT NOT NULL DEFAULT 0, cost INT NOT NULL DEFAULT 0, PRIMARY KEY(customer, the_day), INDEX(the_day) ) ENGINE=InnoDB;
What happens when MySQL executes this query? Here's the EXPLAIN:
*************************** 1. row *************************** id: 1 select_type: SIMPLE table: ad_clicks_by_day type: range possible_keys: the_day key: the_day key_len: 3 ref: NULL rows: 368 Extra: Using where
That looks fine, doesn't it? It's using an index range scan to find approximately 368 matching rows, and adding up the clicks and cost for them. Can it be done any more efficiently than this?
In fact, it turns out that it's possible to execute this query much more efficiently. What happens internally when the server executes this query is that it begins reading from the 'the_day' index, and for each row it finds, it performs a lookup in the primary key (which, in InnoDB, stores the whole table) to find the other columns mentioned in the query. That is, it's joining the index to the primary key! This is not just an ivory-tower abstraction. In real-life workloads, the random I/O caused by such index-to-table joins slows queries dramatically.
The “join” here isn’t really the same as a true table-to-table SQL JOIN in some ways, and it has a little less overhead at the server level than a table-to-table join, but in terms of the way an index-to-row lookup works, it’s quite similar in many ways.
If you're skilled at logical and physical database design, you already noticed that my example can be improved by indexing differently: just put the primary key on (the_day, customer) instead of the other way around. Then the query will use the primary key instead of the secondary key, and the whole row is available to the query without doing a join to another index. This is called a covering index, which means that the index covers the whole query--there is no need for any data outside the index. You'll see "Using index" in the Extra column from EXPLAIN when an index covers a query.
But then what if we want to group the results by customer? No problem, we can just add another index. To achieve the same results for ad-hoc querying, we'll need to index every column:
ALTER TABLE ad_clicks_by_day ADD INDEX(the_day, customer, clicks, cost);
That works, but isn't that going to be kind of expensive? We're essentially duplicating the whole table. And what if we have some other columns we want to target for grouping, or filtering, or sorting, or any other operation that can be optimized with an index? Uh-oh, that’s going to be a lot of indexes.
The overhead of maintaining indexes is often mentioned, but rarely measured. It is sometimes exaggerated as a result. (For further reading on this point, see the book Relational Database Index Design and the Optimizers by Tapio Lahdenmaki.) However, if you did measure the overhead, you would notice two important things: increased disk I/O for modifications to the table, and increased disk space consumption. Even if it's sometimes blown out of proportion, it's real.
That's why I like TokuDB's indexing technology and modifications to the storage engine API. Indexes are much cheaper in TokuDB than they are in InnoDB, for several reasons:
- TokuDB compresses its data quite well, which reduces the footprint both in space consumption and in I/O operations.
- TokuDB lets you define multiple clustering indexes--indexes that sort and compare only on the key columns, but which include all columns in the table, transparently. That avoids invisible JOINs. I want this feature in InnoDB.
- TokuDB uses a different data structure for indexes, which is significantly more efficient. Instead of the classic B-Tree structure, TokuDB uses Fractal Trees, which don't slow down when they get bigger than memory.
As a result of these three properties of TokuDB, you can maintain a lot of indexes on a lot of data relatively cheaply. And that means that you can index your data to match your query patterns, and get much faster performance by removing a lot of random I/O operations caused by finding data and then looking up the rest of the row.
Disclosure: Tokutek paid Percona to evaluate the technology, but it is our practice to offer only our honest opinion in blog posts such as this one.
About the Author
Baron is Percona's Chief Performance Architect. He's the lead author of High Performance MySQL, and creator of Maatkit. He consults with Percona's customers, and develops tools and practices for Percona's team.