In this post we’ll describe a query that accrued significant performance advantages from using a relatively long index key. (This posting is by Zardosht and Bradley.)
We ran across this query recently when interacting with a customer (who gave us permission to post this sanitized version of the story):
SELECT name, Count(e2) AS CountOfe2 FROM (SELECT distinct name, e2 FROM (SELECT atable.NAME AS name, pd1.NAME AS e2 FROM atable INNER JOIN atable AS pd1 ON (atable.id = pd1.id) AND (atable.off = pd1.off) AND (atable.len = pd1.len)) ent WHERE ((ent.name<>ent.e2))) outside GROUP BY outside.name order by CountOfe2 desc;
With a table defined as
CREATE TABLE `atable` ( `id` varchar(25) DEFAULT NULL, `off` bigint(20) DEFAULT NULL, `len` bigint(20) DEFAULT NULL, `name` varchar(1000) DEFAULT NULL, `x` bigint(20) DEFAULT NULL, `y` bigint(20) DEFAULT NULL, `pid` varchar(25) DEFAULT NULL, `typ` varchar(25) DEFAULT NULL, KEY `nameIDX` (`name`), KEY `typIDX` (`typ`), KEY `idIDX` (`id`), KEY `lenIDX` (`len`), KEY `offIDX` (`off`), KEY `pidIDX` (`pid`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1; SET character_set_client = @saved_cs_client;
The actual table has about 30,000 rows, so it fits in main memory.
We concluded that with an index better suited for this query, we can speed up this query using either MyISAM and TokuDB’s storage engine for MySQL. However, TokuDB’s indexing flexibility provides about a 3x performance advantage over MyISAM.
All experiments were run on a Core 2 duo 2.4Ghz machine with 3GB RAM running Linux.
Here is what we did.
First, we ran the query as-is, to see the difference between TokuDB and MyISAM:
- MyISAM ran the query in 98s.
- TokuDB ran the query in 23.7s.
Next, we added an index to atable to make the inner join run faster. For TokuDB, we added a covering index comprising (id, off, len, name). For MyISAM, we added an index comprising (id, off, len), which covers the predicate, but not the whole query. We could not add name to MyISAM’s index, because MyISAM’s key size limit would not allow it. The customer had defined the length of the name field, so shortening it didn’t seem feasible. The new query times were:
- MyISAM: 17.149s
- TokuDB: 17.4s
which helped MyISAM a lot, and helped TokuDB a little.
To get more speedup, we observed that this query generates two temporary tables for the following parts:
- select distinct name, e2 ...
- select name, count(e2) as CountOfe2 ...
These temporary tables do not have a good index of (name, e2). If the temporary tables had this index, we could speed up the query even further. To do this, we transformed the query into:
create temporary table tmp_prox (name varchar(1000), e2 varchar(1000), key (name, e2))engine=tokudb; create temporary table tmp_prox2 (name varchar(1000), e2 varchar(1000), key (name, e2))engine=tokudb; insert into tmp_prox ( SELECT atable.NAME AS name, pd1.NAME AS e2 FROM atable INNER JOIN atable AS pd1 ON (atable.id = pd1.id) AND (atable.off = pd1.off) AND (atable.len = pd1.len) ); insert into tmp_prox2 ( SELECT distinct name, e2 from tmp_prox where ((name<>e2)) ); SELECT name, Count(e2) AS CountOfe2 FROM tmp_prox2 GROUP BY name order by CountOfe2 desc;
We could not do this with MyISAM, because the index (name, e2) exceeded MyISAM’s maximum index size. Running these statements with TokuDB took 4.9 seconds.
Thus, a covering index speeds up MyISAM by a factor of 5.7, and TokuDB’s indexing flexibility provided an additional 3x speedup. This transformation isn’t really specific to TokuDB. TokuDB really differentiates itself from other storage engines when maintaining indexes that don’t fit in main memory. This table fits in main memory, so transformation would probably work on some other storage engines that accept bigger index key lengths than does MyISAM. When the table does get bigger, TokuDB’s Fractal Tree® indexes can create and maintain those various indexes on cheap hardware even at high data arrival rates.