Clustering indexes vs. Covering indexes

Posted On May 28, 2009 | By Zardosht Kasheff | 8 comments

Yesterday, I (Zardosht) posted an entry introducing clustering indexes. Here, I elaborate on three differences between a clustering index and a covering index:

  1. Clustering indexes can create indexes that would otherwise bounce up against the limits on the maximum length and maximum number of columns in a MySQL index.
  2. Clustering indexes simplify syntax making them easier and more intuitive to use.
  3. Clustering indexes have smaller key sizes leading to better performance.

Expanding MySQL’s Limits

MySQL allows at most 16 columns in an index and at most 3072 bytes per index. For tables that have more than 16 columns or a row size of greater than 3072 bytes, one cannot create a covering index that includes all of the fields. Suppose one had the following table with 26 columns:

create table foo (a int, b int, c int, ..., y int, z int);

In standard MySQL, one cannot create a covering index for the query:

select * from foo where a=a_num and b=b_num;

However, one can create the following clustering index to speed up the query as:

alter table foo add clustering index (a,b);

Syntactic simplification

The intent of a clustering index definition is easier to understand because it only includes columns necessary to achieve the desired sort order. Columns needed to satisfy the select clause can be left out. As users add more indexes to speed up more queries, simplifying schema definitions becomes increasingly important.

Consider the table:

create table foo (
	a int,
	aa int,
	aaa int,
	aaaa int,
	b int,
	bb int,
	bbb int,
	bbbb int,
	c int,
	cc int,
	ccc int,
	cccc int,
	d int,
	dd int,
	ddd int,
	dddd int
	);

Now suppose you want to speed up this query:

select * from foo where a=a_num and b=b_num;

To create a covering index for this query you do:

alter table foo add index cvr_idx (a,b,aa,aaa,aaaa,bb,bbb,bbbb,c,cc,ccc,cccc,d,dd,ddd,dddd);

To create a clustering index for this you do:

alter table foo add clustering index cvr_idx (a,b);

Another example of the simplification shows up when want to add a column. With clustering indexes you can write

alter table foo add column z int;

and you don’t need to change the index definition. With a covering index, you need to:

alter table foo drop index cvr_idx;
alter table foo add index cvr_idx (a, b, aa, aaa, aaaa, bb, bbb, bbbb, c, cc, ccc, cccc, d, dd, ddd, dddd, z);

If you don’t drop and re-add the covering index, it won’t cover the new column. Covering indexes have similar problems if you drop a column.

Smaller key sizes

The key length of a clustering index is much smaller than that of a covering index. Smaller key lengths have performance benefits. Inserting into a clustering index is often faster than inserting into an equivalent covering index. Some other (non-MySQL) databases provide syntax to include nonkey columns in an index. One can think of clustering indexes as an extreme case, in which all the columns are added as nonkey columns. As Kristian Nielsen phrased it the comments of the earlier posting, the extra fields are stored only in the leaves of a B-Tree, not in the internal nodes. That can be useful if the row size is much larger than the key size. (Although Fractal Tree® indexes aren’t exactly organized like B-trees, this insight is essentially correct.)

Tradeoffs for TokuDB vs. other storage engines

Maintaining clustering indexes has a similar cost to maintaining covering indexes. Specifically, data gets inserted into multiple places. This is the tradeoff for many storage engines: to use clustering indexes to get faster queries, one must use more I/O operations and extra disk storage.

For TokuDB these tradeoffs are a lot easier to make. TokuDB can insert data an order of magnitude faster than other storage engines, while maintaining indexes. Percona measured our insertion rate at 10.6x faster than InnoDB’s, and measured our database size at 5x to 6x smaller than InnoDB and MyISAM. Thus, the tradeoff is more manageable for TokuDB than it is for other storage engines. This was the motivation for implementing the clustering index feature.

Conclusion

Clustering indexes provide the benefits of covering indexes in an easy-to-use manner. Clustering indexes offer immediate performance gains that are difficult or impossible to achieve with covering indexes. TokuDB can maintain indexes while inserting data an order of magnitude faster than other storage engines, and compresses the data, making clustering keys really valuable.

8 thoughts

  1. Oops, adding z to cvr_idx made the index infeasible (17 columns). I guess that confirms the argument about simplicity…

  2. Xaprb says:

    Why are clustering indexes much smaller than covering indexes? Unless I’m misunderstanding your meaning, this is really not true of InnoDB, for example. Or are you talking about your indexes, and not indexes in general?

  3. I may be misunderstanding your question, but I think the point we were trying to make is that the *key* of a clustering index is typically smaller than the *key* of a covering index. Because in MySQL standard syntax, there is no way to add a column to an index without adding it to the key. Bigger keys can suffer performance problems. (For example, they may reduce the fanout of B-tree-like data structures).

    The clustering key syntax allows us to store all the extra columns without throwing them into the key.

  4. Xaprb says:

    Right. The irritating “key” vs. “index” mis-nomenclature in MySQL strikes again. I understand you now!

  5. […] up with the insertion load when you turn them all on. The first problem is solved by using TokuDB clustering indexes. The second is also solved by TokuDB, because it’s so fast at […]

  6. […] since this is TokuDB, you also get to define clustering indexes, which reference all columns and therefore speed up lots of queries. The HCAD message gets injected […]

  7. Harsha Hegde says:

    Once a clustered is created on a table which has other non-clustered indexes, mysql’s ‘show indexes’ will say “Type of Index” is BTREE for both. Which is true in a sense that both are built using a BTREE data-structure. Does Mysql provide another way to find out the real “type” of index? Or does TokuDB provide any utilities that enhance MySQL’s Information Schema to include this information?

    1. zardosht says:

      If you run “show create table table_name;”, that should provide the table’s schema and tell you if the index is clustering or not.

Leave a Reply

Your email address will not be published. Required fields are marked *