Introducing Multiple Clustering Indexes

Posted On May 27, 2009 | By Zardosht Kasheff | 23 comments

In this posting I’ll describe TokuDB’s multiple clustering index feature. (This posting is by Zardosht.)

In general (not just for TokuDB) a clustered index or a clustering index is an index that stores the all of the data for the rows. Quoting the MySQL 5.1 reference manual:

Accessing a row through the clustered index is fast because the row data is on the same page where the index search leads. If a table is large, the clustered index architecture often saves a disk I/O operation when compared to storage organizations that store row data using a different page from the index record.

Most storage engines allow at most one clustered index for each table. For example, MyISAM does not support clustered indexes at all, whereas InnoDB allows only the primary key be a clustered index.

In TokuDB, we have added functionality and grammar syntax to define multiple clustered indexes. As a result, users can get the performance advantages of clustered indexes for multiple indexes.

How to define multiple clustered indexes

To define a clustered index in TokuDB, simply add the key word “clustering” in front of any index that is to be created. Here are some examples:

create table foo (a int, b int, clustering key (a)) engine=tokudb;
alter table foo add clustering index clstr_key(a);
create clustering index clstr_key on foo(a);

How clustering keys work

Like InnoDB, TokuDB clusters data with its primary key (if no primary key is defined, then TokuDB auto-generates a hidden one). The primary index maintains a copy of the entire row. Each secondary index stores, for each row, the row’s secondary key and primary key.

A clustering index maintains a copy of the entire row, not just the primary key. As a result, when querying on a clustering key lookups in the primary index are always avoided. A clustering index is a covering index for any query. Another way to think of a clustering index is that it is a materialization of the table sorted in another order.

An example using clustering keys

Suppose we had 40M rows in an iiBench table with this schema:

Create Table: CREATE TABLE `purchases_index` (
 `transactionid` bigint(20) NOT NULL AUTO_INCREMENT,
 `dateandtime` datetime DEFAULT NULL,
 `cashregisterid` int(11) NOT NULL,
 `customerid` int(11) NOT NULL,
 `productid` int(11) NOT NULL,
 `price` float NOT NULL
) ENGINE=TOKUDB AUTO_INCREMENT=40000001 DEFAULT CHARSET=latin1

Now suppose we wanted to perform the following queries in sequence:

  • Query 1:
    select sum(price) from purchases_index where customerid > 51 and customerid < 55;
      

    which scans more than 1200 rows.

  • Query 2:
    select * from purchases_index where customerid = 100;
      

    which returns 406 rows.

  • Query 3:
    select productid, dateandtime from purchases_index where customerid = 200 or customerid=300;
      

    which returns 820 rows.

Ideally, to perform the first query, we would want a covering index of (customerid, price) on the table. To perform the third query, we would want a covering index of (customerid, productid, dateandtime). For the second query, a covering index would need to include all the fields.

With other storage engines, the common solution is to define a key on the field (customerid). Here are the results with TokuDB when using a key on customerid:

Query 1: 9.47 seconds
Query 2: 3.11 seconds
Query 3: 6.16 seconds

Here are the results with MyISAM when using a key on customerid:

Query 1: 2.66 seconds
Query 2: 0.91 seconds
Query 3: 1.78 seconds

Here are the results with TokuDB when using a clustering key on customerid:

Query 1: 0.14 seconds
Query 2: 0.06 seconds
Query 3: 0.20 seconds

As we can see, clustering indexes provide the same order of magnitude of performance as covering indexes, but for a broader range of queries.

A limitation of clustering keys

A clustering key cannot be defined to be unique.

23 thoughts

  1. Eric Day says:

    Hi! I can see this being very useful for some workloads, but I think it is very important to mention the main reason why most engines don’t support multiple clustered indexes: INSERT and UPDATE performance will degrade and disk usage will go up. If you maintain multiple copies of the data, it’s that many more I/O ops that need to happen when the data changes.

  2. @Eric, thanks for the comment – excellent points! Because TokuDB provides very high indexed insertion rates and compression, TokuDB can really make clustering indexes work. In many cases, TokuDB can meet or exceed insertion rate and disk use requirements even with multiple clustering and/or non-clustering indexes, even on large tables.

  3. Kristian Nielsen says:

    How is this any different from a covering index in MyISAM (or InnoDB) obtained by just appending all remaining columns after the indexed keys?

    Especially with the restriction on no unique indexes, that seems to be exactly the same?

    From the description it sounds like this is just a syntax shortcut for creating covering indexes, which are supported in MyISAM and InnoDB (but not Falcon). But a covering index is not a clustered index, since as Eric writes it does not improve insert/update speed in any scenarios.

  4. @Kristian, good questions. I will address covering indexes vs. clustering indexes in a blog post later today.

  5. Venu Kalyan says:

    Can you describe how much extra storage is needed for each add-on clustered index/table ?

  6. @Venu, with TokuDB, we typically measure compression ratios of 3x to 12x relative to the amount of raw data in a table. Since a clustering index is another copy of the table in a different sort order, it will probably consume somewhere between 1/3 and 1/12 of the table’s raw data size.

  7. Kristian Nielsen says:

    Ah, now I think I see the difference between this and normal covering indexes. The difference here is that the extra fields are only stored in the leaf nodes (assuming Btree index). The keys in interior nodes need not store the extra fields, unlike in traditional covering indexes where the extra fields needlessly increase the key size proper.

    That might be useful, especially when the row size is much bigger than the key size (not at all uncommon).

  8. Mark Callaghan says:

    This is very nice. I missed this post when you first wrote it. It can be used to provide much better bounds on worst-case query performance (a few disk reads versus thousands) and do a much better job at making queries ‘online’.

  9. […] Summary: A query like TPC-H Query 17 can be sped up by large factors by using straight_joins and clustering indexes. (This entry posted by […]

  10. […] blogging about TokuDB’s multiple clustering indexes feature, Baron Schwartz suggested we contribute the patch to allow other storage engine to […]

  11. […] recently posted a blog entry on clustering indexes, which are good for speeding up queries. Eric Day brought up the concern that clustering indexes […]

  12. […] I (Zardosht) posted an entry introducing clustering indexes. Here, I elaborate on three differences between a clustering index and a covering […]

  13. […] TokuDB, we have added functionality and grammar syntax to define multiple clustered indexes. As a result, users can get the performance advantages of clustered indexes for multiple indexes. […]

  14. moses wejuli says:

    hi zardosht,

    got a few queries for u here..

    1) i know clustering indexes include all columns for the table, however, am keen to know if we can define multi-column clustering indexes for purposes of sorting e.g.
    — create clustering index clstr_key on foo(a,b,c)
    where column c is sorted relative to b, and b sorted relative to a, etc….. is this feasible?!

    2) where does the index sorting happen with TokuDB? On disk or in memory? at what point do we get back to memory during this process?

    3) with the advent of multiple clustering indexes per table with TokuDB, how useful are single column indexes for multi-row, multi-field result sets? seems to me that single column indexes will only be useful for a single row fetches only, if at all!

    3) what are the advantages of defining a clustering index with a column that has been defined as a single row index elsewhere? or will that only be a waste of index?

    4) what is the maximum number of columns that can be defined as part of a single composite or covering index (please specify for both TokuDB and InnoDB).

    thanks,

    moses.

    1. Zardosht Kasheff says:

      1) Yes, this is feasible. A clustering index can be defined just as a normal index is defined.

      2) Both! For more information, see http://tokutek.com/downloads/mysqluc-2010-fractal-trees.pdf

      3a) I assume you are asking about the advantages of single column non clustering indexes, such as key(a). Such an index takes less space on disk (so that is an advantage), but covers fewer queries. I don’t think there is an advantage to single row fetches, because if the index is not covering, then this may result in 2 disk seeks, where as a covering index may answer the query in one disk seek.

      3b) I assume you are asking if there is an advantage to having both a clustering key (a) and a key(a). I see no advantage, other than for a small set of queries for which both key(a) and clustering key(a) are covering, key(a) is a little more efficient.

      4) In our releases, the maximum number of columns is 32.

  15. just-working says:

    Hi, I’m not sure where to ask this question. But I’m very interested in this database engine for a high score database. The main index will be user-id,game-id. The secondary index will be game-id,score,timestamp. Would it help if the secondary index is also clustered because then it’s already ordered and it will allow fast retrieval of scores as they are stored in ranking order?

    But that’s what brings me to my next question (my main question). It’s about rank. If I have only the user-id and the game-id of a record, can I find the rank of that user’s score compared to others with the same game-id in O(log n) time? It would use the first index to find the user, then the second index to count the number of records that are higher ranking so that’s two look ups. It seems feasible but it depends on the engine and I don’t know if the engine keeps track of how many nodes there are greater or less than at every node or not. I’m talking about a potentially very large database with many millions of records and hundreds of thousands of records with each game-id, and many of such look ups (maybe 200 of them in a row), so that’s why I am interested in this. 200 O(log n) wouldn’t be so bad if it doesn’t have to perform a count every time.

    1. Zardosht Kasheff says:

      Please email tokudb-user google group. That is the appropriate place for this discussion

  16. Raghu ram reddy says:

    How multiple clustered indexes sorts the data?
    If we have one clustered index then the data will be in sorted order.what if have multiple clustered index then how the data is sorted?can anyone explain physical data when we create multiple clustered index??

    1. Zardosht Kasheff says:

      Each clustering index has a copy of the data sorted in that index’s order

Leave a Reply

Your email address will not be published. Required fields are marked *