MongoDB Index Shootout: Covered Indexes vs. Clustered Fractal Tree Indexes

Posted On September 6, 2012 | By Tim Callaghan | 6 comments

In my two previous blogs I wrote about our implementation of Fractal Tree Indexes on MongoDB, showing a 10x insertion performance increase and a 268x query performance increase. MongoDB’s covered indexes can provide some performance benefits over a regular MongoDB index, as they reduce the amount of IO required to satisfy certain queries.  In essence, when all of the fields you are requesting are present in the index key, then MongoDB does not have to go back to the main storage heap to retrieve anything.  My benchmark results are further down in this write-up, but first I’d like to compare MongoDB’s Covered Indexes with Tokutek’s Clustered Fractal Tree Indexes.

MongoDB Covered Indexes Tokutek Clustered Fractal Tree Indexes
Query Efficiency Improved when all requested fields are part of index key Always improved, all non-keyed fields are stored in the index
Index Size Data is not compressed Generally 10x to 20x compression, user selects zlib, quicklz, or lzma.  Note that non-clustered indexes are compressed as well.
Planning/Maintenance Index “covers” a fixed set of fields, adding a new field to an existing covered index requires a drop and recreate of the index. None, all fields in the document are always available in the index.

When putting my ideas together for the above table it struck me that covered indexes are really about a well defined schema, yet NoSQL is often thought of as “schema-less”.  If you have a very large MongoDB collection and add a new field that you want covered by an existing index, the drop and recreate process will take a long time.  On the other hand, a clustered Fractal Tree Index will automatically include this new field so there is no need to drop/recreate unless you need the field to be part of a .find() operation itself.

The benchmark previously contained an index on URI, but all queries required a lookup (and IO) to get the rest of the fields.  The following change allows lookups by URI to return creation, name, and/or origin without additional IO:

// old index, URI only
// db.tokubench.ensureIndex({URI : 1})

// new covered index
db.tokubench.ensureIndex({URI : 1, creation : 1, name : 1, origin : 1})

Other than the index change, the benchmark is unchanged.  I measured the performance of a single threaded insertion workload combined with a range query retrieving 1000 documents greater than or equal to a random URI.  The range query runs on a separate thread and sleeps 60 seconds after each completed query.

The inserted documents contained the following: URI (character), name (character), origin (character), creation date (timestamp), and expiration date (timestamp).  We created a total of four secondary indexes: URI (clustered), name, origin, and creation date.

We ran the benchmark with journaling disabled and the default WriteConcern of disabled.

My benchmark client is available here.

Benchmark Environment

  • Sun x4150, (2) Xeon 5460, 8GB RAM, StorageTek Controller (256MB, write-back), 4x10K SAS/RAID 0
  • Ubuntu 10.04 Server (64-bit), ext4 filesystem
  • MongoDB v2.2.RC0

Benchmark Results

The exit velocity of standard MongoDB was 2,594 inserts per second at 49 million document insertions versus MongoDB with Fractal Tree Indexes exit velocity of 12,241 inserts per second at 49 million document insertions: an improvement of 371%.  It is important to note that the insertion performance of standard MongoDB was still in steep decline, I suspect that it would continue dropping with further insertions.

Note that this is a latency graph where lower is better.  MongoDB exited with an average of 262 milliseconds per query versus MongoDB with Fractal Tree Indexes average of 62 milliseconds: a 322% improvement.  I cannot explain the shape of the MongoDB graph, it would be interesting to run this experiment out to 100 million inserts.

This benchmark shows that while MongoDB’s covered indexes are helpful to the overall performance of the system, the insertion performance is in steady decline and the query performance is inconsistent.

As I said in my last post, we’re not MongoDB experts by any stretch but we wanted to share these results with the community and get people’s thoughts on applications where this might help, suggestions for next steps, and any other feedback. Also, if you are interested in learning more about TokuDB, please stop by to hear us speak at StrangeLoop, MySQL Connect, and Percona Live

6 thoughts

  1. [...] Tim Callaghan is having a go at the Covered Indexes vs. Clustered Fractal Tree Indexes. [...]

  2. Oliver Falk says:

    Hi!

    This is a nice test. However, I’d be interested to see this test with my dataset. About 19 Mio. well defined XML structed documents with lot’s of fields. About 650 GB of data (unpacked).

    Could you imagine to rerun this test with this data?

    Best,
    Oliver

    1. Tim Callaghan says:

      Oliver,

      We are looking for existing MongoDB users and evaluators to try our implementation on their workloads. Please contact me at tim@tokutek.com to discuss further.

      -Tim

  3. [...] MongoDB Index Shootout: Covered Indexes vs. Clustered Fractal Tree Indexes by Tim Callaghan. [...]

  4. Mongonix says:

    Hi Tim,
    I’m following your posts with a great interest. The improvements you made to MongoDB are really impressive!

    I have a few questions:
    1) Is there any hope those improvements will be included into official upcoming MongoDB versions? Or may be you’ll provide an add-on package that can be installed on top of MongoDB? Are there ongoing discussions between both companies?

    2) It is good that your turned to NoSQL DBs after you have shown many advantages of your technology for MySQL. I also understand why you picked MongoDB, as it is a “MySQL” among NoSQL DBs.

    But do you plan or have you looked into applying similar principles to other popular NoSQL DBs? Do you have any feeling how realistic would it be to apply it there? What would be the effort?

    Specifically, it would be very interesting to hear your opinion and plans regarding bringing your technology to such DBs like:
    – CouchBase/CouchDB (also document-oriented DB like MongoDB)
    – OrientDB and Neo4j (both are graph DBs. Plus OrientDB can also work as a document-based. Both are implemented in Java)
    – Cassandra and Co?

    Thanks in advance for any feedback!

    1. Tim Callaghan says:

      Mongonix,

      Great questions.

      1) We have several MongoDB users evaluating the combination of MongoDB + Fractal Tree Indexes. Our current plan is to continue the development effort, analyze feedback from evaluators, and make a decision in the near future.
      2) Our current focus is on enhancing our core Fractal Tree Index technology, TokuDB, and the MongoDB + Fractal Tree Index project. If we pursue additional implementations it will likely be on systems that manage large data and implement B-trees.

Leave a Reply

Your email address will not be published. Required fields are marked *