What’s new in TokuMX 1.4, Part 3: Optimized updates

Posted On February 19, 2014 | By Leif Walsh | 3 comments

We just released version 1.4.0 of TokuMX, our high-performance distribution of MongoDB. There are a lot of improvements in this version (release notes), the most of any release yet. In this series of blog posts, we describe the most interesting changes and how they’ll affect users.

In TokuMX 1.4.0, we improved performance by making two big changes to how updates are implemented:

  1. On a single machine, updates that modify a document without modifying any indexed fields send a smaller message into the primary key index to modify the document.  Instead of sending a full copy of the modified document down the tree to overwrite the old one, TokuMX encodes just the modification in a message that gets applied to the existing row when it reaches the bottom of the tree.  For the right workload, this can drastically reduce the amount of cache used by internal node buffers, effectively increasing the amount of cache available for your working set.
  2. In the oplog, instead of logging the old and new documents, just the old document is logged along with the update modifications, and the new document is reconstructed on the secondaries.  In extreme cases, this can reduce the size of the oplog, and the network bandwidth used for replication, to approximately half of its pre-1.4 size.
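
The first change can be sketched in a few lines.  This is a hypothetical illustration, not TokuMX's actual message format: the point is that when no indexed fields change, the message traveling down the tree only needs to carry the modification, not a full replacement copy of the document.

```python
import json

def apply_update_message(stored_doc, mods):
    """Apply a small {"$set": {...}}-style update message to the stored
    document, the way a message is applied to the existing row when it
    reaches the bottom of the tree.  (Illustrative, not TokuMX's code.)"""
    doc = dict(stored_doc)
    for field, value in mods.get("$set", {}).items():
        doc[field] = value
    return doc

# A large document in which one small, unindexed field is updated.
old_doc = {"_id": 1, "payload": "x" * 1000, "views": 10}
mods = {"$set": {"views": 11}}

# Pre-1.4 style: a full copy of the new document goes down the tree.
full_message = apply_update_message(old_doc, mods)
# 1.4 style: only the encoded modification goes down the tree.
small_message = mods

print(len(json.dumps(full_message)), len(json.dumps(small_message)))
```

For large documents with small updates, the encoded modification is orders of magnitude smaller than the full document, which is where the cache savings in internal node buffers come from.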

In MongoDB, updates are performed by first searching for the document to update, then either applying the update in place (if it can be done without growing the document) or writing a new copy of the document, and then updating all the indexes regardless of what the update actually changed.  This process is the same on all secondaries, which means that for working sets larger than RAM, update workloads can quickly consume all of the I/O resources in the cluster.

In TokuMX, updates still do a search on the primary to find the original document, but after that point we do no more random I/O for the update.  Instead, we send messages into the affected indexes (the way all modifications to Fractal Tree indexes happen) that delete invalidated entries and replace them with new versions.  Because TokuMX secondary indexes use logical rather than physical primary key identifiers, there is nothing special about growing a document: only the keys actually affected by the update modification are updated, and it is all done without any further random I/O, the same as normal inserts.
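
A minimal sketch of why logical identifiers matter, with made-up data structures (a real Fractal Tree index is nothing like a Python dict): because a secondary index entry references the document by primary key rather than by disk location, growing an unindexed field generates no secondary-index work at all.

```python
# Hypothetical secondary index on "user": entries are
# (indexed value, primary key) pairs, i.e. logical identifiers.
secondary_index = {("alice", 1): True, ("bob", 2): True}

def update_doc(doc, mods, indexed_fields):
    """Return the updated doc plus the delete/insert messages the update
    sends into secondary indexes.  Only keys whose indexed values
    actually changed generate messages.  (Illustrative sketch.)"""
    new_doc = {**doc, **mods}
    messages = []
    for field in indexed_fields:
        if doc.get(field) != new_doc.get(field):
            messages.append(("delete", (doc[field], doc["_id"])))
            messages.append(("insert", (new_doc[field], doc["_id"])))
    return new_doc, messages

doc = {"_id": 1, "user": "alice", "bio": "short"}
# Grow an unindexed field: no secondary-index messages are generated,
# even though the document itself got bigger.
_, msgs = update_doc(doc, {"bio": "a much longer biography..."}, ["user"])
print(msgs)  # []
```

Only an update that changes `user` itself would produce a delete/insert pair for that index, and even those are buffered messages rather than random I/O.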

For the secondaries in a replica set, TokuMX puts more information in the oplog than MongoDB does, so that the secondaries have enough information just from the oplog entry to replicate the update (deletes and inserts into affected indexes) without doing any I/O at all, which makes them much more effective for scaling read workloads.  However, for coding simplicity, until 1.4.0, the entire old and new versions of the document were put into the oplog.

Since MongoDB, like many NoSQL databases, encourages denormalization, many existing MongoDB applications frequently update small portions of large documents.  For a small update, TokuMX might log two copies of a very large document.  These would tend to compress well in the oplog, but it is still a significant waste of space, and the MongoDB wire protocol doesn’t support compression (yet), so this inflates network traffic in a replica set.
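
The oplog saving is easy to see with a rough sketch.  The field names below are illustrative, not TokuMX's actual oplog format, and JSON sizes stand in for BSON sizes, but the shape of the comparison holds: for a small update to a large document, an entry carrying the old document plus the modifications approaches half the size of one carrying both the old and new documents.

```python
import json

old_doc = {"_id": 1, "profile": "x" * 500, "counter": 7}
mods = {"$inc": {"counter": 1}}
new_doc = {**old_doc, "counter": 8}

# Pre-1.4 style: both copies of the document go into the oplog.
pre_14_entry = {"op": "u", "old": old_doc, "new": new_doc}
# 1.4 style: the old document plus the modifications; secondaries
# reconstruct the new document by applying the mods.
entry_14 = {"op": "u", "old": old_doc, "mods": mods}

pre_size = len(json.dumps(pre_14_entry))
new_size = len(json.dumps(entry_14))
print(pre_size, new_size)
```

The larger the document relative to the update, the closer the ratio gets to the approximately-half reduction described above, and the same saving applies to replication network traffic.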

There is still work remaining, beyond the two changes above, to reduce the oplog size further in future releases, but this is a big step forward for some of our users.

Most users should simply upgrade and immediately see the benefits.  Keep in mind, though, that because this change adds a new type of message that a primary can put in the oplog, it is unsafe to run a replica set with a primary at version 1.4.0 or later and any secondaries at an earlier version.  To upgrade a replica set, upgrade all the secondaries first (without letting an upgraded node step up to become primary—setting their priority to zero ensures this), then step one of them up to be primary and upgrade the original primary.  This process is covered fully in the Users Guide.
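
As a sketch of the priority-zero step, here is the shape of the reconfiguration expressed as plain data.  The hostnames and set name are invented, and in a real deployment you would fetch the config with `rs.conf()` and apply it with `rs.reconfig()` in the shell; this only shows which fields change.

```python
# Hypothetical replica set config document (invented hosts/set name).
config = {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "db0:27017"},  # current primary, left alone
        {"_id": 1, "host": "db1:27017"},
        {"_id": 2, "host": "db2:27017"},
    ],
}

# Before upgrading the secondaries, zero their priority so an upgraded
# node cannot step up to primary while older nodes remain in the set.
for member in config["members"][1:]:
    member["priority"] = 0

print([m.get("priority") for m in config["members"]])  # [None, 0, 0]
```

Once every node is running 1.4.0, priorities can be restored and the old primary stepped down and upgraded last.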

Want to check out the newest version of TokuMX?  Download TokuMX 1.4.0 here:

MongoDB Download

3 thoughts

  1. James Blackburn says:

    I have a pull-request for MongoDB to make small-updates to large documents much more efficient:
    https://jira.mongodb.org/browse/SERVER-12886

    MongoDB is actually worse than this by default. If an update is made to an indexed field or an array, the entire document is always re-written. This can be pretty pathological for large documents and multi-updates.

    With this change, the I/O generated is approximately the size of the update rather than the size of the document.

    1. Leif Walsh says:

      This is interesting, but doesn’t really apply to TokuMX. TokuMX doesn’t update data in place on disk; the data structure reorders and consolidates lots of small writes into larger block writes. So even if MongoDB accepts your pull request, it is irrelevant for TokuMX (that code has already been replaced), and TokuMX will still write far less than MongoDB. Also, an optimization like yours doesn’t work in a context with MVCC.

      But, it’s a good optimization and I hope they take it.

      1. James Blackburn says:

        Oh agreed. I was just pointing out that improvements can be made to optimize updates in the MongoDB storage engine :).

        I’m looking at your sysbench and noticed that it uses quite small document sizes. It would be interesting to have an option to try more realistic larger document sizes too.
