Introducing Ark: A Consensus Algorithm For TokuMX and MongoDB

Posted On July 18, 2014 | By Zardosht Kasheff | 4 comments

Most of the time, our blog posts explain what’s great about the MongoDB improvements we’ve already shipped in TokuMX. Sometimes, though, it’s fun to talk about what’s coming soon, especially when user feedback would really help get the feature right. In my next series of blog posts, I get to geek out and talk about a feature we have been developing that I personally find really exciting: Ark.

What is Ark?

Ark is an implementation of a consensus algorithm (also known as elections) similar to Paxos and Raft that we are working on to handle replica set elections and failovers in TokuMX. It has many similarities to Raft, but also has some big differences.

Here is a tech report that explains the algorithm and provides proofs of correctness. Please download it and tell us what you think.

Why design Ark?

In short: to fix known problems with the election protocol used by TokuMX and MongoDB.

As many know, MongoDB’s existing election protocol has issues. Kyle Kingsbury, known as “aphyr” on twitter, showed some basic behavioral problems when analyzing MongoDB as part of his Jepsen series of blog posts analyzing the impact of network partitions on distributed databases. A conclusion he arrived at was “MongoDB is neither AP nor CP”, when considered in the context of the CAP theorem.

Because TokuMX inherited MongoDB’s election protocol, we inherited these problems, and we want to fix them.

Our main goal is to modify the election protocol to make TokuMX a true CP system. That is, in the face of network partitions, TokuMX will remain consistent. To do so means ensuring that any write that is successfully acknowledged with majority write concern is never lost in the face of a network partition. This is not currently the case for TokuMX and MongoDB.

Additionally, we want to fix other known user experience issues that we know of, such as SERVER-9848 and SERVER-8084, along with issues we discovered while analyzing the code, such as SERVER-14382. The high level goal is to improve failover behavior. The tech report linked above details the issues we see and our approach for fixing them.

Is Ark implemented?

Currently, yes. We have an implementation on github in the election3-sandbox branch of TokuMX.

But we have yet to ship it.

That paper is heavy reading! Is there a simple explanation of what has been done?

Not yet :). Over the next few series of posts, I will be explaining in layman’s terms what the algorithm is, in smaller, more digestible pieces. So stay tuned…

Are there any shortcomings to Ark?

Nothing is perfect :). We hope the community can give us feedback on what we’ve done. The biggest shortcoming of Ark is that it currently does not address replica set reconfigurations. The Raft paper does. We have not addressed reconfigurations yet because we are looking to improve this area in steps. In short, we are busy and need to manage this project in incrementally. We want to get this important first step of fixing majority write concern and other important issues done, before addressing what we hope is a more rare scenario of handling configuration changes during network partitions.

What can the community do?

In short, give us feedback, any feedback. If there is feedback from reading the tech report, we’d love to hear it. If there is feedback from reading the code, we’d love to hear it. If anyone would like to run the code, please let us know and we will happily provide a not-for-production binary containing the new algorithm.

So, in short, tell us anything.

4 thoughts

  1. Wow. Thanks for documenting a few serious flaws. And more thanks for fixing them. Do you include Jepsen tests in your QA?

    1. Zardosht Kasheff says:

      The difficulty with running Jepsen is that we’ve struggled using it to hit these bugs in TokuMX 1.5. Luckily, TokuMX 1.5 has some changes over basic MongoDB that makes this bug harder to hit. As I’ll detail in future posts, basic MongoDB currently uses a timestamp as the most significant bits to their GTID. As I mention in http://www.tokutek.com/2014/03/comparing-a-tokumx-and-mongodb-oplog-entry/, we use two sequence numbers as our GTID. I did not say so in the post, but the reason we did this is we anticipated fixing elections with something like Ark.

      I think the reason we could not get Jepsen to hit this bug with TokuMX is that our GTID made it much harder to hit. We used carefully crafted unit tests for our QA so far. They can be seen in the election3-sandbox branch I link to in the post. A couple of the main tests we ran are those I created in https://groups.google.com/forum/#!searchin/mongodb-dev/transient/mongodb-dev/-mH6BOYyzeI/zYJzFuCZuesJ and https://groups.google.com/forum/#!searchin/mongodb-dev/SERVER-9848/mongodb-dev/WA–aofOjQI/cF2OBqorZxkJ

  2. Andy says:

    Why is Ark limited to MongoDB? Will it be used for TokuDB on MySQL?

    1. Zardosht Kasheff says:

      Ark is designed for MongoDB because it is built on top of MongoDB’s existing replication and HA algorithms.

Leave a Reply

Your email address will not be published. Required fields are marked *