Introducing Ark: A Consensus Algorithm For TokuMX and MongoDB

Most of the time, our blog posts explain what’s great about the MongoDB improvements we’ve already shipped in TokuMX. Sometimes, though, it’s fun to talk about what’s coming soon, especially when user feedback would really help get the feature right. In my next series of blog posts, I get to geek out and talk about a feature we have been developing that I personally find really exciting: Ark.

What is Ark?

Ark is an implementation of a consensus algorithm (also known as elections) similar to Paxos and Raft that we are working on to handle replica set elections and failovers in TokuMX. It has many similarities to Raft, but also has some big differences.

Here is a tech report that explains the algorithm and provides proofs of correctness. Please download it and tell us what you think.

Why design Ark?

In short: to fix known problems with the election protocol used by TokuMX and MongoDB.

As many know, MongoDB’s existing election protocol has issues. Kyle Kingsbury, known as “aphyr” on twitter, showed some basic behavioral problems when analyzing MongoDB as part of his Jepsen series of blog posts analyzing the impact of network partitions on distributed databases. A conclusion he arrived at was “MongoDB is neither AP nor CP”, when considered in the context of the CAP theorem.

Because TokuMX inherited MongoDB’s election protocol, we inherited these problems, and we want to fix them.

Our main goal is to modify the election protocol to make TokuMX a true CP system. That is, in the face of network partitions, TokuMX will remain consistent. To do so means ensuring that any write that is successfully acknowledged with majority write concern is never lost in the face of a network partition. This is not currently the case for TokuMX and MongoDB.

Additionally, we want to fix other known user experience issues that we know of, such as SERVER-9848 and SERVER-8084, along with issues we discovered while analyzing the code, such as SERVER-14382. The high level goal is to improve failover behavior. The tech report linked above details the issues we see and our approach for fixing them.

Is Ark implemented?

Currently, yes. We have an implementation on github in the election3-sandbox branch of TokuMX.

But we have yet to ship it.

That paper is heavy reading! Is there a simple explanation of what has been done?

Not yet :). Over the next few series of posts, I will be explaining in layman’s terms what the algorithm is, in smaller, more digestible pieces. So stay tuned…

Are there any shortcomings to Ark?

Nothing is perfect :). We hope the community can give us feedback on what we’ve done. The biggest shortcoming of Ark is that it currently does not address replica set reconfigurations. The Raft paper does. We have not addressed reconfigurations yet because we are looking to improve this area in steps. In short, we are busy and need to manage this project in incrementally. We want to get this important first step of fixing majority write concern and other important issues done, before addressing what we hope is a more rare scenario of handling configuration changes during network partitions.

What can the community do?

In short, give us feedback, any feedback. If there is feedback from reading the tech report, we’d love to hear it. If there is feedback from reading the code, we’d love to hear it. If anyone would like to run the code, please let us know and we will happily provide a not-for-production binary containing the new algorithm.

So, in short, tell us anything.

Tags: , , , , , .

4 Responses to Introducing Ark: A Consensus Algorithm For TokuMX and MongoDB

  1. Wow. Thanks for documenting a few serious flaws. And more thanks for fixing them. Do you include Jepsen tests in your QA?

  2. Andy says:

    Why is Ark limited to MongoDB? Will it be used for TokuDB on MySQL?

    • Zardosht Kasheff says:

      Ark is designed for MongoDB because it is built on top of MongoDB’s existing replication and HA algorithms.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>