My Favorite MongoDB Replication Feature: Crash Safety
At an extremely high level, replication in MongoDB and MySQL is similar. Both databases have exactly one machine, the primary (or master), that accepts writes from clients. With a single transaction (or atomic operation, in MongoDB’s case), the tables and oplog (or binary log, in MySQL) are modified to reflect the change. The log captures what the change is so that secondaries (or slaves) can read and apply the changes, making the slaves identical to the master. (Note that I am NOT talking about multi-master replication.)
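The loop described above can be sketched as a toy simulation. The entry format, names, and functions here are simplified illustrations, not the actual MongoDB oplog format or MySQL binlog events:

```python
# Toy model of the replication loop: the primary applies each change to
# its data AND appends it to a log; a secondary tails the log from its
# last applied position and re-applies the entries.

def apply_entry(data, entry):
    """Re-apply one logged change to a data dictionary."""
    if entry["op"] == "insert":
        data[entry["key"]] = entry["value"]
    elif entry["op"] == "delete":
        data.pop(entry["key"], None)

primary_data, primary_oplog = {}, []

def primary_write(op, key, value=None):
    # The primary modifies its "tables" and logs the change.
    entry = {"op": op, "key": key, "value": value}
    apply_entry(primary_data, entry)
    primary_oplog.append(entry)

def secondary_sync(secondary_data, applied_upto):
    # The secondary reads entries it has not yet applied and applies
    # them, converging toward the primary's state.
    for entry in primary_oplog[applied_upto:]:
        apply_entry(secondary_data, entry)
    return len(primary_oplog)  # new position for the next sync
```

Replaying the same entries in the same order is what makes the slave identical to the master.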
Underneath the covers, their implementations are quite different. And in peeking underneath the covers while developing TokuMX, I learned more about my favorite thing in MongoDB replication: crash safety.
What do I mean by crash safety? After a crash, the user is guaranteed that the state of the data in collections is in sync with the state of the oplog.
Now, to be fair, as of 5.6, MySQL replication is also crash safe. What I’m saying I like is how MongoDB (and as a result, TokuMX) went about making replication crash safe: by making the oplog another collection that is handled by its storage system.
In MySQL, unlike MongoDB, the binary log is essentially a separate storage engine from the default engine, InnoDB. The binary log is an append-only log that stores changes in flat files. Because the binary log is a separate storage engine, keeping InnoDB (or TokuDB) in sync with it requires two-phase commit.
Two-phase commit can be really expensive. Suppose we are using TokuDB for MySQL. To make a transaction durable with binary logging on, we must:
- Tell TokuDB to “prepare” the transaction for commit. This is phase one and requires an fsync of TokuDB’s recovery log.
- Write the data to and fsync the binary log.
- Tell TokuDB the commit has completed, which, as of MySQL 5.5, requires another fsync. I think this last fsync requirement is removed in MySQL 5.6.
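The three steps above can be sketched as a small simulation. The file names and helper function here are hypothetical stand-ins, not TokuDB or MySQL APIs; `engine_log` plays the role of TokuDB’s recovery log and `binlog` the role of MySQL’s binary log:

```python
import os

def fsync_append(path, payload):
    """Append a record to a log file and fsync it (one disk flush)."""
    with open(path, "ab") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())

def commit_with_binlog(engine_log, binlog, txn_id, changes):
    """Model the three-fsync commit sequence described above."""
    fsyncs = 0
    # Phase one: the engine "prepares" the transaction, which requires
    # an fsync of its recovery log.
    fsync_append(engine_log, b"PREPARE %d: %s\n" % (txn_id, changes))
    fsyncs += 1
    # The coordinator writes the transaction to the binary log and
    # fsyncs it.
    fsync_append(binlog, b"TXN %d: %s\n" % (txn_id, changes))
    fsyncs += 1
    # Phase two: the engine records that the commit completed -- another
    # fsync (through MySQL 5.5).
    fsync_append(engine_log, b"COMMIT %d\n" % txn_id)
    fsyncs += 1
    return fsyncs
```

The point of the sketch is only to count the flushes: three separate fsyncs, on two separate logs, for one committed transaction.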
That is three fsyncs to commit a single transaction on the MySQL master, just to ensure crash safety. Even if your MySQL application does not require strict durability (say, you have changed the default value of innodb_flush_log_at_trx_commit), these fsyncs must still happen for each transaction commit just to get crash safety. Baron Schwartz discusses these issues a bit in this (admittedly old) post. Kristian Nielsen digs into some of the challenges of improving InnoDB over four blog posts.
As for MySQL slaves, up until MySQL 5.6, they simply were not crash-safe. Stephane Combaudon wrote a great post explaining the challenges prior to MySQL 5.6 and the solutions implemented. Reading the post, one sees the problem: critical data, namely the I/O thread position and SQL thread position, was NOT stored in an (InnoDB or TokuDB) table, yet had to be kept in sync with data that was stored in tables. The solution was to store this metadata in tables so the storage engine can keep everything in sync.
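If I recall correctly, this is what the MySQL 5.6 server options for crash-safe slaves look like in practice; a fragment along these lines (option names are the actual 5.6 server variables, the file layout is illustrative) moves the replication positions into InnoDB tables:

```ini
# my.cnf fragment (MySQL 5.6+): store replication metadata in tables
# instead of flat files, so the engine keeps positions in sync with data.
[mysqld]
master_info_repository    = TABLE
relay_log_info_repository = TABLE
relay_log_recovery        = ON
```

With the positions in transactional tables, a crashed slave can recover to a consistent point instead of replaying from a stale file-based position.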
Chained replication also presented a performance challenge. Up until MySQL 5.5, slave replication was single-threaded. Making a slave that also acts as a master (and therefore maintains a binary log of its own) crash safe would require two-phase commit, and performing two-phase commit on a single thread would hurt performance.
Lastly, MySQL must have a lot more logic to ensure correctness. Storage engines are required to implement two-phase commit, and MySQL implements a transaction manager to ensure that recovery puts the binary log and table data in sync.
In summary, distributed transactions with multiple storage engines are expensive. They take time to develop, are hard to get right, and hard to get performing well.
MongoDB (and as a result, TokuMX), on the other hand, has none of these problems. Because the oplog is just another collection handled by the storage engine, keeping it consistent with collections is trivial. The same transaction that modifies collections also writes to the oplog. When that transaction commits and is made durable, all the data is committed together. The only fsync required is if you care about making each individual transaction durable (which, if you are using automatic failover, makes little sense). No two-phase commit is required and no transaction manager is required. All machines in the replica set, regardless of topology (e.g. with chaining), are crash safe. The cost is low, and the solution seems much simpler to implement.
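A toy sketch of this idea (this is my illustration, not MongoDB’s or TokuMX’s actual code): when the oplog is just another collection in the same transactional store, one commit covers the data write and its oplog entry, so there is nothing for recovery to reconcile.

```python
import copy

class TinyStore:
    """Toy transactional store; names and structure are hypothetical."""
    def __init__(self):
        # The oplog is just another collection in the same store.
        self.collections = {"users": [], "oplog": []}

    def commit(self, txn_writes):
        """Apply all writes atomically: all of them land, or none do."""
        snapshot = copy.deepcopy(self.collections)
        try:
            for coll, doc in txn_writes:
                snapshot[coll].append(doc)
        except KeyError:
            return False  # bad write: nothing is partially applied
        self.collections = snapshot  # publish everything at once
        return True

def insert(store, doc):
    # One transaction carries both the collection write and the oplog
    # record describing it -- no two-phase commit needed.
    return store.commit([
        ("users", doc),
        ("oplog", {"op": "i", "ns": "test.users", "o": doc}),
    ])
```

Because both writes live or die with the same commit, a crash can never leave the collections ahead of (or behind) the oplog.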
For TokuMX, developing replication this way on top of MongoDB’s design was definitely simpler.
The potential downside to this solution is the cost of adding data to the oplog. Appending data to a flat file, as MySQL does with the binary log, is really cheap. Writing data to a MongoDB capped collection may be more expensive. Writing data to a TokuMX oplog, which is not a capped collection, is definitely more expensive. However, in our experiments, the cost of doing a few writes in succession to the oplog usually pales (in most workloads) in comparison to the cost of making the modifications to the actual collections (corner cases likely exist where this is not true).