My Favorite MongoDB Replication Feature: Crash Safety

Posted On March 18, 2014 | By Zardosht Kasheff | 8 comments

At an extremely high level, replication in MongoDB and MySQL are similar. Both databases have exactly one machine, the primary (or master), that accepts writes from clients. With a single transaction (or atomic operation, in MongoDB’s case), the tables and oplog (or binary log in MySQL) are modified to reflect the change. The log captures what the change is so other secondaries (or slaves) can read the changes and process them, making the slaves identical to the master. (Note that I am NOT talking about multi-master replication.)

Underneath the covers, their implementations are quite different. And in peeking underneath the covers while developing TokuMX, I learned more about my favorite thing in MongoDB replication: crash safety.

What do I mean by crash safety? After a crash, the user is ensured that the state of the data in collections is in sync with the state of the oplog.

Now, to be fair, as of 5.6, MySQL replication is also crash safe. What I’m saying I like is how MongoDB (and as a result, TokuMX) went about making replication crash safe: by making the oplog another collection that is handled by its storage system.

Unlike MongoDB, in MySQL, the binary log is essentially a different storage engine than its default engine, InnoDB. The binary log is an append only log that stores changes in flat-files. Because the binary log is a different storage engine, keeping InnoDB (or TokuDB) in sync with the binary log requires two-phase commit.

Two-phase commit can be really expensive. Suppose we are using TokuDB for MySQL. To make a transaction durable with binary logging on, we must:

  • Tell TokuDB to “prepare” the transaction for commit. This is phase one and requires an fsync of TokuDB’s recovery log.
  • Write the data to and fsync the binary log.
  • Tell TokuDB the commit has completed, which as of MySQL 5.5, requires another fsync. I think this last fsync requirement is removed from MySQL 5.6.

That is three fsyncs to commit a single transaction on the MySQL master, just to ensure crash-safety. If your MySQL application does not require strict durability, by changing the default value of innodb_flush_log_at_trx_commit, just to get crash safety, you still need to have these fsyncs happening for each transaction commit. Baron Schwartz discusses these issues a bit in this (admittedly old) post. Kristian Nielsen digs into some of the challenges of improving InnoDB over four blog posts.

As for MySQL slaves, up until MySQL 5.6, they simply were not crash-safe. Stephane Combaudon wrote a great post explaining the challenges prior to MySQL 5.6 and the solutions implemented. Reading the post, one sees the problem was needing to keep critical data, the IO thread position and SQL thread position, that was NOT stored in a (InnoDB or TokuDB) table in sync with data that was stored in tables. The solution was to store this data with tables so the storage engine can keep them in sync.

Chained replication also presented a performance challenge. Up until MySQL 5.5, slave replication was single-threaded. Having the slave that is also acting as a master and therefore maintaining a binary log be crash safe would require two-phase commit, but doing it on a single thread would hurt performance.

Lastly, MySQL must have a lot more logic to ensure correctness. Storage engines are required to implement two-phase commit, and MySQL implements a transaction manager to ensure that recovery puts the binary log and table data in sync.

In summary, distributed transactions with multiple storage engines are expensive. They take time to develop, are hard to get right, and hard to get performing well.

MongoDB (and as a result, TokuMX), on the other hand, does not have any of these problems. By making the oplog just another collection that uses their storage engine, making the oplog consistent with collections is trivial. The same transaction that modifies collections also writes to the oplog. When that transaction commits and made durable, all the data is committed together. The only fsync required is if you care about having each transaction be durable (which, if you are using automatic failover, makes little sense). No two-phase commit is required and no transaction manager is required. All machines in the replica set, regardless of the topology (e.g. using chaining) are crash safe. The cost is low, and the solution seems much simpler to implement.

For TokuMX, developing replication this way on top of MongoDB’s design was definitely simpler.

The potential downside to this solution is the cost of adding data to the oplog. Appending data to a flat file, as MySQL does with the binary log, is really cheap. Writing data to a MongoDB capped collection may be more expensive. Writing data to a TokuMX oplog, which is not a capped collection, is definitely more expensive. However, in our experiments, this cost of doing a few writes in succession to the oplog usually (in most workloads) pale in comparison to the cost of making the modifications to the actual collections (corner cases likely exist where this is not true).

8 thoughts

  1. MySQL slaves have been crash safe for me since 2007 or 2008.

    1. Zardosht Kasheff says:

      I am interested in reading about how this was technically achieved. Are there links describing this?

      1. rpl_transaction_enabled – first in the Google patch, improved in the Facebook patch, now deprecated. Slave state was written into a page in the system tablespace as the page had some free space. That page was consistent with anything changed in InnoDB. So it was correct for both roll back and roll forward.

        https://code.google.com/p/google-mysql-tools/wiki/TransactionalReplication
        https://www.facebook.com/note.php?note_id=390420710932

  2. Dharshan says:

    Hi Zardosht,

    Thank you for the detailed post. When you say “The same transaction that modifies collections also writes to the oplog” do you mean MongoDB transactions or some other internal transaction mechanism. Aren’t MongoDB transactions restricted to a single document?

    1. Zardosht Kasheff says:

      Yes, MongoDB transactions are restricted to a single document, but the work done under the transaction includes updating the oplog. Internally, the act of writing to the data heap, secondary indexes, the oplog, and journal can be thought of logically as a transaction.

  3. Matt says:

    Hi Zardosht.
    Thanks for nice article.

    I have one question about this internal two – phase commit of MySQL(+TokuDB).
    According to TokuDB ft-index source code (https://github.com/Tokutek/ft-index/blob/master/ft/txn/txn.cc),

    toku_txn_prepare_txn() and toku_txn_commit_txn() have a little bit different sync mode.

    void toku_txn_prepare_txn (TOKUTXN txn, TOKU_XA_XID *xa_xid) {
    ….
    txn->do_fsync = (txn->force_fsync_on_commit || txn->roll_info.num_rollentries>0);
    ….
    }

    // toku_txn_commit_txn() –> toku_txn_commit_with_lsn()
    int toku_txn_commit_with_lsn(TOKUTXN txn, int nosync, LSN oplsn,
    TXN_PROGRESS_POLL_FUNCTION poll, void *poll_extra)
    {
    ….
    txn->do_fsync = !txn->parent && (txn->force_fsync_on_commit || (!nosync && txn->roll_info.num_rollentries>0));
    ….
    }

    As you can see, toku_txn_commit_txn() take into account nosync parameter and nosync parameter is actually determined by tokudb_commit_sync system variables. So If user set “tokudb_commit_sync=OFF” then toku_txn_commit_txn() is not call fsync() for toku redo log.
    But toku_txn_prepare_txn() is take into account this system variable. So toku_txn_prepare_txn() always call fsync() for tokudb redo log even though user set “tokudb_commit_sync=OFF”.

    Is there any reason that prepare() does always call fsync() for tokudb redo log ?

    1. Matt says:

      Sorry, there’s a important typo.

      >> But toku_txn_prepare_txn() is take into account this system variable. So toku_txn_prepare_txn() always call fsync() for tokudb redo log even though user set “tokudb_commit_sync=OFF”.

      But toku_txn_prepare_txn() is **NOT** take into account this system variable(tokudb_commit_sync). So toku_txn_prepare_txn() always call fsync() for tokudb redo log even though user set “tokudb_commit_sync=OFF”.

    2. Zardosht Kasheff says:

      Can you please send this to our tokudb-dev google group?

Leave a Reply

Your email address will not be published. Required fields are marked *