Comparing a TokuMX and MongoDB Oplog Entry

Posted On March 26, 2014 | By Zardosht Kasheff | 0 comments

As I mentioned in my last post, TokuMX replication is completely incompatible with MongoDB replication. Replica sets (and sharded clusters, but that is for another blog) must be either entirely TokuMX or entirely MongoDB. This is by design. While elections and failover are basically the same, we have completely changed the oplog protocol.

In the next series of posts, I will describe how TokuMX replication keeps data in sync between machines in the replica set. In doing so, I will address the challenges we faced and algorithms we developed to address them. With so much that has changed, I think the best way to go about understanding how TokuMX replication now works is to first understand what the oplog looks like.

So, let’s peek at a sample TokuMX oplog entry, and compare it to MongoDB’s oplog entry.

Suppose I run the following update that modifies two documents:

rs0:PRIMARY> db.foo.update({a:1},{$inc : {b:1}}, {multi:true})

In MongoDB, this generates the following two oplog entries:


{
       "ts" : Timestamp(1395630045, 1),
       "h" : NumberLong("-5671976232760685793"),
       "v" : 2,
       "op" : "u",
       "ns" : "test.foo",
       "o2" : {
               "_id" : 1
       },
       "o" : {
               "$set" : {
                       "b" : 11
               }
       }
}
{
       "ts" : Timestamp(1395630045, 2),
       "h" : NumberLong("-4250692499231572273"),
       "v" : 2,
       "op" : "u",
       "ns" : "test.foo",
       "o2" : {
               "_id" : 2
       },
       "o" : {
               "$set" : {
                       "b" : 21
               }
       }
}

With TokuMX, performing that update generates the following single oplog entry:

{
       "_id" : BinData(0,"AAAAAAAAAAEAAAAAAAAABQ=="),
       "ts" : ISODate("2014-03-24T03:18:05.181Z"),
       "h" : NumberLong("2597381443224792352"),
       "a" : true,
       "ops" : [
               {
                       "op" : "ur",
                       "ns" : "test.foo",
                       "pk" : {
                               "" : 1
                       },
                       "o" : {
                               "_id" : 1,
                               "a" : 1,
                               "b" : 10
                       },
                       "m" : {
                               "$inc" : {
                                       "b" : 1
                               }
                       }
               },
               {
                       "op" : "ur",
                       "ns" : "test.foo",
                       "pk" : {
                               "" : 2
                       },
                       "o" : {
                               "_id" : 2,
                               "a" : 1,
                               "b" : 20
                       },
                       "m" : {
                               "$inc" : {
                                       "b" : 1
                               }
                       }
               }
       ]
}

Right off the bat, you’ll notice several differences. What I’d like to do now is introduce what each piece of the TokuMX oplog entry means.

  • _id: This field is our global transaction ID, or GTID for short. The GTID identifies a position in the oplog. When converted to hex, it is two sequence numbers concatenated together. This position is used in the election protocol to determine which machine is furthest ahead.
  • ts: The point in time when this oplog entry was generated. In TokuMX, this field is used for diagnostic purposes only. One can look at the timestamp of the last GTID to get a sense of how much lag exists in a replica set. Unlike MongoDB, this field is never used to determine oplog position in election protocols. In MongoDB, this field is a “timestamp” field that consists of a timestamp and auto increment value. In the example above, MongoDB’s two oplog entries have the same timestamp (1395630045), but different auto increment values (1 vs. 2). In MongoDB, this field kinda/sorta works as a GTID.
  • h: This field serves the same purpose in TokuMX as it does in MongoDB, to act as a hash of the oplog stream. Members of a replica use this hash to verify that their data belongs in the replica set and that a rollback is not required.
  • a: This field states if an oplog entry has been applied to the machine. On primaries, this value is always true, because whatever transaction modifies collections also modifies the oplog. On secondaries, for a short period of time, this value is false. With one transaction, we copy the oplog data over and set the value of “a” to false, and in another transaction, we apply the data and update the value of “a” from false to true. A natural question is “why do we do this?” I will explain In a future post.
  • ops: An array of operations that represent the work done by the transaction. Batched inserts and updates/deletes with {multi : true} may generate oplog entries with more than one operation in the ops array. The above sample entry is one example. You’ll note that the data within the operations has a different format than the “op” entry of the two MongoDB oplog entries. That is for a future post as well.

Not every TokuMX oplog entry will have an “ops” field. The reason is that sometimes a large transaction does more work than can fit (or we would want to fit) in a single oplog entry. For such transactions, an oplog entry may look as follows:

{
       "_id" : BinData(0,"AAAAAAAAAAEAAAAAAAAABQ=="),
       "ts" : ISODate("2014-03-24T03:18:05.181Z"),
       "h" : NumberLong("2597381443224792352"),
       "a" : true,
       "ref" : ObjectId("532fa922daaf6e2b4e0ceea5")
}

Note we now have a field named “ref”. This field is a reference into the oplog.refs collection. In this synthetic example, the oplog.refs collection now has the following documents:

{
       "_id" : {
               "oid" : ObjectId("532fa922daaf6e2b4e0ceea5"),
               "seq" : NumberLong(3)
       },
       "ops" : [
               {
                       "op" : "ur",
                       "ns" : "test.foo",
                       "pk" : {
                               "" : 1
                       },
                       "o" : {
                               "_id" : 1,
                               "a" : 1,
                               "b" : 11
                       },
                       "m" : {
                               "$inc" : {
                                       "b" : 1
                               }
                       }
               }
       ]
}
{
       "_id" : {
               "oid" : ObjectId("532fa922daaf6e2b4e0ceea5"),
               "seq" : NumberLong(5)
       },
       "ops" : [
               {
                       "op" : "ur",
                       "ns" : "test.foo",
                       "pk" : {
                               "" : 2
                       },
                       "o" : {
                               "_id" : 2,
                               "a" : 1,
                               "b" : 21
                       },
                       "m" : {
                               "$inc" : {
                                       "b" : 1
                               }
                       }
               }
       ]
}

The _id field has two parts, the reference that was stored in the oplog.rs collection, and a sequence number to link different documents with the same reference. Please note that this example is very synthetic and not realistic. In reality, the array of ops in these entries will be quite large (by default, the first one should be at least 1MB). This provides a mechanism for storing the operations of large transactions that cannot be stored in a single entry

Hopefully, at this point, the TokuMX oplog is understood. You may wonder why we did some of what we did, and I will hopefully address all of that as these series of posts progress.

 

Leave a Reply

Your email address will not be published. Required fields are marked *