What’s new in TokuMX 1.4, Part 4: Smaller, faster sharded clusters

We just released version 1.4.0 of TokuMX, our high-performance distribution of MongoDB. There are a lot of improvements in this version (release notes), the most of any release yet. In this series of blog posts, we describe the most interesting changes and how they’ll affect users.

In the first part of this series, I introduced a new feature, the ability to define the primary key for a collection.  Today, you’ll see how we use it to reduce the disk footprint of sharded clusters.

TokuMX makes the shard key for a collection clustering by default.  This is because it is a good idea to make your shard key something that you would want to range query, and range queries on clustering keys are fast.  Picking a shard key is a hint to TokuMX that you will probably want to query on that index later.

In TokuMX 1.3 and earlier, this means that by default, you have two clustering indexes on each sharded collection: the {_id: 1} index, and your shard key, which means you’re storing every document twice, and every update needs to modify at least those two indexes.  TokuMX’s great compression and insert throughput mean that the size on disk will often be much smaller than the same data set would be in MongoDB anyway, but it’s still an unnecessary duplication and we knew we could do better.

This problem is completely solved with primary keys.  A primary key is just a clustering index that is used to store the primary copy of the data.  The default primary key is the _id index, but if a different primary key is specified, then you get all the benefits of the clustering index, and you still only have one copy of all the documents, they just have a different “$natural” order.

In TokuMX 1.4, the default behavior of sh.shardCollection() is to make the collection’s primary key be the shard key.  The first thing to know is that if the collection doesn’t exist yet, and the first thing you do is run a normal shardCollection command, you’ll get all the benefits of TokuMX sharding, including using a primary key.  Versions 1.3 and earlier of TokuMX don’t support primary keys though, so any shards that are running those versions will still have two copies of the data like before.

You can shard existing collections that have a different (even the default {_id: 1}) primary key and everything will behave normally, but this is usually not the best thing to do, with one exception: hashed indexes.

If you are going to use a hashed shard key, then you aren’t going to need to do migrations, and you can’t do range queries on that index, so there is no reason to cluster that index.  Instead, you should set noAutoSplit=true on all your mongos routers (because autosplitting will be very slow without a clustered shard key) and create the collection this way:

mongos> sh.shardCollection('test.foo', {_id: 'hashed'}, false, false)
{ "collectionsharded" : "test.foo", "ok" : 1 }
mongos> db.foo.getIndexes()
[
	{
		"key" : {
			"_id" : 1
		},
		"unique" : true,
		"ns" : "test.foo",
		"name" : "_id_",
		"clustering" : true
	},
	{
		"key" : {
			"_id" : "hashed"
		},
		"ns" : "test.foo",
		"name" : "_id_hashed"
	}
]

Creating an empty collection this way will pre-split the collection into enough chunks that you should never need to split a chunk, and you will not need to do any migrations unless you add or remove shards.

In version 1.4.0, TokuMX sharded clusters are now even smaller and faster.

Want to check out the newest version of TokuMX?  Download TokuMX 1.4.0 here:

MongoDB Download MongoDB Download

 

Tags: , , , , , , , .

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>