Don’t worry about embedding large arrays in your MongoDB documents
In this post, I’d like to discuss some recently raised performance problems with MongoDB’s embedded arrays, and how TokuMX avoids these problems and delivers more consistent performance for MongoDB applications.
In “Why shouldn’t I embed large arrays in my documents?”, Asya Kamsky of MongoDB explains why you shouldn’t embed large arrays in your MongoDB documents. It’s a great article and an in-depth study of some of the decisions made in MongoDB, and how those decisions affect its behavior and rules for effective usage.
Asya gives three main reasons why large embedded arrays are harmful:
If the array grows frequently, it will grow the containing document, causing the document to be moved on disk instead of being rewritten in place. It’s common knowledge that MongoDB “document moves” are slow, because every index must be updated to point at the document’s new location.
If the array field is indexed, one document in the collection is responsible for a separate entry in that index for each and every element in its array. So inserting or deleting a document with a 100-element array, if that array is indexed, is like inserting or deleting 100 documents in terms of the amount of indexing work required.
Asya doesn’t say this explicitly, but alludes to it: BSON documents are traversed with a linear scan through memory, so finding elements all the way at the end of a large array takes a long time, and most operations dealing with such a document would be slow.
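To make the second point concrete, here is a toy Python model of the index fan-out. It is only a sketch, not MongoDB’s actual index code: the function name `index_entries` and the document shape are made up for illustration, and it simply counts how many (key, document id) entries a single document contributes when one of its fields is indexed.

```python
def index_entries(doc, field):
    """Return the (key, _id) entries a toy index would hold for one document."""
    value = doc.get(field)
    # In this toy model an array field contributes one entry per element,
    # while a scalar field contributes exactly one entry.
    keys = value if isinstance(value, list) else [value]
    return [(key, doc["_id"]) for key in keys]


# Hypothetical documents: one with a scalar field, one with a 100-element array.
small = {"_id": 1, "tags": "a"}
large = {"_id": 2, "tags": [f"tag{i}" for i in range(100)]}

print(len(index_entries(small, "tags")))
print(len(index_entries(large, "tags")))
```

Inserting or deleting the second document touches 100 index entries in this model, which is the “like inserting or deleting 100 documents” cost described above.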
TokuMX mitigates the first two of these problems with large arrays:
TokuMX uses logical identifiers for documents in secondary indexes. This means that no matter how you update a document, you only incur index maintenance on the indexed fields that actually change. Documents in TokuMX don’t incur a “move penalty” if they grow; they simply grow in the primary key index and that’s that.
TokuMX has unmatched indexed insertion speed, so it can easily handle indexing even very large arrays. Asya also mentions that all of this indexing work must be done atomically, alluding to the fact that it would block any other operations while the index maintenance is happening. TokuMX, however, supports concurrent reads and writes. Indexing a large array is a strange thing to do, and most of the time the data should probably be modeled differently, but it’s nice to know that TokuMX can handle it if you need it.
TokuMX doesn’t change the BSON layout of documents in memory, so manipulations of large arrays can still be slow, but certainly no slower than the same manipulations in MongoDB. If you need to do these kinds of slow calculations, TokuMX eliminates the problematic database-level write lock, so you get better concurrency in TokuMX than in MongoDB, and these problems won’t affect other clients on the system.
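The difference between physical and logical document references in the first point can be sketched with another toy Python model. Neither class reflects the real storage engines: the class names, the `index_writes` counter, and the assumption that every growth forces a move are all simplifications chosen to show the worst case.

```python
class PhysicalRefStore:
    """Toy store whose secondary index points at a document's storage slot."""

    def __init__(self):
        self.slots = {}        # slot number -> document
        self.by_name = {}      # secondary index: name -> slot number
        self.next_slot = 0
        self.index_writes = 0  # how many times the secondary index was written

    def _place(self, doc):
        slot = self.next_slot
        self.next_slot += 1
        self.slots[slot] = doc
        self.by_name[doc["name"]] = slot
        self.index_writes += 1
        return slot

    def insert(self, doc):
        return self._place(doc)

    def grow(self, slot, item):
        # Worst-case assumption: growing always moves the document to a new
        # slot, so the secondary index entry has to be rewritten as well.
        doc = self.slots.pop(slot)
        doc["items"].append(item)
        return self._place(doc)


class LogicalRefStore:
    """Toy store whose secondary index points at the document's primary key."""

    def __init__(self):
        self.primary = {}      # _id -> document
        self.by_name = {}      # secondary index: name -> _id
        self.index_writes = 0

    def insert(self, doc):
        self.primary[doc["_id"]] = doc
        self.by_name[doc["name"]] = doc["_id"]
        self.index_writes += 1

    def grow(self, _id, item):
        # "name" is unchanged, so the secondary index is left alone.
        self.primary[_id]["items"].append(item)


def new_doc():
    return {"_id": 1, "name": "alice", "items": []}


physical, logical = PhysicalRefStore(), LogicalRefStore()
slot = physical.insert(new_doc())
logical.insert(new_doc())
for i in range(5):
    slot = physical.grow(slot, i)
    logical.grow(1, i)

print(physical.index_writes, logical.index_writes)
```

In this model the physically addressed store rewrites its secondary index on every growth, while the logically addressed store writes it once at insert time, because the array is not an indexed field.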
We think embedded arrays are a great feature of the data model that contributes to the wonderful productivity developers can achieve with MongoDB. With TokuMX, you can use them fearlessly, stop worrying about database performance, and just make great products.