Making “Replace Into” Fast, by Avoiding Disk Seeks
In this post two weeks ago, I explained why the semantics of normal ad-hoc insertions with a primary key are expensive because they require disk seeks on large data sets. Towards the end of the post, I claimed that it would be better to use “replace into” or “insert ignore” over normal inserts, because the semantics of these statements do NOT require disk seeks. In this post, I explain how the command “replace into” can be fast with fractal trees.
The semantics of “replace into” are as follows:
- if the primary (or unique) key does not exist, insert the new row
- if the primary (or unique) key does exist, overwrite the existing row with the new row
The slow, expensive way B-trees use to implement these semantics are:
- look up the primary (or unique key), to verify its existence
- if it does not exist, insert the new row, otherwise overwrite the existing row
The first step incurs a disk seek. That slows down performance considerably.
Instead, with TokuDB’s fractal tree data structure, we can follow a similar strategy to what we use for deletes. Recall that for deletes, we do not look up the key we wish to delete, and physically remove its data from the tree. Instead, we use tombstone messaging, or tombstone deletions, to defer the physical removal of the data to a better time. The same idea applies here. Instead of searching for the primary key, as we are forced to do with B-trees, we insert an insertion message into the fractal tree, and defer the existence check for later.
Let us look at an example. Take the following table:
create table foo (a int, b int, primary key (a));
Suppose the fractal tree for this table looks as follows:
- - - - - - - .... (i (1,1)) (i (2,2)) (i (3,3)) (i (4,4)) ... (i (1000,1000)) ... (i (2^32, 2^32))
The ‘i’ stands for insertion message. Now suppose we do:
replace into foo values (1000, 1001).
With fractal trees, we simply insert (i (1000,1001)) into the top node. The tree then looks as such:
(i (1000,1001)) - - - - - - .... (i (1,1)) (i (2,2)) (i (3,3)) (i (4,4)) ... (i (2^32, 2^32))
Similar to deletes, with this scheme, “replace into” can be two orders of magnitude faster than insertions into a B-tree.
On queries, a message in a higher node overrides messages in lower nodes. So upon querying the key ‘1000’, a cursor notices that (1000,1001) is located higher than (1000,1000), and therefore returns (1000,1001) to the user. On merges, the message in the higher node overwrites the message in the lower node.
So, by using messages, fractal trees can achieve the same performance boost for “replace into” as it does for insertions. In fact, using “replace into” in the manner above is how a customer has achieved an 80x speedup under actual field conditions. The details can be found here.
Next week, I explain how “insert ignore” can similarly be fast. Also, while this shows how “replace into” can be fast (and IS fast for a lot of scenarios we see with TokuDB), there are some caveats (with indexes, triggers, and replication). I will get into those in a couple of weeks.