In my post from three weeks ago, I explained why the semantics of normal ad-hoc insertions with a primary key are expensive because they require disk seeks on large data sets. Towards the end of the post, I claimed that it would be better to use “replace into” or “insert ignore” over normal inserts, because the semantics of these statements do NOT require disk seeks. In my post last week, I explained how the command “replace into” can be fast with TokuDB’s fractal trees. Today, I explain how “insert ignore” can be fast, using a strategy that is very similar to what we do with “replace into”.
The semantics of “insert ignore” are similar to that of “replace into”:
- if the primary (or unique) key does not exist: insert the new row
- if the primary (or unique) key does exist: do nothing
B-trees have the same problem with “insert ignore” that they have with “replace into”. They perform a lookup of the primary key, incurring a disk seek. We have already shown how fractal trees do not incur this disk seek for “replace into”, so let’s see how we can avoid disk seeks with “insert ignore”.
The only difference with “replace into” is when the primary (or unique) key exists, instead of overwriting the old row with the new row, we disregard the new row. So, all we need to do is tweak our tombstone messaging scheme (that we use for deletes and “replace into”) so that when “insert ignore” commands do not overwrite old rows with new rows. Similar to deletes and replace into, with this scheme, “insert ignore” can be two orders of magnitude faster than insertions into a B-tree.
Here is what we do. We insert a message into the fractal tree, with a new message “ii”, to signify that we are doing an “insert ignore”. The only difference between this message and the normal “i” message for insertions is what we do on queries and merges. On queries, if the message is an “ii”, then the value in the LOWER node is read, and not the higher node. On merges, if the higher node has a message of “ii”, the value in the LOWER node takes precedence over the value in the higher node.
Let’s look at an example that is similar to what we looked at for “replace into”:
create table foo (a int, b int, primary key (a));
Suppose the fractal tree for this table looks as follows:
- - - - - - - .... (i (1,1)) (i (2,2)) (i (3,3)) (i (4,4)) ... (i (1000,1000)) ... (i (2^32, 2^32))
The ‘i’ stands for insertion message. Now suppose we do:
insert ignore into foo values (1000, 1001).
With fractal trees, we insert (ii (1000,1001)) into the top node. The tree then looks as such:
(ii (1000,1001)) - - - - - - .... (i (1,1)) (i (2,2)) (i (3,3)) (i (4,4)) ... (i (2^32, 2^32))
So upon querying the key ’1000′, a cursor notices that (1000,1001) has a message of “ii”. If it finds another value for the key 1000 in a lower node, it reads that value, otherwise, it reads (1000,1001). Because (1000,1000) is located in a lower node, the cursor returns (1000,1000) to the user. On merges, the message in the lower node, (1000,1000) overwrites the message in the higher node, (1000,1001).
While “insert ignore” can be fast, there are caveats (indexes, triggers, replication), just as there are with “replace into”. In a future posting, I will get into some of them.