Disk seeks are evil, so let’s avoid them, pt. 3 (Deletions)
As mentioned in parts 1 and 2, disk seeks are bad because they slow down performance. Fractal tree data structures minimize disk seeks on ad-hoc insertions, whereas B-trees practically guarantee that disk seeks are performed on ad-hoc insertions. As a result, fractal tree data structures can insert data up to two orders of magnitude faster than B-trees can.
In this post, let’s examine deletions, and get an intuitive understanding of why fractal tree data structures can delete data up to two orders of magnitude faster than B-trees, the same advantage they have on insertions. In MySQL 5.1, this advantage is really eye-popping for TokuDB vs. InnoDB, because InnoDB does not use its insert buffer for deletions. I understand there is a delete buffer in 5.5, which I haven’t experimented with yet.
B-trees exhibit the same weakness on deletions as they do on insertions: they need to have the appropriate leaf node in memory. For large tables, bringing the leaf node into memory often requires a disk seek. Fractal tree data structures do not have this requirement.
Before going on, a clarification. In MySQL, delete statements have two steps: queries and value changes. For instance, the statement:
delete from foo where a=1;
must first query all rows where a=1 (the first step), and then proceed to remove the rows that are found (the second step). In this post, we focus on the second step. For storage engine developers, this is the function handler::delete_row. In a future post, I will analyze the first step, tie it together with this post, and show how deletions can be fast in MySQL with TokuDB.
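To make the two steps concrete, here is a minimal sketch in Python. It is a toy model, not MySQL's actual code path: the table is a plain list of dicts, and `delete_where` and `remove_row` are hypothetical names standing in for the query step and for the per-row work a storage engine does in handler::delete_row.

```python
def remove_row(table, row):
    # Step 2 for a single row: the value change itself.
    # (Hypothetical stand-in for the storage engine's per-row delete.)
    table.remove(row)


def delete_where(table, predicate):
    # Step 1: query for every row that matches the WHERE clause.
    matches = [row for row in table if predicate(row)]
    # Step 2: remove each row that the query found.
    for row in matches:
        remove_row(table, row)
    return len(matches)


# Toy version of: delete from foo where a=1;
foo = [{"id": 1, "a": 1}, {"id": 2, "a": 2}, {"id": 3, "a": 1}]
deleted = delete_where(foo, lambda row: row["a"] == 1)
print(deleted, foo)
```

The point of the split is that by the time step 2 runs, the full contents of each row are already known, which is what the rest of this post relies on.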
Back to deletions. Let’s analyze value changes. We know the contents of the row being deleted. So how can fractal tree data structures avoid an unnecessary disk seek? The answer: deletion messages (sometimes called tombstone deletes).
Suppose we have a fractal tree data structure with the following elements inserted: (1), (3), …, (999). Up until now, we have shown the fractal tree like this:
- - - - - - - ... 1 3 5 7 9 ... 999
In reality, the elements stored are not just keys, but rather (message, key) pairs. The message may be one of two operations: insertion or deletion. We represent an insertion with the message ‘i’. So, the fractal tree looks more like this:
- - - - - - - ... (i,1) (i,3) (i,5)... (i,999)
To delete an element, for example (5), we insert a deletion message into the tree, marking it with a ‘d’. So, after deleting (5), the fractal tree data structure looks like this:
(d,5) - - - - - - ... (i,1) (i,3) (i,5)... (i,999)
With this scheme, deletions are as fast as insertions, which is to say two orders of magnitude faster than insertions or deletions into a B-tree.
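Here is a rough sketch of why a deletion costs no more than an insertion. It is a toy model in Python, not TokuDB's implementation: `ToyMessageTree`, its two levels, and the `disk_reads` counter are all assumptions made for illustration. The lower level stands in for data on disk, and the top level for a small buffer in memory.

```python
class ToyMessageTree:
    """Toy model: an in-memory top buffer above an 'on-disk' lower level."""

    def __init__(self, keys):
        # Lower level: one insertion message per key, imagined to live on disk.
        self.lower = [("i", key) for key in keys]
        # Top level: newly arriving messages land here, in memory.
        self.top = []
        # Counts how often an operation had to touch the lower level.
        self.disk_reads = 0

    def insert(self, key):
        self.top.append(("i", key))

    def delete(self, key):
        # A deletion is just another message; the lower level is never read.
        self.top.append(("d", key))


tree = ToyMessageTree(range(1, 1000, 2))  # keys 1, 3, 5, ..., 999
tree.delete(5)
print(tree.top)         # [('d', 5)]
print(tree.disk_reads)  # 0
```

Note that (i,5) is still sitting in the lower level after the delete. Nothing had to be fetched to find it, which is exactly the seek a B-tree cannot avoid.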
On queries, a message in a higher node overrides messages in lower nodes. So upon querying (5), a cursor notices that (d,5) is located higher than (i,5), and therefore the key (5) does not exist in the fractal tree data structure. On merges, the deletion message and insertion message cancel each other out, and space is reclaimed.
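The query and merge rules can be sketched as follows. Again this is a simplified Python model under stated assumptions, not the real data structure: each level is a dict mapping a key to its message, `lookup` and `merge` are hypothetical names, and `merge` assumes the lower level is the bottom of the tree, so a deletion message has nothing further down left to cancel and can be dropped.

```python
def lookup(levels, key):
    """Return True if key exists. levels[0] is the highest node."""
    for level in levels:
        message = level.get(key)
        if message is not None:
            # The highest message for a key wins.
            return message == "i"
    return False


def merge(upper, lower):
    """Merge upper into lower, assuming lower is the bottom level."""
    merged = dict(lower)
    for key, message in upper.items():
        if message == "d":
            # The deletion cancels the insertion; both disappear.
            merged.pop(key, None)
        else:
            merged[key] = "i"
    return merged


levels = [{5: "d"}, {key: "i" for key in range(1, 1000, 2)}]
print(lookup(levels, 5))  # False: (d,5) sits above (i,5)
print(lookup(levels, 7))  # True

bottom = merge(levels[0], levels[1])
print(5 in bottom, len(bottom))  # False 499
```

After the merge, neither (d,5) nor (i,5) remains, which is the point at which the space is reclaimed.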
So, by using deletion messages and treating deletions like insertions, fractal tree data structures can achieve the same performance boost (two orders of magnitude) over B-trees.