High Anxiety Whenever You’re Near

Posted On April 26, 2009 | By Bradley C. Kuszmaul | 4 comments

Every time I visit the Sun Santa Clara Campus, I’m reminded of Mel
Brooks’s movie “High Anxiety”. The campus was known as The Great
Asylum for the Insane in the 19th century, and even includes a tower.

High Anxiety,
whenever you’re near.
High Anxiety,
it’s you that I fear.

I went to the MySQL Storage Engine (SE) Summit held on the Sun
campus in Santa Clara. I thought it was a great meeting, and many
thanks to Sanjay for inviting us. Also attending from Tokutek were
Zardosht and Tom. We heard interesting points of view from SE
implementers such as Akiba, ScaleDB, InnoDB, PBXT, and Virident, as
well as from the Sun/MySQL implementers. Here are a few highlights:

Everyone agrees that the Storage Engine (SE) API needs better
documentation.

The InnoDB team suggested that one approach to simplifying the SE
API is to have everyone program to something like the InnoDB embedded
API, which is cleaner than the SE API. I like that approach. For
TokuDB, we chose to program to the Berkeley DB API, and then adapted
the Berkeley DB SE handler for our purposes. The main difference
between the InnoDB embedded API and the Berkeley DB API that we use is
that InnoDB understands that rows are made of columns, whereas
Berkeley DB treats the rows as undifferentiated blobs. Unfortunately,
the InnoDB API would need some work before it could really solve the
problem. For example, various collation orders are not really
understood by the InnoDB embedded API without calls back into MySQL.
Another problem with simplifying the API this way is that the embedded
APIs don’t address sophisticated approaches such as QFP, BKA, and MRR
(all described below).

It sounds like Sun has a serious plan for incorporating innovations
from the effectively orphaned 6.0 branch back into the mainline for
5.4 and beyond. Each of the projects described below is essentially
tasked to “become beta-ready”, after which it can be incorporated into
the main line. Sun thinks they can handle about one major GA release
every 12 to 18 months.

Mikael and Serge talked about Query Fragment Pushdown (QFP), which
is one such innovation. The idea is to give the SE access to the
query so that the SE can make its own query plan or do larger chunks
of work independently. Part of the implementation plan is to define a
clean abstract syntax for queries and a clean description of databases
and indexes. When asked when this will appear in the 5.4 line,
the answer was basically “when it’s ready”.
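
To make the pushdown idea concrete, here is a minimal sketch in Python. It is not the QFP design presented at the summit: `Fragment`, `ToyEngine`, and `scan_fragment` are hypothetical names, and a real fragment would be an abstract syntax tree rather than a Python callable. It only shows why handing the SE a piece of the query lets it return fewer rows across the SE boundary.

```python
from dataclasses import dataclass
from typing import Callable, Iterator

Row = dict  # a row is modelled as a column-name -> value mapping


@dataclass
class Fragment:
    """Hypothetical query fragment: a projection plus a row predicate."""
    columns: list
    predicate: Callable[[Row], bool]


class ToyEngine:
    """Hypothetical storage engine that keeps its rows in a list."""

    def __init__(self, rows):
        self.rows = rows
        self.rows_returned = 0  # rows handed across the SE boundary

    def scan(self) -> Iterator[Row]:
        # Without pushdown: every row crosses the boundary,
        # and the server does the filtering.
        for row in self.rows:
            self.rows_returned += 1
            yield row

    def scan_fragment(self, frag: Fragment) -> Iterator[Row]:
        # With pushdown: the engine filters and projects on its own.
        for row in self.rows:
            if frag.predicate(row):
                self.rows_returned += 1
                yield {c: row[c] for c in frag.columns}


rows = [{"id": i, "price": i * 10, "name": f"item{i}"} for i in range(100)]
frag = Fragment(columns=["id"], predicate=lambda r: r["price"] >= 950)

plain = ToyEngine(rows)
server_side = [{"id": r["id"]} for r in plain.scan() if r["price"] >= 950]

pushed = ToyEngine(rows)
engine_side = list(pushed.scan_fragment(frag))
```

Both paths produce the same result, but `plain.rows_returned` counts every row in the table while `pushed.rows_returned` counts only the matching ones.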

Philip has been working on an independent testing framework for
SEs. I think it will make SE implementation a lot easier to have this
kind of test suite. Philip places a lot of emphasis on testing crash
recovery. Apparently many SEs claim to support recovery, but don’t
actually recover very well. I think that this testing framework is the
best answer to the SE API documentation problem. A test suite is
better than a specification. This work is basically ready to
distribute to SE implementers.

Sergey talked about multi-range read (MRR) and batched key access
(BKA). MRR is useful when you don’t have a covering index. The idea
is that instead of fetching each row as it steps through a range of
index entries, the system first scans the index to collect row ids,
and then sorts the row ids according to the clustering key of the
primary table. That way, the primary
data can be accessed in one sweep through the table. BKA reduces the
number of round trips between the MySQL join engine and the SE by
returning multiple values in a single call. When will MRR and BKA be
in the main line? “When they are ready.”
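
Both ideas can be sketched in a few lines of Python. This is a toy model, not MySQL's implementation: the table, the secondary index, the `seek_distance` cost model, and the `CountingEngine` class are all assumptions made up for illustration.

```python
# Toy table clustered on its primary key: a small pk distance stands in
# for a short seek on disk.
table = {pk: f"row-{pk}" for pk in range(1000)}

# Secondary index: (key, pk) pairs sorted by key. The multiplier 37 is
# coprime to 1000, so the keys are a permutation and the pks end up out
# of order relative to the keys.
secondary_index = sorted(((pk * 37) % 1000, pk) for pk in range(1000))


def index_range(lo, hi):
    """Primary keys whose secondary key lies in [lo, hi], in key order."""
    return [pk for key, pk in secondary_index if lo <= key <= hi]


def seek_distance(pks):
    """Crude cost model: total distance travelled between row fetches."""
    return sum(abs(b - a) for a, b in zip(pks, pks[1:]))


def naive_read(lo, hi):
    # Fetch each row as the index scan finds it.
    pks = index_range(lo, hi)
    return [table[pk] for pk in pks], seek_distance(pks)


def mrr_read(lo, hi):
    # MRR: collect the row ids first, sort by clustering key, then sweep.
    pks = sorted(index_range(lo, hi))
    return [table[pk] for pk in pks], seek_distance(pks)


class CountingEngine:
    """Hypothetical SE that counts how many calls cross its API."""

    def __init__(self):
        self.calls = 0

    def get(self, pk):
        self.calls += 1
        return table[pk]

    def get_batch(self, pks):
        # BKA-style: one round trip returns many rows.
        self.calls += 1
        return [table[pk] for pk in pks]


naive_rows, naive_cost = naive_read(100, 119)
mrr_rows, mrr_cost = mrr_read(100, 119)

outer_keys = list(range(0, 1000, 20))  # 50 join keys from the outer table
one_by_one = CountingEngine()
rows_a = [one_by_one.get(pk) for pk in outer_keys]
batched = CountingEngine()
rows_b = batched.get_batch(outer_keys)
```

The sorted sweep can never travel farther than the unsorted one, since it only moves in one direction. The batched lookup makes a single call where the row-at-a-time version makes one call per key.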

Lars talked about the backup interface. The backup interface seems
well organized. I would suggest that it could be made easier to use
for SE implementers. Currently the interface is
designed so that the backup module repeatedly calls the SE asking
“fill this block with data that should be backed up.” It would be
advantageous if the SE were given a callback so that the SE could say
“here’s a block of data that should be backed up.” Why is this
simpler? Consider the problem of backing up an in-memory binary tree.
Given the “backup_this_data” interface, here’s what the backup code
might look like inside the SE:

backup(node)
  if (node.is_leaf)
    backup_this_data(node.leaf_block);
  else 
    backup_this_data(node.internal_block);
    backup(node.left_child);
    backup(node.right_child);

Given the current interface, the SE would have to simulate the call
stack itself to implement this recursion, instead of simply relying on
the C++ call stack.

Basically, if an SE has a complex data structure, it is a great
help to not require the SE coder to implement the continuation that
remembers how to back up “the rest” of the database.
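
The difference between the two styles can be sketched as follows. Python is used here for brevity; a real SE would be written in C++. `Node`, `push_backup`, `PullBackup`, and `next_block` are hypothetical names, and the sketch hands back one block per call rather than filling a fixed-size buffer, so it simplifies the real interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    block: bytes
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def push_backup(node: Node, backup_this_data: Callable[[bytes], None]) -> None:
    """Push style: the SE calls the backup module, and the language's own
    call stack remembers what is left to do."""
    backup_this_data(node.block)
    for child in (node.left, node.right):
        if child is not None:
            push_backup(child, backup_this_data)


class PullBackup:
    """Pull style: the backup module calls next_block() repeatedly, so the
    SE must keep its own stack of unvisited nodes between calls."""

    def __init__(self, root: Node):
        self.pending = [root]  # the hand-rolled continuation

    def next_block(self) -> Optional[bytes]:
        if not self.pending:
            return None  # nothing left to back up
        node = self.pending.pop()
        # Push right before left so that left is visited first (pre-order).
        for child in (node.right, node.left):
            if child is not None:
                self.pending.append(child)
        return node.block


tree = Node(b"root", Node(b"L", Node(b"LL"), Node(b"LR")), Node(b"R"))

pushed = []
push_backup(tree, pushed.append)

pulled = []
puller = PullBackup(tree)
while (block := puller.next_block()) is not None:
    pulled.append(block)
```

Both versions visit the blocks in the same order. For a binary tree the explicit stack is small, but for a structure with more traversal state (cursors, partially emitted nodes, locks held) the pull version has to save and restore all of it by hand.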

I also think that the buffer should be provided by the SE, and the
backup module should copy data out of the buffer before returning to
the SE. This approach admits the possibility that the SE could
provide a pointer to an internal buffer which would then be written
directly to the backup file, avoiding any memcpy operations on the
data.
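
A minimal sketch of that buffer ownership, again in Python and again with made-up names (`ToyEngine`, `make_backup_writer`): the SE hands the callback a view of its own page, and the backup module consumes it before returning. The `BytesIO` object stands in for the backup file.

```python
import io


class ToyEngine:
    """Hypothetical SE whose internal pages are mutable buffers it owns."""

    def __init__(self, pages):
        self.pages = [bytearray(p) for p in pages]

    def backup(self, backup_this_data):
        for page in self.pages:
            # Hand out a view of the internal buffer, not a copy. The view
            # is released as soon as the callback returns.
            with memoryview(page) as view:
                backup_this_data(view)


def make_backup_writer(out):
    """Backup-module side: write the data out before returning to the SE."""
    def backup_this_data(view):
        out.write(view)
    return backup_this_data


out = io.BytesIO()
engine = ToyEngine([b"aaaa", b"bbbb"])
engine.backup(make_backup_writer(out))
```

Because the callback finishes with the view before it returns, the SE is free to modify or evict the page immediately afterwards.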

It’s not clear that our (Tokutek’s) users will want this kind of
backup module, however. They might prefer to perform backups at the
file system level. For TokuDB, InnoDB, or Berkeley DB, to perform
backup, one can essentially back up the database files and logs using
file system operations. (For InnoDB, one should probably use LVM
snapshots to make this work fast, but with TokuDB and Berkeley DB, one
can copy the files as they are being modified by the database.) It’s
hard to imagine that going through the SE will be anywhere near as
fast as simply copying the files.

For InnoDB, after taking a snapshot using LVM, and taking a backup
from the snapshot, InnoDB will need to perform recovery before the
backup will be useful. Since InnoDB sometimes takes a long time to
recover, it seems to me that one should perform recovery on the
snapshot before writing it to the backup media. That way, in the
event of a disaster, the recovered database will be ready to go sooner
rather than later.

It’s crucial that the backup system provide a copy that can be
“started up” quickly. It’s not clear that the backup interface can be
made to implement that functionality.

The backup interface will be ready when? “When it’s ready.”

Sergei talked about the considerable work going into improving the
query optimizer’s cost model. Much of this is already on the main
line. I would look for patches soon for the “explain optimization”
tool, which gives a detailed explanation of the optimizer’s choices.

I missed some of the other discussions, such as replication.

And remember folks
Be good to your parents
They’ve been good to you.

4 thoughts

  1. Mark Callaghan says:

    Thanks for the details. Where is the mythical storage engine independent functional test suite? The storage engine team at MySQL has been talking about this for about 1.5 years.

  2. Matthew Montgomery says:

    The benefit for TokuDB/BDB to implement this native backup driver is that it gives us the ability to integrate the replication state into the backup and to give a user with multiple engines in the same database a single, simple command with which to take their backup.

  3. Mark, we were promised access to the test suite “real soon”. Zardosht really wants it. Tom is bugging Sun even as we speak. I’m hoping they’ll deliver.

    -Bradley

