TokuDB Hot Backup – Part 2

Posted On September 19, 2013 | By Christian Rober | 6 comments

In my last post, I discussed the existing backup solutions for MySQL.  At the end I briefly discussed why the backup solutions for InnoDB do not apply to TokuDB.  Now I’m going to outline the backup solution we created.  Our solution works for both TokuDB and InnoDB.  It also has no knowledge of the log files and does not require any changes to either storage engine.  In fact, the library could be used with almost any process; it has no knowledge of what types of files are being backed up.

Shims

Tokutek’s Hot Backup is essentially a shim between the mysqld process and the operating system (Linux only, at this point).  It is a separately compiled C++ library that is linked into the mysqld binary at the end of its build process.  We ship this library with our own enterprise versions of both MySQL and MariaDB.

[Figure: hotbackupblog2_1]

The magic of this shim is that it intercepts all relevant file system calls made to the Linux kernel by mysqld.  Any file that is opened, read from, written to, renamed, unlinked, or closed by MySQL is intercepted by our hot backup library.  Directory creation and removal are also intercepted.  This is all transparent to MySQL.  Again, no changes to the core MySQL system were required to intercept these system calls, just the addition of the library at link time.

I should note, at this point, that the library does expose a few C-style functions and callbacks.  We added the appropriate plumbing and syntax to allow users to call this API and interact with the library.  Users can execute SQL commands that initiate a backup, throttle (slow down) a backup, and report backup progress and errors.  However, the key idea is the same: all of the changes made to database files pass through the backup library, without any configuration required by the user.

Because our backup library sees every file operation from the moment the process starts, it can keep the backup copy consistent with the original database files, even while read and write workloads are running.  We keep track of every open file and the file offsets used by mysqld for reads and writes.  To do this, we keep state in memory that mirrors the corresponding state in the file system.

Whenever mysqld makes a file system call, it actually calls our backup library instead.  Our library eventually makes the call to the actual file system on behalf of mysqld.  During this system call interception we create our own in-memory mirror of the file system state.  This includes the full path to the original file, the integer file descriptor associated with the respective file, and that file descriptor’s offset.
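To make the bookkeeping concrete, here is a minimal sketch in Python.  It is not the library's code (the real shim is C++ and is wired in at link time); the names `Shim` and `FileState` are hypothetical, and the wrappers simply stand in for the intercepted calls.  Each wrapper forwards to the real call and then updates a mirror holding the path and offset for each file descriptor.  The sketch assumes plain sequential I/O and ignores cases such as append mode.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class FileState:
    path: str    # full path to the original file
    offset: int  # mirrored offset of the file descriptor


class Shim:
    """Hypothetical stand-in for the interception layer."""

    def __init__(self):
        self.files = {}  # file descriptor -> FileState

    def open(self, path, flags, mode=0o600):
        fd = os.open(path, flags, mode)  # forward to the real call
        self.files[fd] = FileState(os.path.abspath(path), 0)
        return fd

    def write(self, fd, data):
        written = os.write(fd, data)
        self.files[fd].offset += written
        return written

    def read(self, fd, length):
        data = os.read(fd, length)
        self.files[fd].offset += len(data)
        return data

    def lseek(self, fd, pos, whence=os.SEEK_SET):
        new_offset = os.lseek(fd, pos, whence)
        self.files[fd].offset = new_offset
        return new_offset

    def close(self, fd):
        os.close(fd)
        del self.files[fd]


def demo():
    shim = Shim()
    with tempfile.TemporaryDirectory() as tmp:
        fd = shim.open(os.path.join(tmp, "table.data"), os.O_RDWR | os.O_CREAT)
        shim.write(fd, b"hello world")
        after_write = shim.files[fd].offset
        shim.lseek(fd, 6)
        data = shim.read(fd, 5)
        mirrored = shim.files[fd].offset
        real = os.lseek(fd, 0, os.SEEK_CUR)  # ask the kernel for the true offset
        shim.close(fd)
        return after_write, data, mirrored, real, len(shim.files)
```

In `demo()`, the mirrored offset is compared against the offset the kernel reports, which is the property the mirror has to maintain.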

[Figure: hotbackupblog2_2]

As mysqld reads from and writes to the file, which in this case is usually a file representing a database, we update the file offset for that file.  This occurs even if no backup is in progress.  Once a backup is initiated, we begin copying each file.

Locks

As seen in my last post, we need to prevent races between our copy and mysqld’s writes to the same file.  We do this by locking each segment of the file as we copy it.  Most writes mysqld performs will not block on this lock.  In the rare case that mysqld is trying to write to the same segment that is being copied, whichever arrives second waits for the other to finish.

If mysqld’s write path (an UPDATE, for example) wins the race, the backup library blocks until the write is done.  Once mysqld finishes the write, the library copies the newly altered data.
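The segment locking can be sketched as a byte-range lock.  This is only an illustration under my own assumptions: `RangeLock` is a hypothetical name, the ranges are half-open `(start, end)` pairs, and the real library's locking scheme may look quite different.  Requests that overlap a held range wait; everything else proceeds immediately.

```python
import threading


class RangeLock:
    """Hypothetical byte-range lock for one file; overlapping requests wait."""

    def __init__(self):
        self._cond = threading.Condition()
        self._held = []  # half-open (start, end) ranges currently locked

    def _overlaps(self, start, end):
        return any(s < end and start < e for s, e in self._held)

    def acquire(self, start, end):
        with self._cond:
            while self._overlaps(start, end):
                self._cond.wait()
            self._held.append((start, end))

    def release(self, start, end):
        with self._cond:
            self._held.remove((start, end))
            self._cond.notify_all()


def demo():
    lock = RangeLock()
    events = []
    lock.acquire(0, 4096)  # the backup copier holds the first segment

    # A write to a different segment does not block.
    lock.acquire(8192, 8200)
    events.append("write to other segment")
    lock.release(8192, 8200)

    def writer():
        lock.acquire(100, 200)  # overlaps the segment being copied
        events.append("write to copied segment")
        lock.release(100, 200)

    thread = threading.Thread(target=writer)
    thread.start()
    thread.join(timeout=0.2)
    blocked = thread.is_alive()  # the writer is still waiting on the copier
    events.append("copy done")
    lock.release(0, 4096)
    thread.join()
    return blocked, events
```

Here the copier wins the race, so the overlapping write is recorded only after the copy finishes, while the write to a different segment never waits.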

[Figure: hotbackupblog2_3]

The more interesting case is when our backup library wins the race.  Once the library finishes copying the data, it will release the lock, allowing mysqld to alter that data.  In this case the backup copy of the data will be stale: it won’t match the original, because the copied segment on the backup lacks the most recent change.

[Figure: hotbackupblog2_4]

The solution for this situation is simple.  During a backup, we apply any changes mysqld makes to both the original file AND the backup file.  This occurs even if the backup library has yet to copy the respective file.  This does require every write to be written to two different files, but again, this does not occur if there is no backup in progress.
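A small sketch of this double write, again in Python and again only an approximation: `mirrored_pwrite` and `copy_segment` are hypothetical helpers, positional I/O (`os.pwrite`/`os.pread`) stands in for whatever calls mysqld actually makes, and error handling for a failed or short write to the backup is left out.

```python
import os
import tempfile


def mirrored_pwrite(src_fd, backup_fd, data, offset, backup_active):
    """Write to the original file and, during a backup, to the copy as well."""
    written = os.pwrite(src_fd, data, offset)
    if backup_active:
        os.pwrite(backup_fd, data[:written], offset)
    return written


def copy_segment(src_fd, backup_fd, offset, length):
    """The backup copier: copy one segment from the original to the backup."""
    chunk = os.pread(src_fd, length, offset)
    os.pwrite(backup_fd, chunk, offset)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        src = os.open(os.path.join(tmp, "orig"), os.O_RDWR | os.O_CREAT, 0o600)
        bak = os.open(os.path.join(tmp, "copy"), os.O_RDWR | os.O_CREAT, 0o600)
        try:
            os.pwrite(src, b"AAAABBBB", 0)
            copy_segment(src, bak, 0, 4)               # first segment copied
            mirrored_pwrite(src, bak, b"ZZ", 1, True)  # hits a copied segment
            mirrored_pwrite(src, bak, b"YY", 6, True)  # hits an uncopied segment
            copy_segment(src, bak, 4, 4)               # second segment copied
            return os.pread(src, 8, 0), os.pread(bak, 8, 0)
        finally:
            os.close(src)
            os.close(bak)
```

The demo covers both orderings: a write that lands on an already copied segment, and one that lands on a segment the copier has not reached yet.  Either way the two files end up identical.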

This diagram shows what the library does when it wins the race and must apply the changes from mysqld.  This occurs after the library relinquishes the data segment lock to mysqld.

[Figure: hotbackupblog2_5]

Results

At the end of a backup we have a copy of all our original database and log files.  This backup data can be used to start a new instance of TokuDB.  When you start TokuDB with the backup files, it performs recovery, using the log to remove any uncommitted transactions.  These transactions must be undone, similar to crash recovery, because backup does not end on a transactional boundary.

Remember, our backup library has no awareness of the log or how it relates to the database files.  This is actually OK.  We have enough information to recover the database to a state VERY close to the time the backup finished (this time is reported in the original mysqld’s error log).  Users will still end up with a consistent and correct database.  The caveat is that any transactions still active, and therefore uncommitted, at the moment hot backup finishes will be undone upon recovery.

Users are now able to take hot backups of an active system running TokuDB, with no downtime.  The backup library does not use much memory, and does not spoil the cache used for the tables.  The backup process can also be throttled so that it copies the files at a slower rate.  This throttling is especially useful if there are frequent disk accesses, such as when the data being processed does not fit in main memory.
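Throttling of this kind can be sketched as a rate-limited copy loop.  This is my own illustration, not the library's implementation: the function name, the `bytes_per_sec` parameter, and the chunk size are all assumptions.  After each chunk the loop sleeps just long enough to keep the average rate under the cap.  The clock and sleep functions are injectable so the demo can run against a fake clock.

```python
import io
import time


def throttled_copy(src, dst, bytes_per_sec, chunk_size=64 * 1024,
                   clock=time.monotonic, sleep=time.sleep):
    """Copy src to dst, keeping the average rate at or below bytes_per_sec."""
    start = clock()
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return copied
        dst.write(chunk)
        copied += len(chunk)
        # Earliest moment at which `copied` bytes are allowed to have been sent.
        earliest = start + copied / bytes_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)


def demo():
    now = [0.0]  # fake clock, so the demo does not really sleep

    def fake_sleep(seconds):
        now[0] += seconds

    src = io.BytesIO(b"x" * 1000)
    dst = io.BytesIO()
    copied = throttled_copy(src, dst, bytes_per_sec=100, chunk_size=250,
                            clock=lambda: now[0], sleep=fake_sleep)
    return copied, now[0], dst.getvalue() == b"x" * 1000
```

With a cap of 100 bytes per second, copying 1000 bytes takes 10 seconds of (fake) time in the demo.  A lower cap leaves more disk bandwidth for the database's own reads and writes.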

Next week I will show how we integrated the hot backup library into our other product, TokuMX.  It offers the same feature set and is useful not only for taking backups, but also for seeding new instances for an existing replica set without any downtime.

6 thoughts

  1. Shlomi Noach says:

    This sounds like a general-purpose file snapshot that can be attached to any running linux process.
    It’s worth noting this kind of solution works well for transactional engines, but not for MyISAM; there is no escape from FLUSH TABLES when you deal with MyISAM — and MySQL’s system tables are MyISAM.

  2. Shlomi,

    Assuming it intercepts read() and write() syscalls, etc, then it will work for MyISAM. The OS is free to buffer the results of read() and write() calls but the SHIM doesn’t have to do so.

    The backup would end up being more consistent than the original in the case of a crash. The index could, however, be out of sync, because there is no log between the .MYI and .MYD files. So myisamchk would still be required.

  3. Does it work with async io? InnoDB uses libaio in 5.5+

  4. For InnoDB, I think libaio should not matter. The doublewrite buffer is written synchronously and the log itself is not async (it may be direct_io, but that should not be a problem). As long as the log data and doublewrite buffer are there, any lost or partially written async blocks will be corrected during recovery.

  5. gggeek says:

    I think you mean “transparent” when you say “opaque” :-)
