We recently made transactions in TokuDB 3.0 durable. We write table changes into a log file so that in the event of a crash, the table changes made since the last checkpoint can be replayed. Durability requires the log file to be fsync’ed when a transaction is committed. Unfortunately, fsyncs are not free, and each one may cost tens of milliseconds. This can seriously reduce the insertion rate into a TokuDB table. How can one achieve high insertion rates in TokuDB with durable transactions?
Decrease the fsync cost
The fsync of the TokuDB log file writes all of the dirty log file data that is cached in memory by the operating system to the underlying storage system. The fsync time can be modeled with a simple linear equation: fsync time = N/R + K, where N is the amount of dirty data that needs to be written to disk, R is the disk write rate, and K is a constant time defined by the storage system, such as disk seek time. We want to minimize the fsync time. Note that this model is conceptual and has not been verified by experiment, but it is good enough to identify some opportunities for decreasing the fsync cost.
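To make the model concrete, here is a minimal Python sketch of fsync time = N/R + K. The function name, parameter names, and all of the numbers are assumptions chosen for illustration, not measurements from TokuDB or from any particular storage system.

```python
def fsync_time(dirty_bytes: float, write_rate: float, fixed_cost: float) -> float:
    """Conceptual model from the text: fsync time = N/R + K, in seconds.

    dirty_bytes -- N, bytes of dirty log data cached by the operating system
    write_rate  -- R, sustained write rate of the log device in bytes/second
    fixed_cost  -- K, constant per-fsync cost in seconds (e.g. seek time)
    """
    if write_rate <= 0:
        raise ValueError("write_rate must be positive")
    return dirty_bytes / write_rate + fixed_cost


MB = 1_000_000

# Illustrative values only (assumed): a 100 MB/s device with a 10 ms fixed cost.
small = fsync_time(dirty_bytes=4_000, write_rate=100 * MB, fixed_cost=0.010)
large = fsync_time(dirty_bytes=50 * MB, write_rate=100 * MB, fixed_cost=0.010)

print(f"small commit: {small * 1000:.2f} ms")
print(f"large commit: {large * 1000:.2f} ms")
```

With these assumed values the small commit is dominated by K, so only a lower fixed cost helps it, while the large commit is dominated by N/R, so it benefits from more bandwidth or less log data.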
One can increase the bandwidth of the log device (R). Suppose that large rows are being inserted into the database. A very large amount of log file data may be cached in memory by the operating system before the fsync. A storage system, perhaps a striped RAID, with a high write bandwidth will be able to store this log data quickly.
One can decrease the amount of data (N) that must be written to the log. There are several techniques that TokuDB may use here, including logging table changes rather than dictionary changes, and compressing the log files. These techniques will be shipped in a future TokuDB release.
One can decrease the constant fsync cost (K). A battery-backed RAID may speed up fsyncs, since it writes data to non-volatile memory that is faster than the disks in the RAID.
One can put the TokuDB logs on their own storage system by using the tokudb_log_dir MySQL system variable. This will increase the overall system write bandwidth, and also eliminate contention between the TokuDB log and the TokuDB fractal tree.
Amortize the fsync cost with large transactions
When the fsync time is a significant fraction of the time to execute a transaction, the insertion rate can be increased by using larger transactions.
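The amortization argument can be sketched as a simple throughput model: each transaction pays the per-row insertion time for every row plus one fsync at commit. The function name and the timings below are assumptions for illustration only. They are not derived from the iibench runs.

```python
def insertion_rate(rows_per_txn: int, row_time: float, fsync_time: float) -> float:
    """Rows per second when every transaction inserts rows_per_txn rows
    and then pays a single fsync at commit.

    row_time   -- assumed time to insert one row, in seconds
    fsync_time -- assumed cost of one log fsync, in seconds
    """
    if rows_per_txn <= 0:
        raise ValueError("rows_per_txn must be positive")
    txn_time = rows_per_txn * row_time + fsync_time
    return rows_per_txn / txn_time


# Assumed timings: 50 microseconds per row, 10 ms per fsync.
for rows in (1, 10, 100, 1_000, 10_000):
    rate = insertion_rate(rows, row_time=50e-6, fsync_time=0.010)
    print(f"{rows:>6} rows/txn -> {rate:,.0f} rows/sec")
```

In this model the rate approaches 1/row_time as transactions grow, so the gain from larger transactions flattens out once the fsync is a small fraction of the transaction time.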
Does this actually work? We ran iibench with tokudb_commit_sync ON and 1000 rows per transaction, and measured an insertion rate of over 10K rows/second at 250M rows on a Sun Fire X4150. With 10,000 rows per transaction, we measured an insertion rate of 15K rows/second.
Relax the durability requirements
Several MySQL storage engines provide mechanisms that relax durability by decoupling the fsync from the transaction commit. TokuDB provides the tokudb_commit_sync MySQL session variable, which works as follows.
If tokudb_commit_sync=ON, then the TokuDB log file is fsync’ed when the transaction commits. When used in this way, all transactions are durable. This is the default setting.
If tokudb_commit_sync=OFF, then the TokuDB log file is not fsync’ed when the transaction commits. When used in this way, the TokuDB tables recover to a transactionally consistent state that may not include the transactions committed after the last TokuDB checkpoint. It all depends on when the TokuDB log was last fsync’ed.
The TokuDB log consists of a sequence of log files, each of which is about 100MB in size. TokuDB will fsync a log file whenever it fills one up and needs to create the next one. The TokuDB log will also be fsync’ed by a commit of a transaction in another MySQL client connection that has its tokudb_commit_sync session variable set ON.
How does a TokuDB checkpoint work? TokuDB checkpoints open dictionaries every 60 seconds by taking a snapshot of the current state of the dictionaries, writing all of the dirty dictionary data to disk, fsyncing the dictionary files, and finally fsyncing the TokuDB log.
We ran iibench with tokudb_commit_sync OFF and measured an insertion rate of over 17K rows/second at 250M rows on a Sun Fire X4150.
The tokudb_commit_sync session variable may be used to implement application-defined durability. For example, the application can set the tokudb_commit_sync session variable ON for one commit per second rather than for every transaction. The effect is that roughly one second’s worth of committed transactions is vulnerable to loss in a crash.
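One way an application might implement the once-per-second policy is sketched below. The class name PeriodicCommitSync and the execute callable are hypothetical: execute stands in for whatever function the application uses to send an SQL statement on its MySQL connection. Only the tokudb_commit_sync session variable itself comes from TokuDB.

```python
import time
from typing import Callable, Optional


class PeriodicCommitSync:
    """Hypothetical helper: commit with tokudb_commit_sync ON at most once
    per interval, and with it OFF for every other commit."""

    def __init__(self, execute: Callable[[str], None], interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._execute = execute      # assumed: sends one SQL statement
        self._interval = interval    # seconds between synced commits
        self._clock = clock
        self._last_sync = clock()
        self._sync_on: Optional[bool] = None  # session state unknown at start

    def commit(self) -> bool:
        """Commit the current transaction; return True if it was synced."""
        now = self._clock()
        want_sync = now - self._last_sync >= self._interval
        # Only touch the session variable when its value has to change.
        if want_sync != self._sync_on:
            value = "ON" if want_sync else "OFF"
            self._execute("SET SESSION tokudb_commit_sync = " + value)
            self._sync_on = want_sync
        self._execute("COMMIT")
        if want_sync:
            self._last_sync = now
        return want_sync
```

The application would call commit() on this helper instead of issuing COMMIT directly. Because the log is fsync’ed by the synced commit, the unsynced commits written before it should become durable at that point as well. A crash between synced commits can still lose up to about one interval of committed work.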
Durable transactions in TokuDB increase the amount of data written to the storage system. This may make the storage system a performance bottleneck. If this is the case, one may consider using a storage system with higher write bandwidth for the TokuDB logs. One may use larger transactions to amortize the fsync cost over a larger number of rows. One may also consider relaxing the durability requirements of the application and letting the application control when the fsyncs occur.