Attempting to Quantify Fragmentation Effects

Posted On September 18, 2009 | By Tokutek | 4 comments

We often hear from customers and MySQL experts that fragmentation causes problems such as wasted disk space, longer backup times, and degraded performance. Typical remedies include periodic “optimize table” runs or a dump and re-load (for example, see Project Golden Gate). Unfortunately, these techniques impact database availability and/or add administrative cost and complexity. Tokutek’s Fractal Tree algorithms do not cause fragmentation, and we’re looking for ways to measure the effects of fragmentation to quantify TokuDB’s benefits.

I ran some tests using the iiBench benchmark to try to quantify the impact of fragmentation, and observed some interesting results.

Initial Load – 50M Rows

I created an iiBench table with 50M rows and recorded how long the load took to complete, along with the amount of disk space used by the data and indexes; log files are *not* included in the reported disk use.

$ iibench.py --setup --insert_only --engine=[innodb | tokudb] --max_rows=50000000 --table_name=t1
Operation          Time (TokuDB)   Time (InnoDB)   Disk Use (TokuDB)   Disk Use (InnoDB)
Insert 50M Rows    3,349s          28,821s         3.3GiB              11GiB
select count(*)    56s             643s            3.4GiB              11GiB
select count(*)    19.0s           14.4s           3.5GiB              11GiB
optimize table     153s            28,456s         4.5GiB              20GiB

select count(*) was run twice to see how fast it ran with the cache warmed up (the query cache was off).
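For anyone reproducing this, the measurement amounts to the statements below. This is a sketch, not the exact session I ran:

SET GLOBAL query_cache_size = 0;   -- make sure the query cache can't serve the result
SELECT COUNT(*) FROM t1;           -- first run: cold cache, pages come from disk
SELECT COUNT(*) FROM t1;           -- second run: pages served from the warmed-up cache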

When inserting into TokuDB, some data may remain in internal data structures, so optimize table was run to push all of the data to disk for an accurate measure of disk use. Initially, I thought running optimize table would not be necessary for InnoDB, but I ran it to ensure a consistent procedure on both engines. Surprisingly, InnoDB used much more disk space after running optimize table. According to this post, InnoDB’s insert buffer is stored in the tablespace, so data in the insert buffer should have been included in the measured disk use before optimizing the table. I’m not sure what caused such a large increase in disk use.
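For reference, the flush step is just the standard statement below; SHOW TABLE STATUS is an alternative per-table size check (shown as a sketch, not the method used for the numbers above, which came from file sizes):

OPTIMIZE TABLE t1;              -- pushes TokuDB's buffered rows down to disk
SHOW TABLE STATUS LIKE 't1'\G   -- Data_length + Index_length estimate the table's
                                -- size independently of tablespace file growth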

Deleting 10M Rows

mysql> delete from t1 where transactionid <= 10000000;
Operation          Time (TokuDB)   Time (InnoDB)   Disk Use (TokuDB)   Disk Use (InnoDB)
Delete 10M rows    1,050s          169,238s        4.7GiB              20GiB
select count(*)    42s             772s            4.7GiB              20GiB
select count(*)    15.6s           809s            4.7GiB              20GiB
select count(*)    14.7s           802s            4.7GiB              20GiB
optimize table     129s            28,539s         4.6GiB              20GiB
select count(*)    16.2s           372s            4.6GiB              20GiB
select count(*)    13.8s           11.6s           4.6GiB              20GiB

Deleting 10M rows took about 17 minutes on TokuDB and over 47 hours on InnoDB. After the delete, select count(*) on InnoDB ran slowly even after attempts to warm up the cache; I ran it three times, and it was consistently slow. After running optimize table, select count(*) ran much faster, suggesting that the slowdown was caused by fragmentation. TokuDB does not fragment, so select count(*) ran fast on TokuDB after deleting rows, with no need to optimize.
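As a side note, InnoDB answers count(*) by scanning an index, so fragmentation in that index turns directly into extra I/O. A quick way to see which index the scan uses is EXPLAIN (a sketch only, not part of the timed runs):

EXPLAIN SELECT COUNT(*) FROM t1\G   -- the "key" column shows the index being scanned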

Summary

Deleting 10M rows from a 50M row table caused the time to run select count(*) on InnoDB to increase by a factor of about 55. Running optimize table solved the problem, but it took almost 8 hours to complete. Dumping and reloading is likely to be faster than optimize or alter, but it still takes time and effort.
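For reference, a dump and re-load can be done entirely in SQL; here is a minimal sketch (the file path is illustrative, and the table is unavailable for writes during the pass):

SELECT * INTO OUTFILE '/tmp/t1.dump' FROM t1 ORDER BY transactionid;  -- dump in PK order
TRUNCATE TABLE t1;                              -- drop and recreate the table
LOAD DATA INFILE '/tmp/t1.dump' INTO TABLE t1;  -- reload into freshly allocated pages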

After deleting 10M rows on TokuDB, the time to run select count(*) decreased from about 19s to about 15s, roughly proportional to the drop from 50M to 40M rows in the table (19.0s × 40/50 ≈ 15.2s), with no need to optimize or dump and reload.

Going Further

Posts from Mark Callaghan and Bradley C. Kuszmaul show that iiBench’s linear distribution of data does not provide a good model of some real world data sets, and a Zipfian distribution is probably a better model. It would be interesting to re-run the experiment with an updated version of iiBench using a Zipfian distribution. Running similar delete experiments on large real world data sets would be interesting as well.

Additional Details

I ran the tests on a machine with a modest amount of memory and slow cores by today’s standards.

  • CentOS 5.1
  • Dell PowerEdge 2950
  • 2-socket, quad-core Intel Xeon, 1.6GHz
  • 4GB Main Memory (2GB InnoDB Buffer Pool, 2GB TokuDB Cache)
  • 5-disk software RAID5 (1TB SATA)
  • ext3 Filesystem

I used default parameters for TokuDB; the InnoDB parameters are shown in the my.cnf file below. By default, TokuDB uses half of physical memory (2GB here) for its cache, so the InnoDB buffer pool was set to 2GB for a fair comparison. It may be possible to achieve better InnoDB results through tuning, but the goal of this exercise was to find ways to quantify the impact of fragmentation.

[mysqld]

innodb_log_buffer_size=4M          # memory buffer for redo log writes
innodb_thread_concurrency=8        # cap on threads running inside InnoDB at once
innodb_log_files_in_group=3        # three redo log files ...
innodb_log_file_size=1300M         # ... of 1300M each, ~3.8G of redo log total
innodb_flush_log_at_trx_commit=2   # write log at commit, fsync ~once per second
innodb_doublewrite=0               # doublewrite buffer off
innodb_buffer_pool_size=2G         # matches TokuDB's 2G default cache
innodb_max_dirty_pages_pct=90      # allow up to 90% of the pool to be dirty

4 thoughts

  1. Baron says:

    How did you measure disk space for InnoDB? If you looked at the size of the files on disk, it would report the disk space used for the entire tablespace, which would have grown to accommodate the temp table created by OPTIMIZE. The tablespace doesn’t shrink thereafter, but SHOW TABLE STATUS should show the true size of the table and indexes. I would expect it to be about 10GiB in total, with about 10GiB free space in the tablespace.

  2. Mark Callaghan says:

    You should measure the time for select count(*) queries that are index-only on secondary indexes. That will show the impact of fragmentation on InnoDB.

    Optimize table for InnoDB (prior to the plugin) does alter table which copies the table to rebuild it and all indexes. Thus the space used by the innodb datafile doubles. That space would be used (reclaimed) were you to insert more rows.

    Rebuilding an InnoDB table (prior to the plugin) only defragments the primary key index as inserts during the rebuild are done in PK order. Secondary indexes are likely to be as fragmented after the rebuild as they were prior. With the innodb plugin and the fast index creation feature, secondary indexes can be created by an external sort and can be defragmented.

    As inserts are done in PK order for iibench, there is no benefit from using optimize table for innodb.

    Percona has a tool that reports on fragmentation per index — http://www.mysqlperformanceblog.com/2009/09/14/statistics-of-innodb-tables-and-indexes-available-in-xtrabackup/

    SELECT statements can be slower than expected when there is a lot of purge lag which might occur after a lot of rows have been deleted — http://mysqlha.blogspot.com/2008/07/how-do-you-know-when-innodb-gets-behind.html

  3. Baron and Mark,

    Thanks for the expert input. I’m reporting the size of the ibdata1 tablespace file, so the table copy during the optimize step explains the increase. When I get the time, I’ll repeat the experiment, collect additional data from SHOW TABLE STATUS and Percona’s xtrabackup statistics, and run some queries that use the secondary indexes. I’ll also investigate purge lag.
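    A sketch of what those follow-up checks might look like (the index name below is whatever iibench created, so treat it as an assumption):

    SELECT COUNT(*) FROM t1 FORCE INDEX (marketsegment);  -- index-only scan of a secondary index (name assumed)
    SHOW ENGINE INNODB STATUS\G   -- purge lag appears as "History list length" in the TRANSACTIONS section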

  4. Petrik_CZ says:

    You should not use RAID5 for a DB; its random write latency is tragic, which is very bad for a database. Thanks for the nice comparison of these two DB engines!!