Loading Air Traffic Control Data with TokuDB 4.1.1

Posted On August 27, 2010 | By Rich Prohaska | 7 comments

TokuDB has a big advantage over B-trees when trickle loading data into existing tables. When bulk loading into empty tables or creating new indexes, however, the data can be preprocessed, and TokuDB release 4 now uses a parallel algorithm to speed up these kinds of bulk insertions. How does the parallel loader's performance compare with the serial loader's? To gain some insight, we use the Air Traffic Control (ATC) data and queries described in a Percona blog post and also used in an earlier experiment with TokuDB 2.1.0.

Our ATC data is about 122M rows, stored in a 40GiB CSV file, and can be found in our Amazon S3 public bucket; see the end of this blog for details. We use a table schema with 8 indexes to speed up the ATC queries. The loader inserts the data into the primary fractal tree and into one fractal tree for each of the 8 keys.
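The actual schema is in atc_ontime_create_covered.sql in the bucket; the sketch below is only illustrative (the table name, column names, and key choices are our stand-ins, not the real definitions), but it shows the shape of a multi-key TokuDB table:

```sql
-- Illustrative sketch only; the real schema is in atc_ontime_create_covered.sql
-- (see the S3 bucket at the end of this post). Table, column, and key names
-- here are stand-ins.
CREATE TABLE ontime (
    Year       SMALLINT NOT NULL,
    Month      TINYINT  NOT NULL,
    DayofMonth TINYINT  NOT NULL,
    Carrier    CHAR(2),
    Origin     CHAR(5),
    Dest       CHAR(5),
    DepDelay   INT,
    ArrDelay   INT,
    KEY (Year),
    KEY (Carrier, Year),
    KEY (Origin, DepDelay)
    -- ...the real schema defines 8 secondary keys in total
) ENGINE=TokuDB;
```

Each secondary KEY becomes its own fractal tree, which is why the loader builds 9 trees in parallel (primary plus 8 keys).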

The load was run on two (old) machines:

  • 2 core: Intel Core 2 Duo E8500 3.16GHz, 5GiB RAM, 1 disk SATA 1TB.
  • 8 core: 2 socket quad core Xeon X5460 3.16GHz, 16GiB RAM, 6 disk SAS 1TB RAID0.

Load times:

    Load Times for the ATC Database

             TokuDB 2.1.0 and MySQL 5.1.36   TokuDB 4.1.1 and MySQL 5.1.46   TokuDB Speedup
    2 core   10,974s                         5,201s                          2.1x
    8 core   11,286s                         2,655s                          4.2x
  • We use “LOAD DATA INFILE” to load the CSV file into a MySQL table.
  • We see over a 4x load-time speedup for this database, which compares favorably with other loaders. The next steps are to run experiments on machines with more cores, identify bottlenecks, and remove them.
  • BTW, the ATC load times into MyISAM are roughly the same as seen with the TokuDB 2.1.0 serial loader. Perhaps loads into MyISAM can be made faster with some twiddling of the MySQL system variables.
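For reference, the load itself is a single statement along the following lines. The file path, table name, and field-delimiter options are our assumptions, not taken from the post:

```sql
-- Sketch of the load statement; path, table name, and delimiter options
-- are illustrative. The CSV is obtained by concatenating and decompressing
-- the .gz.aa through .gz.ad parts listed at the end of this post.
LOAD DATA INFILE '/data/atc_On_Time_Performance.mysql.csv'
INTO TABLE ontime
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';
```

Because the target table is empty, this single statement is what triggers the parallel bulk-load path measured above.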

TokuDB data sizes (including indices):

  • 6.7 GiB, or almost 6 times smaller than the CSV source file. The parallel loader uses the same compression parameters as the serial loader, so no change in size occurred.

Query times:

    Query Times for the ATC Database

             TokuDB 2.1.0   TokuDB 4.1.1
    Q0       22s            28s
    Q1       67s            33s
    Q2       20s            24s
    Q3       54s            29s
    Q4       2s             1s
    Q5       11s            7s
    Q6       35s            29s
    Q7       32s            39s
    Q8.1y    7s             4s
    Q8.2y    42s            32s
    Q8.3y    37s            33s
    Q8.4y    38s            35s
    Q8.10y   139s           56s
    Q9       36s            35s
  • TokuDB 4 uses the standard MySQL handler row iterator, so the query times are roughly unchanged. In addition, the query handling is not yet parallelized, so there is not much difference in query times on the 2 core and 8 core machines.
  • TokuDB 4 uses the standard MySQL handler row iterator, so the query times are roughly unchanged. In addition, the query handling is not yet parallelized, so there is not much difference in query times on the 2 core and 8 core machines.
  • BTW, the query times for MyISAM are slightly longer than TokuDB's.
  • The queries can be found in our Amazon S3 public bucket.

Future blogs:

  • Loader scalability on systems with a larger number of cores. Machines with 48 cores (4 sockets, 12 cores per socket) are now reasonably priced.
  • How we implemented the parallel loader: algorithms and parallel runtime.

The ATC CSV data files, the schema, and the queries can be retrieved from our public Amazon S3 bucket called tokutek-pub. Here are the Amazon S3 keys:

  • atc_On_Time_Performance.mysql.csv.gz.aa
  • atc_On_Time_Performance.mysql.csv.gz.ab
  • atc_On_Time_Performance.mysql.csv.gz.ac
  • atc_On_Time_Performance.mysql.csv.gz.ad
  • atc_On_Time_Performance.mysql.csv.gz.xml
  • atc_ontime_create_covered.sql
  • atc_ontime_create_covered.sql.xml
  • atc_ontime_queries.tar.gz
  • atc_ontime_queries.tar.gz.xml
  • s3get
  • s3get.xml

7 thoughts

  1. Jeremy Cole says:

    Hi,

    Just a minor nit-pick: This is not ATC data, it is from the Bureau of Transportation Statistics (BTS) and it is self-reported by commercial airlines over a certain size. It has nothing to do with Air Traffic Control at all, it is designed only for analyzing the performance of air traffic as a transportation mechanism.

    Regards,

    Jeremy

  2. Does the loader produce binary logs?

  3. Rich Prohaska says:

    The tokudb loader operates under the storage engine handler’s start_bulk_insert, write_row, and end_bulk_insert methods. For index creation, the loader operates under the add_index method. Because of this, MySQL is able to do the binary logging. The tokudb loader is part of our storage engine, not a separate program.

  4. hadar says:

    You said the parallel loader is good for ” bulk loading into empty tables “,
    What happens when I do bulk loading using “load data infile into” for an existing table with records? The parallel loader will not work?

  5. [...] computing resources of the machine to presort and insert the data. In the last couple of posts (here and here), Rich and Dave presented performance results of TokuDB’s loader. Comparing load [...]

  6. zardosht says:

    @hadar, I wrote a post on your question: http://tokutek.com/2010/09/scenarios-where-tokudbs-loader-is-used/. If you use “load data infile” on a table with records already in it, then the “bulk load into an empty table” algorithm is not used, and normal insertions are done. Keep in mind that these normal insertions can be up to two orders of magnitude faster than insertions into a B-tree based storage engine.
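In SQL terms, the distinction zardosht describes looks like this (table, column, and file names are illustrative):

```sql
-- Bulk-load path: the target table is empty, so the parallel loader runs.
CREATE TABLE t LIKE ontime;                          -- names are illustrative
LOAD DATA INFILE '/data/part1.csv' INTO TABLE t;

-- The table now has rows, so a second load takes the normal insertion path
-- (still fast for fractal trees, but not the bulk loader).
LOAD DATA INFILE '/data/part2.csv' INTO TABLE t;

-- Index creation on a populated table also goes through the loader,
-- via the handler's add_index method (per Rich's comment above).
ALTER TABLE t ADD INDEX (DepDelay);
```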

  7. Martin Farach-Colton says:

    Fractal trees are both relatively new and proprietary to Tokutek.
