Improving TPC-H-like queries – Q17

Posted On June 15, 2009 | By Bradley C. Kuszmaul | 9 comments

Executive Summary: A query like TPC-H Query 17 can be sped up by large factors by using straight_joins and clustering indexes. (This entry posted by Dave.)

In a previous post, we wrote about queries like TPC-H query 2, and the use of straight_join to improve performance.
This week, we consider Query 17, described by the TPC-H documentation as

“The Small-Quantity-Order Revenue Query considers parts of a given brand and with a given container type and determines the average lineitem quantity of such parts ordered for all orders (past and pending) in the 7-year database. What would the average yearly gross (undiscounted) loss in revenue if orders for these parts with a quantity of this average were no longer taken?”

Our initial run on Q17 (same hardware as before) timed out after 3600 seconds.

The MySQL query is:

select
        sum(l_extendedprice) / 7.0 as avg_yearly
from
        lineitem,
        part
where
        p_partkey = l_partkey
        and p_brand = 'Brand#33'
        and p_container = 'WRAP PACK'
        and l_quantity < (
                select
                        0.2 * avg(l_quantity)
                from
                        lineitem
                where
                        l_partkey = p_partkey
        );

The query plan was:

mysql> explain select sum(l_extendedprice) / 7.0 as avg_yearly from lineitem, part 
    -> where p_partkey = l_partkey and p_brand = 'Brand#33' and p_container = 'WRAP PACK' and l_quantity < (select 0.2 * avg(l_quantity) from lineitem where l_partkey = p_partkey);

+----+--------------------+----------+--------+---------------+---------+---------+-----------------------------------+----------+-------------+
| id | select_type        | table    | type   | possible_keys | key     | key_len | ref                               | rows     | Extra       |
+----+--------------------+----------+--------+---------------+---------+---------+-----------------------------------+----------+-------------+
|  1 | PRIMARY            | lineitem | ALL    | NULL          | NULL    | NULL    | NULL                              | 59986052 |             |
|  1 | PRIMARY            | part     | eq_ref | PRIMARY       | PRIMARY | 4       | tpch10G_tokudb.lineitem.l_partkey |        1 | Using where |
|  2 | DEPENDENT SUBQUERY | lineitem | ALL    | NULL          | NULL    | NULL    | NULL                              | 59986052 | Using where |
+----+--------------------+----------+--------+---------------+---------+---------+-----------------------------------+----------+-------------+

This plan does not take advantage of the selectivity of picking a specific brand and container to reduce the search set.
Using STRAIGHT_JOIN (as described in our post on Q2) and ordering the FROM clause to select the part table first sets the plan on a better course.

mysql> explain select straight_join sum(l_extendedprice) / 7.0 as avg_yearly from part, lineitem 
    -> where p_partkey = l_partkey and p_brand = 'Brand#33' and p_container = 'WRAP PACK' and l_quantity < (select 0.2 * avg(l_quantity) from lineitem where l_partkey = p_partkey);

+----+--------------------+----------+------+---------------+------+---------+------+----------+--------------------------------+
| id | select_type        | table    | type | possible_keys | key  | key_len | ref  | rows     | Extra                          |
+----+--------------------+----------+------+---------------+------+---------+------+----------+--------------------------------+
|  1 | PRIMARY            | part     | ALL  | PRIMARY       | NULL | NULL    | NULL |  2000000 | Using where                    |
|  1 | PRIMARY            | lineitem | ALL  | NULL          | NULL | NULL    | NULL | 59986052 | Using where; Using join buffer |
|  2 | DEPENDENT SUBQUERY | lineitem | ALL  | NULL          | NULL | NULL    | NULL | 59986052 | Using where                    |
+----+--------------------+----------+------+---------------+------+---------+------+----------+--------------------------------+

However, the query bogs down when searching the lineitem table.
Although an index was available in the LINEITEM table to quickly search the table (lineitem_fk3), the subsequent data lookup (along the primary key) caused a long(!) series of point queries.

Unlike many storage engines, TokuDB supports clustering indexes on any index (not just the primary key), and further supports clustering indexes on more than one index per table.

   mysql > create clustering index lineitem_ck3 on lineitem (l_partkey, l_suppkey);

This results in the following query/plan:

mysql> explain select straight_join sum(l_extendedprice) / 7.0 as avg_yearly from part, lineitem
    -> where p_partkey = l_partkey and p_brand = 'Brand#33' and p_container = 'WRAP PACK' and l_quantity < (select 0.2 * avg(l_quantity) from lineitem where l_partkey = p_partkey);
+----+--------------------+----------+------+---------------+--------------+---------+-------------------------------+---------+-------------+
| id | select_type        | table    | type | possible_keys | key          | key_len | ref                           | rows    | Extra       |
+----+--------------------+----------+------+---------------+--------------+---------+-------------------------------+---------+-------------+
|  1 | PRIMARY            | part     | ALL  | PRIMARY       | NULL         | NULL    | NULL                          | 2000000 | Using where | 
|  1 | PRIMARY            | lineitem | ref  | lineitem_ck3  | lineitem_ck3 | 4       | tpch10G_tokudb.part.p_partkey |  599860 | Using where | 
|  2 | DEPENDENT SUBQUERY | lineitem | ref  | lineitem_ck3  | lineitem_ck3 | 4       | tpch10G_tokudb.part.p_partkey | 5998606 |             | 
+----+--------------------+----------+------+---------------+--------------+---------+-------------------------------+---------+-------------+

Although it appears to look at more rows, each table is scanned efficiently down a key that provides all of the information needed to address the query.

We measured 101 sec on this query. Maybe some day we'll be patient enough to let the original plan complete, but we know this approach is at least 36 (3600/101) times faster.

9 thoughts

  1. Jay Pipes says:

    You should get much better performance if you rewrote the query to the following:

    select
    sum(li.l_extendedprice) / 7.0 as avg_yearly
    from lineitem li
    inner join part p
    on li.l_partkey = p.p_partkey
    inner join (
    select l_partkey, 0.2 * avg(l_quantity) as quantity
    from lineitem
    group by l_partkey
    ) as quantities
    on li.l_partkey = quantities.l_partkey
    and li.quantity < quantities.quantity
    where
    p.p_brand = ‘Brand#33′
    p.p_container = ‘WRAP PACK';

    and put an index on lineitem (l_partkey, l_quantity)

    Not sure if TPC-H allows you to optimize the terrible (for MySQL) queries it produces, but the above would likely cut the performance by a factor of 100 or more..

    I’d be interested to see what EXPLAIN comes up with for the above rewritten query + additional index.

    Cheers,

    Jay Pipes

    Cheers,

    Jay

  2. Dave says:

    Jay,

    Thanks for the ideas. Will try them in the next couple days and let you know how it goes.

    Dave

  3. Jay Pipes says:

    Hi! Any feedback on the query above?

    Cheers!

    Jay

  4. Dave says:

    Jay,

    I spent some time on your recommendations, and ran enough experiments that I thought it would be worth another post. It should be up shortly.

    Regarding the question of TPC-H rules for defining indexes – my point for the post was to show how clustering indexes can help, not how to get valid TPC-H performance measurement. “TPC-H-like” was used deliberately to avoid contending with the TPC rules. All good ideas are welcome here.

    Dave

  5. Jay Pipes says:

    Awesome, Dave! Looking forward to your post. And, yeah, as I understand it, TPC doesn’t allow for a number of things. This is one of the reasons I strongly support an open source, openly discussed, and vendor-neutral set of benchmarks :)

    Cheers!

    Jay

  6. In the future call it DBT-3 query #17. DBT-3 is the open source OSDB version of TPC-H(tm).

    I take it this was SF1? Looks like you have about 6M rows in the fact table.

  7. I added a key to:
    lineitem(l_partkey, l_quantity)
    and too
    part(p_brand, p_container)

    and then I analyzed both tables. Without any hints MySQL comes up with this:

    mysql> explain
    -> select sum(l_extendedprice) / 7.0 as avg_yearly
    -> from lineitem , part
    -> where p_partkey = l_partkey
    -> and p_brand = ‘Brand#33′
    -> and (p_container = ‘WRAP PACK’ )
    -> and l_quantity < (
    -> select 0.2 * avg(l_quantity)
    -> from lineitem
    -> where l_partkey = p_partkey
    -> );
    +—-+——————–+———-+——+—————————+————–+———+——————————+——+————-+
    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    +—-+——————–+———-+——+—————————+————–+———+——————————+——+————-+
    | 1 | PRIMARY | part | ref | PRIMARY,p_brand | p_brand | 20 | const,const | 157 | Using where |
    | 1 | PRIMARY | lineitem | ref | lineitem_fk3,part_qty_idx | lineitem_fk3 | 4 | tpch1g_myisam.part.p_partkey | 2376 | Using where |
    | 2 | DEPENDENT SUBQUERY | lineitem | ref | lineitem_fk3,part_qty_idx | lineitem_fk3 | 4 | tpch1g_myisam.part.p_partkey | 2376 | |
    +—-+——————–+———-+——+—————————+————–+———+——————————+——+————-+
    3 rows in set (0.00 sec)

    mysql> select sum(l_extendedprice) / 7.0 as avg_yearly from lineitem , part where p_partkey = l_partkey and p_brand = ‘Brand#33′ and (p_container = ‘WRAP PACK’ ) and l_quantity < ( select 0.2 * avg(l_quantity) from lineitem where l_partkey = p_partkey );
    +—————+
    | avg_yearly |
    +—————+
    | 312967.325714 |
    +—————+
    1 row in set (0.23 sec)

    I don’t have a SF10 myisam lying around, but you might want to give it a whirl if you do.

  8. these settings may make a difference:
    key_buffer_size=4G
    read_buffer_size=16M
    tmp_table_size=256M
    max_heap_table_size=256M
    sort_buffer_size=16M

    This query plan doesn’t use a filesort so it shouldn’t generate a temporary table. It should scale pretty well as long as key_buffer_size is >= the size of the new composite index on the lineitem table. You can check the size of that index with the information schema.

    Setting PACK_KEYS=1 on lineitem will probably result in even improved results as you will get compression on repeating integer key values in the composite key.

  9. […] An alternate approach, offered in response to our original post, provides excellent improvements for smaller databases, but clustered indexes offer better […]

Leave a Reply

Your email address will not be published. Required fields are marked *