Another look at improving TPC-H-like queries – Q17

Summary: An alternate approach, offered in response to our original post, provides excellent improvements for smaller databases, but clustered indexes offer better performance as database size increases. (This posting is by Dave.)

Jay Pipes suggested an alternate approach to improving MySQL performance of Query 17 on a TPC-H-like database.

  1. Add the index (l_partkey, l_quantity) to the lineitem table.
  2. Re-write the query as:
    select 
       sum(li.l_extendedprice) / 7.0 as avg_yearly 
    from lineitem li 
       inner join part p on li.l_partkey = p.p_partkey 
       inner join ( select 
                       l_partkey, 0.2 * avg(l_quantity) as quantity 
                    from lineitem 
                    group by l_partkey 
                  ) as quantities 
          on li.l_partkey = quantities.l_partkey and li.l_quantity < quantities.quantity 
    where 
       p.p_brand = 'Brand#33'
       and p.p_container = 'WRAP PACK';
    

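For reference, the default query timed below is the standard correlated-subquery form of Q17. The following is a sketch assuming the usual TPC-H template with this post's substitution parameters (Brand#33, WRAP PACK), rather than a verbatim copy from the original post:

    select
       sum(l_extendedprice) / 7.0 as avg_yearly
    from lineitem, part
    where
       p_partkey = l_partkey
       and p_brand = 'Brand#33'
       and p_container = 'WRAP PACK'
       and l_quantity < ( select
                             0.2 * avg(l_quantity)
                          from lineitem
                          where l_partkey = p_partkey );
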
I ran this suggestion on MySQL with Tokutek's TokuDB storage engine (as before) against a 10G Scale Factor database. After waiting several hours for the EXPLAIN to complete, I punted and moved to a smaller, 1G Scale Factor database. This database nearly fits in memory, which allowed me to examine the query plan.

Create the index:

mysql> alter table lineitem add index li_pkey_quan_idx(l_partkey, l_quantity);

Explain:

mysql> explain select sum(li.l_extendedprice) / 7.0 as avg_yearly from lineitem li
    -> inner join part p on li.l_partkey = p.p_partkey
    -> inner join ( select l_partkey, 0.2 * avg(l_quantity) as quantity from lineitem group by l_partkey ) as quantities
    -> on li.l_partkey = quantities.l_partkey and li.l_quantity < quantities.quantity
    -> where p.p_brand = 'Brand#33' and p.p_container = 'WRAP PACK';
+----+-------------+------------+--------+-------------------------------+--------------+---------+----------------------+---------+-------------+
| id | select_type | table      | type   | possible_keys                 | key          | key_len | ref                  | rows    | Extra       |
+----+-------------+------------+--------+-------------------------------+--------------+---------+----------------------+---------+-------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL                          | NULL         | NULL    | NULL                 |  200000 |             | 
|  1 | PRIMARY     | p          | eq_ref | PRIMARY                       | PRIMARY      | 4       | quantities.l_partkey |       1 | Using where | 
|  1 | PRIMARY     | li         | ref    | lineitem_fk3,li_pkey_quan_idx | lineitem_fk3 | 4       | quantities.l_partkey |   60012 | Using where | 
|  2 | DERIVED     | lineitem   | index  | NULL                          | lineitem_fk3 | 8       | NULL                 | 6001215 |             | 
+----+-------------+------------+--------+-------------------------------+--------------+---------+----------------------+---------+-------------+
4 rows in set (1 min 18.85 sec)

Notice that although the planner sees li_pkey_quan_idx among the possible keys, it does not use it.

Run the Query:

mysql> select sum(li.l_extendedprice) / 7.0 as avg_yearly from lineitem li
    -> inner join part p on li.l_partkey = p.p_partkey
    -> inner join ( select l_partkey, 0.2 * avg(l_quantity) as quantity from lineitem group by l_partkey ) as quantities
    -> on li.l_partkey = quantities.l_partkey and li.l_quantity < quantities.quantity
    -> where p.p_brand = 'Brand#33' and p.p_container = 'WRAP PACK';
+---------------+
| avg_yearly    |
+---------------+
| 312967.325714 | 
+---------------+
1 row in set (1 min 6.73 sec)

The resulting query time (66.73s) is an improvement over the default query (155.83s). Nonetheless, when run against the larger, 10G Scale Factor (SF) database, the query did not complete after several hours. I observed that the intermediate tables fit in memory at SF = 1G but are written to disk at SF = 10G, a well-known peril of intermediate tables.
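
To confirm whether the materialized derived table spilled to disk on your own system, one rough check is to compare MySQL's temporary-table counters before and after running the query. This is a sketch, not something from the original post, and it assumes the standard Created_tmp_tables / Created_tmp_disk_tables status counters account for the derived-table materialization in this build:

    -- Take a baseline of the session's temporary-table counters.
    show session status like 'Created_tmp%tables';

    -- ... run the rewritten Q17 here ...

    -- Compare against the baseline. A jump in Created_tmp_disk_tables (rather
    -- than just Created_tmp_tables) means the materialized derived table
    -- exceeded tmp_table_size / max_heap_table_size and was written to disk.
    show session status like 'Created_tmp%tables';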

Undeterred, I tried forcing the query planner to make use of li_pkey_quan_idx. I found that by telling the planner to use the index in both the 3rd and 4th lines of the plan, I could get substantial speed-ups:

Explain:

mysql> explain select sum(li.l_extendedprice) / 7.0 as avg_yearly from lineitem li use index (li_pkey_quan_idx)
    -> inner join part p on li.l_partkey = p.p_partkey
    -> inner join ( select l_partkey, 0.2 * avg(l_quantity) as quantity from lineitem use index (li_pkey_quan_idx) group by l_partkey ) as quantities
    -> on li.l_partkey = quantities.l_partkey and li.l_quantity < quantities.quantity
    -> where p.p_brand = 'Brand#33' and p.p_container = 'WRAP PACK';
+----+-------------+------------+--------+------------------+------------------+---------+----------------------+---------+-------------+
| id | select_type | table      | type   | possible_keys    | key              | key_len | ref                  | rows    | Extra       |
+----+-------------+------------+--------+------------------+------------------+---------+----------------------+---------+-------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL             | NULL             | NULL    | NULL                 |  200000 |             | 
|  1 | PRIMARY     | p          | eq_ref | PRIMARY          | PRIMARY          | 4       | quantities.l_partkey |       1 | Using where | 
|  1 | PRIMARY     | li         | ref    | li_pkey_quan_idx | li_pkey_quan_idx | 4       | quantities.l_partkey |   60012 | Using where | 
|  2 | DERIVED     | lineitem   | index  | NULL             | li_pkey_quan_idx | 11      | NULL                 | 6001215 | Using index | 
+----+-------------+------------+--------+------------------+------------------+---------+----------------------+---------+-------------+
4 rows in set (5.74 sec)

Query:

mysql> select sum(li.l_extendedprice) / 7.0 as avg_yearly from lineitem li use index (li_pkey_quan_idx)
    -> inner join part p on li.l_partkey = p.p_partkey
    -> inner join ( select l_partkey, 0.2 * avg(l_quantity) as quantity from lineitem use index (li_pkey_quan_idx) group by l_partkey ) as quantities
    -> on li.l_partkey = quantities.l_partkey and li.l_quantity < quantities.quantity
    -> where p.p_brand = 'Brand#33' and p.p_container = 'WRAP PACK';
+---------------+
| avg_yearly    |
+---------------+
| 312967.325714 | 
+---------------+
1 row in set (12.38 sec)

Now we're talking: 12.38s vs. 155.83s, roughly a 12x improvement. I'm sure this is closer to what Jay was hoping for. Emboldened, I ran the modified query against the SF = 10G database:

mysql> use tpch10G_tokudb;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> explain select sum(li.l_extendedprice) / 7.0 as avg_yearly from lineitem li use index (li_pkey_quan_idx)
    -> inner join part p on li.l_partkey = p.p_partkey
    -> inner join ( select l_partkey, 0.2 * avg(l_quantity) as quantity from lineitem use index (li_pkey_quan_idx) group by l_partkey ) as quantities
    -> on li.l_partkey = quantities.l_partkey and li.l_quantity < quantities.quantity
    -> where p.p_brand = 'Brand#33' and p.p_container = 'WRAP PACK';
+----+-------------+------------+--------+------------------+------------------+---------+----------------------+----------+-------------+
| id | select_type | table      | type   | possible_keys    | key              | key_len | ref                  | rows     | Extra       |
+----+-------------+------------+--------+------------------+------------------+---------+----------------------+----------+-------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL             | NULL             | NULL    | NULL                 |  2000000 |             | 
|  1 | PRIMARY     | p          | eq_ref | PRIMARY          | PRIMARY          | 4       | quantities.l_partkey |        1 | Using where | 
|  1 | PRIMARY     | li         | ref    | li_pkey_quan_idx | li_pkey_quan_idx | 4       | quantities.l_partkey |   599860 | Using where | 
|  2 | DERIVED     | lineitem   | index  | NULL             | li_pkey_quan_idx | 11      | NULL                 | 59986052 | Using index | 
+----+-------------+------------+--------+------------------+------------------+---------+----------------------+----------+-------------+
4 rows in set (57.61 sec)

mysql> select sum(li.l_extendedprice) / 7.0 as avg_yearly from lineitem li use index (li_pkey_quan_idx)
    -> inner join part p on li.l_partkey = p.p_partkey
    -> inner join ( select l_partkey, 0.2 * avg(l_quantity) as quantity from lineitem use index (li_pkey_quan_idx) group by l_partkey ) as quantities
    -> on li.l_partkey = quantities.l_partkey and li.l_quantity < quantities.quantity
    -> where p.p_brand = 'Brand#33' and p.p_container = 'WRAP PACK';
+----------------+
| avg_yearly     |
+----------------+
| 3253728.844286 | 
+----------------+
1 row in set (1 hour 7 min 8.55 sec)

So Jay's ideas offer a solution that does not involve clustering keys and produces a result in a manageable time frame. However, relative to clustering keys it is still about 4028s / 101s ≈ 40x slower (4028s being the 1 hour 7 min 8.55 sec above, and 101s the clustering-key result from the original post).
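
For completeness, the clustering-key numbers come from the original post. A minimal sketch of that approach, assuming TokuDB's CLUSTERING keyword and the same (l_partkey, l_quantity) columns (the exact index name and DDL in the earlier post may differ), looks like:

    -- Hypothetical index name. A TokuDB clustering index stores the full row
    -- alongside the key, so the range scan over (l_partkey, l_quantity) also
    -- covers l_extendedprice without extra point lookups.
    alter table lineitem add clustering index li_pkey_quan_clust (l_partkey, l_quantity);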

3 Responses to Another look at improving TPC-H-like queries – Q17

  1. Roland says:

    Hi ppl,

    “An alternate approach, offered in response to our original post,”

    Do you mean this:

    http://blogs.tokutek.com/tokuview/improving_tpc_h_like_queries_q17/

    ?

  2. Dave says:

    Roland,

    Yes – that’s what I meant.

    Dave

  3. Jay Pipes says:

    Hi Dave!

    Great post! Yes, that’s around the speedup I expected. It’s a shame that you have to nudge the optimizer into using the compound index, but I’m glad you persevered and added the index hint :)

    Cheers, and thanks for the post!

    Jay
