<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Long Index Keys</title>
	<atom:link href="http://www.tokutek.com/2009/06/long_index_keys/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.tokutek.com/2009/06/long_index_keys/</link>
	<description></description>
	<lastBuildDate>Thu, 02 Feb 2012 04:10:04 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
	<item>
		<title>By: Zardosht Kasheff</title>
		<link>http://www.tokutek.com/2009/06/long_index_keys/#comment-96</link>
		<dc:creator>Zardosht Kasheff</dc:creator>
		<pubDate>Thu, 04 Jun 2009 03:38:37 +0000</pubDate>
		<guid isPermaLink="false">http://long_index_keys#comment-96</guid>
		<description>Roland,

This is a problem a customer presented to us. I do not know the other scenarios in which this table is used, so I cannot comment on what the primary key should be.

Regardless, I agree with your point that tables should have primary keys defined.</description>
		<content:encoded><![CDATA[<p>Roland,</p>
<p>This is a problem a customer presented to us. I do not know the other scenarios in which this table is used, so I cannot comment on what the primary key should be.</p>
<p>Regardless, I agree with your point that tables should have primary keys defined.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Roland Bouman</title>
		<link>http://www.tokutek.com/2009/06/long_index_keys/#comment-99</link>
		<dc:creator>Roland Bouman</dc:creator>
		<pubDate>Thu, 04 Jun 2009 01:00:38 +0000</pubDate>
		<guid isPermaLink="false">http://long_index_keys#comment-99</guid>
		<description>Zardosht, 

thank you so much for the update. I&#039;m glad to see the query suggestions did a lot to improve performance. 

I am still wondering about the primary key for the atable table. In particular, I am wondering whether my assumption that it was  &#123;id, off, len, name&#125;  is correct. (I mean the conceptual pk - I didn&#039;t realize it at first, but you guys were correct to point out that name can&#039;t fit into a MyISAM index.)</description>
		<content:encoded><![CDATA[<p>Zardosht, </p>
<p>thank you so much for the update. I&#8217;m glad to see the query suggestions did a lot to improve performance. </p>
<p>I am still wondering about the primary key for the atable table. In particular, I am wondering whether my assumption that it was  &#123;id, off, len, name&#125;  is correct. (I mean the conceptual pk &#8211; I didn&#8217;t realize it at first, but you guys were correct to point out that name can&#8217;t fit into a MyISAM index.)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Zardosht Kasheff</title>
		<link>http://www.tokutek.com/2009/06/long_index_keys/#comment-98</link>
		<dc:creator>Zardosht Kasheff</dc:creator>
		<pubDate>Thu, 04 Jun 2009 00:56:04 +0000</pubDate>
		<guid isPermaLink="false">http://long_index_keys#comment-98</guid>
		<description>I agree with Steven that having a 1K &#8220;key&#8221; is somewhat dubious. But that&#039;s the way covering indexes are implemented in MySQL.

A richer syntax, in which some fields would be part of the key definition in a secondary index (in this case only (id, off, len) are needed to define the sort order) and other fields (name, in this case) are used to cover the query, would be cleaner, and you&#039;d end up with a smaller key.

There are other databases that allow for such secondary index definitions. I discussed this somewhat in a blog posting on clustering indexes.

So I think that in the MySQL case, long keys are the price one pays for covering indexes.</description>
		<content:encoded><![CDATA[<p>I agree with Steven that having a 1K &#8220;key&#8221; is somewhat dubious. But that&#8217;s the way covering indexes are implemented in MySQL.</p>
<p>A richer syntax, in which some fields would be part of the key definition in a secondary index (in this case only (id, off, len) are needed to define the sort order) and other fields (name, in this case) are used to cover the query, would be cleaner, and you&#8217;d end up with a smaller key.</p>
<p>There are other databases that allow for such secondary index definitions. I discussed this somewhat in a blog posting on clustering indexes.</p>
<p>So I think that in the MySQL case, long keys are the price one pays for covering indexes.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Zardosht Kasheff</title>
		<link>http://www.tokutek.com/2009/06/long_index_keys/#comment-97</link>
		<dc:creator>Zardosht Kasheff</dc:creator>
		<pubDate>Thu, 04 Jun 2009 00:54:37 +0000</pubDate>
		<guid isPermaLink="false">http://long_index_keys#comment-97</guid>
		<description>Ok, so the table got all screwy.  Here&#039;s the data in a more readable form:

O is for the original posting.
BR is for Bjorn/Roland.
S is for Steven.

Original Bjorn/Roland Steven 
MyISAM non-covering 
O: 17.149s  BR: 3.38s S: 9.02s 

TokuDB non-covering 
BR: 3.63s S: 9.16s
 
TokuDB covering 
O: 4.9s BR: 1.36s S: 7.02s

Not the easiest to read, and if I figure out how to post tables I will.</description>
		<content:encoded><![CDATA[<p>Ok, so the table got all screwy.  Here&#8217;s the data in a more readable form:</p>
<p>O is for the original posting.<br />
BR is for Bjorn/Roland.<br />
S is for Steven.</p>
<p>Original Bjorn/Roland Steven<br />
MyISAM non-covering<br />
O: 17.149s  BR: 3.38s S: 9.02s </p>
<p>TokuDB non-covering<br />
BR: 3.63s S: 9.16s</p>
<p>TokuDB covering<br />
O: 4.9s BR: 1.36s S: 7.02s</p>
<p>Not the easiest to read, and if I figure out how to post tables I will.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Zardosht Kasheff</title>
		<link>http://www.tokutek.com/2009/06/long_index_keys/#comment-100</link>
		<dc:creator>Zardosht Kasheff</dc:creator>
		<pubDate>Thu, 04 Jun 2009 00:05:27 +0000</pubDate>
		<guid isPermaLink="false">http://long_index_keys#comment-100</guid>
		<description>I tried the suggested query rewrites.  For each, I tried three experiments: a non-covering key of (id, off, len) for MyISAM, the same non-covering key for TokuDB, and a covering key of (id, off, len, name) for TokuDB.  The results are summarized in the following table:

                    Original        Bjorn/Roland    Steven
MyISAM non-covering    17.149s        3.38s        9.02s
TokuDB non-covering                    3.63s        9.16s
TokuDB covering        4.9s            1.36s        7.02s

A couple of notes: (1) Bjorn&#039;s and Roland&#039;s suggestions are almost identical, and they gave the same times.  (2) Steven reports that he gets the same running time with MyISAM both with and without the non-covering index defined above.  My results are very different:  MyISAM with no index takes 84s.  I don&#039;t know how to explain the discrepancy.

Conclusion:  (1) Query optimization makes a big difference in the run time.  (2) Covering indexes also can make a big difference.</description>
		<content:encoded><![CDATA[<p>I tried the suggested query rewrites.  For each, I tried three experiments: a non-covering key of (id, off, len) for MyISAM, the same non-covering key for TokuDB, and a covering key of (id, off, len, name) for TokuDB.  The results are summarized in the following table:</p>
<p>                    Original        Bjorn/Roland    Steven<br />
MyISAM non-covering    17.149s        3.38s        9.02s<br />
TokuDB non-covering                    3.63s        9.16s<br />
TokuDB covering        4.9s            1.36s        7.02s</p>
<p>A couple of notes: (1) Bjorn&#8217;s and Roland&#8217;s suggestions are almost identical, and they gave the same times.  (2) Steven reports that he gets the same running time with MyISAM both with and without the non-covering index defined above.  My results are very different:  MyISAM with no index takes 84s.  I don&#8217;t know how to explain the discrepancy.</p>
<p>Conclusion:  (1) Query optimization makes a big difference in the run time.  (2) Covering indexes also can make a big difference.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tom Hotchkiss</title>
		<link>http://www.tokutek.com/2009/06/long_index_keys/#comment-102</link>
		<dc:creator>Tom Hotchkiss</dc:creator>
		<pubDate>Wed, 03 Jun 2009 03:34:47 +0000</pubDate>
		<guid isPermaLink="false">http://long_index_keys#comment-102</guid>
		<description>Hey Roland, sorry for the quarantine - your comment is much appreciated and most definitely *not* SPAM.  Not sure why it got flagged, but it&#039;s up there now as the first comment.

Many thanks to everyone else for the great comments!  We&#039;re running tests on the actual data with the various suggestions and we&#039;ll post the results soon.</description>
		<content:encoded><![CDATA[<p>Hey Roland, sorry for the quarantine &#8211; your comment is much appreciated and most definitely *not* SPAM.  Not sure why it got flagged, but it&#8217;s up there now as the first comment.</p>
<p>Many thanks to everyone else for the great comments!  We&#8217;re running tests on the actual data with the various suggestions and we&#8217;ll post the results soon.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Roland Bouman</title>
		<link>http://www.tokutek.com/2009/06/long_index_keys/#comment-101</link>
		<dc:creator>Roland Bouman</dc:creator>
		<pubDate>Wed, 03 Jun 2009 01:59:12 +0000</pubDate>
		<guid isPermaLink="false">http://long_index_keys#comment-101</guid>
		<description>Hey guys, are you still hoarding my comment? When I posted it I was told it was awaiting moderation because it was labelled as spam.</description>
		<content:encoded><![CDATA[<p>Hey guys, are you still hoarding my comment? When I posted it I was told it was awaiting moderation because it was labelled as spam.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Steven</title>
		<link>http://www.tokutek.com/2009/06/long_index_keys/#comment-103</link>
		<dc:creator>Steven</dc:creator>
		<pubDate>Tue, 02 Jun 2009 22:36:52 +0000</pubDate>
		<guid isPermaLink="false">http://long_index_keys#comment-103</guid>
		<description>Swapping out MySQL engines is a powerful concept -- but horribly misapplied when being used to mask bad design and/or lack of SQL knowledge. I can think of several instances where I&#039;ve wanted to index a 1k key -- but none that I could justify.

The design problems start in the base table, but sometimes those issues are harder to fix in the short term. Here a quick fix is to think about what the query actually has to do in order to produce the required result set. Most importantly we don&#039;t need to join each and every combination - only the unique combinations. The following query works well with or without an additional index. Slightly better with.

SELECT a.name,
       Count(pd1.name) AS CountOfe2
 FROM
    atable a
 inner join (select
        name,id,offset,len
     from
        atable
     group by
        name,id,offset,len
      ) pd1
ON  (a.id = pd1.id)
AND (a.offset = pd1.offset)
AND (a.len = pd1.len)
WHERE ((a.name&lt;&gt;pd1.name))
group by a.name


NOTES:
1) I had to fabricate a dataset, and different datasets produce different times. I had to come up with a moderately poor dataset, since several sets caused MySQL to take in excess of 5 minutes.
2) The original query took 174 seconds on my Windows laptop (dual-core 2 GHz with 2 GB of memory) running MySQL 5.1.
3) The modified query took .984 seconds without a composite index and .928 seconds with it.
4) SQL Server managed 34 seconds for the original query without a composite index and less than a second for all other combinations.
5) Both the original query and the modified query return the same resultset.</description>
		<content:encoded><![CDATA[<p>Swapping out MySQL engines is a powerful concept &#8212; but horribly misapplied when being used to mask bad design and/or lack of SQL knowledge. I can think of several instances where I&#8217;ve wanted to index a 1k key &#8212; but none that I could justify.</p>
<p>The design problems start in the base table, but sometimes those issues are harder to fix in the short term. Here a quick fix is to think about what the query actually has to do in order to produce the required result set. Most importantly we don&#8217;t need to join each and every combination &#8211; only the unique combinations. The following query works well with or without an additional index. Slightly better with.</p>
<p>SELECT a.name,<br />
       Count(pd1.name) AS CountOfe2<br />
 FROM<br />
    atable a<br />
 inner join (select<br />
        name,id,offset,len<br />
     from<br />
        atable<br />
     group by<br />
        name,id,offset,len<br />
      ) pd1<br />
ON  (a.id = pd1.id)<br />
AND (a.offset = pd1.offset)<br />
AND (a.len = pd1.len)<br />
WHERE ((a.name&lt;&gt;pd1.name))<br />
group by a.name</p>
<p>NOTES:<br />
1) I had to fabricate a dataset, and different datasets produce different times. I had to come up with a moderately poor dataset, since several sets caused MySQL to take in excess of 5 minutes.<br />
2) The original query took 174 seconds on my Windows laptop (dual-core 2 GHz with 2 GB of memory) running MySQL 5.1.<br />
3) The modified query took .984 seconds without a composite index and .928 seconds with it.<br />
4) SQL Server managed 34 seconds for the original query without a composite index and less than a second for all other combinations.<br />
5) Both the original query and the modified query return the same resultset.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: PaulM</title>
		<link>http://www.tokutek.com/2009/06/long_index_keys/#comment-104</link>
		<dc:creator>PaulM</dc:creator>
		<pubDate>Tue, 02 Jun 2009 09:32:45 +0000</pubDate>
		<guid isPermaLink="false">http://long_index_keys#comment-104</guid>
		<description>Got to love blogs and comments eh. You have solid comments. Please retry if you have time and update the post.
The other issue with your rewrite is that you have split 1 SQL statement into 5 pieces of SQL... what happens if the data changes?

The original SQL, with its nested scalar queries, looks to have been written by someone with a background in a database whose optimizer can handle this better.

When I see this type of join-the-table-on-itself stuff, it screams poor design. You shouldn&#039;t need to de-dupe atable.name in this manner.

Can I guess, this is reporting unique names and counts from a coordinate/tree/matrix table?

Have Fun</description>
		<content:encoded><![CDATA[<p>Got to love blogs and comments eh. You have solid comments. Please retry if you have time and update the post.<br />
The other issue with your rewrite is that you have split 1 SQL statement into 5 pieces of SQL&#8230; what happens if the data changes?</p>
<p>The original SQL, with its nested scalar queries, looks to have been written by someone with a background in a database whose optimizer can handle this better.</p>
<p>When I see this type of join-the-table-on-itself stuff, it screams poor design. You shouldn&#8217;t need to de-dupe atable.name in this manner.</p>
<p>Can I guess, this is reporting unique names and counts from a coordinate/tree/matrix table?</p>
<p>Have Fun</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Björn</title>
		<link>http://www.tokutek.com/2009/06/long_index_keys/#comment-106</link>
		<dc:creator>Björn</dc:creator>
		<pubDate>Tue, 02 Jun 2009 05:50:05 +0000</pubDate>
		<guid isPermaLink="false">http://long_index_keys#comment-106</guid>
		<description>Hm, that should be equal to:

SELECT
    atable.name,
    COUNT(DISTINCT pd1.name) CountOfe2
FROM
    atable
INNER JOIN
   atable pd1 USING (id, off, len)
WHERE
   atable.name &lt;&gt; pd1.name
GROUP BY
  atable.name
ORDER BY
  CountOfe2 DESC;

I wonder, how does that compare to your solution for the customer&#039;s dataset?

With an index on (id, len, off) and some (probably bad) artificial data (22k rows), I get:

The original query: 17404 rows in set (34.01 sec)
The simple version: 17404 rows in set (1.40 sec)

So it&#039;s faster by a factor of 24.3 for me.

Using a bigger dataset (32k rows) I get:
The original query: 27460 rows in set (1 min 23.77 sec)
The simple query: 27460 rows in set (2.62 sec)

So we got a factor of about 32 now. That the gains increase with a growing dataset is probably also due to the original query getting pretty huge temporary tables (the test with 32k rows used almost 3GB worth of temp. tables here), while the simple query is pretty modest using well below 30MB for its temporary table (And I&#039;m on a slow RAID-1). At least for the random dataset I experimented with.

Interestingly, with my dataset, the original query doesn&#039;t see any performance improvement at all when adding the index on (id, off, len) compared to the table created by the DDL statement you gave. So I guess my testing dataset is pretty different from the actual data you&#039;re working with (who would have guessed ;-)).

Any chance to get some real numbers for the simple query? It would be interesting to see how that compares to your solution.

Of course, this is all assuming that the query you&#039;ve shown is close enough to the real thing to make the simple version applicable.</description>
		<content:encoded><![CDATA[<p>Hm, that should be equal to:</p>
<p>SELECT<br />
    atable.name,<br />
    COUNT(DISTINCT pd1.name) CountOfe2<br />
FROM<br />
    atable<br />
INNER JOIN<br />
   atable pd1 USING (id, off, len)<br />
WHERE<br />
   atable.name &lt;&gt; pd1.name<br />
GROUP BY<br />
  atable.name<br />
ORDER BY<br />
  CountOfe2 DESC;</p>
<p>I wonder, how does that compare to your solution for the customer&#8217;s dataset?</p>
<p>With an index on (id, len, off) and some (probably bad) artificial data (22k rows), I get:</p>
<p>The original query: 17404 rows in set (34.01 sec)<br />
The simple version: 17404 rows in set (1.40 sec)</p>
<p>So it&#8217;s faster by a factor of 24.3 for me.</p>
<p>Using a bigger dataset (32k rows) I get:<br />
The original query: 27460 rows in set (1 min 23.77 sec)<br />
The simple query: 27460 rows in set (2.62 sec)</p>
<p>So we got a factor of about 32 now. That the gains increase with a growing dataset is probably also due to the original query getting pretty huge temporary tables (the test with 32k rows used almost 3GB worth of temp. tables here), while the simple query is pretty modest using well below 30MB for its temporary table (And I&#8217;m on a slow RAID-1). At least for the random dataset I experimented with.</p>
<p>Interestingly, with my dataset, the original query doesn&#8217;t see any performance improvement at all when adding the index on (id, off, len) compared to the table created by the DDL statement you gave. So I guess my testing dataset is pretty different from the actual data you&#8217;re working with (who would have guessed <img src='http://www.tokutek.com/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> ).</p>
<p>Any chance to get some real numbers for the simple query? It would be interesting to see how that compares to your solution.</p>
<p>Of course, this is all assuming that the query you&#8217;ve shown is close enough to the real thing to make the simple version applicable.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

