SQL Clause is coming to town
https://ift.tt/2WSmK90
Olya Kudriavtseva has an ugly christmas sweater:
“He’s making a table. He’s sorting it twice. SELECT * FROM contacts WHERE behavior = “nice”; SQL Clause is coming town! (buy here)
I mean, except for the fact that sorting something twice is TERRIBLY optimized
So how bad is this? Let’s find out.
Some test data
We are defining a table santa
, where we store peoples names (GDPR, EU Regulation 2016/679 applies!), their behavior (naughty or nice), their age, their location, and their wishlist items.
We are also writing some code to generate data (to evade GDPR, we are using randomly generated test data):
The full data generator is available as santa.py. Note that the data generator there defines more indexes – see below.
In our example we generate one million rows, and assume a general niceness of 0.9 (90% of the children are nice). Also, all of our children have 64 characters long names, a single 64 characters long wish, a random age, and are equidistributed on a perfect sphere.
Our real planet is not a perfect sphere, and also not many people live in the Pacific Ocean. Also, not many children have 64 character names.
Sorting it twice
How do you even sort the data twice? Now, assuming we sort by name, we can run an increasingly deeply nested subquery:
Out of 1 million children, we have around 900k nice children. No indexes can be used to resolve the query.
Let’s order by name, using a subquery:
We can already see that the MySQL 8 optimizer recognizes that this subquery can be merged with the inner query, and does this.
This can be done multiple times, but the optimizer handles this just fine:
We can see using filesort
, so while we ask for the query result to be sorted by name twice, it is actually only sorted once.
No sorting at all
We can improve on this, using a covering index in appropriate order:
Having done this, we now see that we lost the using filesort
altogether:
The query is now annotated using index
, which means that all data we ask for is present in the (covering) index behavior_name
, and is stored in sort order. That means the data is physically stored and read in sort order and no actual sorting has to be done on read – despite us asking for sorting, twice.
Hidden ‘SELECT *’ and Index Condition Pushdown
In the example above, we have been asking for s.name
and t.name
only, and because the name is part of the index, using index
is shown to indicate use of a covering index. We do not actually go to the table to generate the result set, we are using the index only.
Now, if we were to ask for t.*
in the middle subquery, what will happen?
In the Code 1003 Note
we still see the exact same reconstituted query, but as can be seen in the plan annotastions, the internal handling changes – so the optimizer has not been working on this query at all times, but on some intermediary representation.
The ‘using index condition’ annotation points to Index Condition Pushdown Optimization being used. In our example, this is not good.
Worse than sorting: Selectivity
The column we select on is a column with a cardinality of 2: behavior
can be either naughty
or nice
. That means, in an equidistribution, around half of the values are naughty
, the other half is nice
.
Data from disk is read in pages of 16 KB. If one row in a page matches, the entire page has to be read from disk. In our example, we have a row length of around 200 Byte, so we end up with 75-80 records per page. Half of them will be nice
, so with an average of around 40 nice
records per page, we will very likely have to read all pages from disk anyway.
Using the index will not decrease the amount of data read from disk at all. In fact we will have to read the index pages on top of the data pages, so using an index on a low cardinality column has the potential of making the situation slightly worse than even a full table scan.
Generally speaking, defining an index on a low cardinality column is usually not helpful – if there are 10 or fewer values, benchmark carefully and decide, or just don’t define an index.
In our case, the index is not even equidistributed, but biased to 90% nice
. We end up with mostly nice
records, so we can guarantee that all data pages will be read for the SQL SELECT * FROM santa WHERE behavior = "nice"
, and the index usage will not be contributing in any positive way.
We could try to improve the query by adding conditions to make it more selective. For example, we could ask for people close to our current position, using an RTREE index such as this:
The ALTER
defines a spatial index (an RTREE), which can help to speed up coordinate queries.
The SET
defines a coordinate rectangle around our current position (supposedly 15/15).
We then use the mbrcovers()
function to find all points loc
that are covered by the @rect
. It seems to be somewhat complicated to get MySQL to actually use the index, but I have not been investigating deeply.
If we added an ORDER BY name
here, we would see using filesort
again, because data is retrieved in RTREE order, if the index loc
is used, but we want output in name
order.
Conclusion
- The Santa query is inefficient, but likely sorting twice is not the root cause for that.
- The optimizer will be able to merge the multiple sorts and be able to deliver the result with one or no sorting, depending on our index construction.
- The optimizer is not using the reconstituted query shown in the warning to plan the execution, and that is weird.
- Selectivity matters, especially for indices on low cardinality columns.
- Asking for all
nice
behaviors on anaughty
/nice
column is usually not benefitting from index usage. - Additional indexable conditions that improve selectivity can help, a lot.
- Asking for all
technology
via Planet MySQL https://ift.tt/2iO8Ob8
December 30, 2020 at 06:56AM