I have a table named 'customer' with the following structure and indexes:
DESCRIBE customer;
+--------------------+---------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+--------------------+---------------+------+-----+-------------------+-----------------------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| email | varchar(70) | NO | UNI | NULL | |
| orders | int(11) | YES | | NULL | |
| country | text | YES | | NULL | |
| last_date_order | datetime | YES | | NULL | |
| first_date_order | datetime | YES | | NULL | |
| name | varchar(100) | YES | | NULL | |
| lastname | varchar(100) | YES | | NULL | |
| male | tinyint(1) | YES | | NULL | |
| gender_probability | decimal(10,2) | YES | | NULL | |
| phone | varchar(20) | YES | | NULL | |
| phone_formatted | varchar(20) | YES | | NULL | |
| created_at | timestamp | YES | | NULL | |
| updated_at | timestamp | NO | MUL | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
+--------------------+---------------+------+-----+-------------------+-----------------------------+
SHOW INDEX FROM customer;
+----------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| customer | 0 | PRIMARY | 1 | id | A | 1899746 | NULL | NULL | | BTREE | | |
| customer | 0 | email | 1 | email | A | 1899746 | NULL | NULL | | BTREE | | |
| customer | 1 | updated_at | 1 | updated_at | A | 200 | NULL | NULL | | BTREE | | |
+----------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
I can execute this query, which takes 15 seconds approximately:
SELECT * FROM customer WHERE updated_at >= '2018-02-16 00:00:00';
931817 rows in set (14.33 sec)
And this is the same query but adding an order by
:
SELECT * FROM customer WHERE updated_at >= '2018-02-16 00:00:00' ORDER BY updated_at ASC;
931817 rows in set (5 min 21.17 sec)
I think 5 mins is too slow, the explain for the query shows:
EXPLAIN SELECT * FROM customer WHERE updated_at >= '2018-02-16 00:00:00' ORDER BY updated_at ASC;
+----+-------------+----------+------+---------------+------+---------+------+---------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+------+---------------+------+---------+------+---------+-----------------------------+
| 1 | SIMPLE | customer | ALL | updated_at | NULL | NULL | NULL | 1899744 | Using where; Using filesort |
+----+-------------+----------+------+---------------+------+---------+------+---------+-----------------------------+
The profile says this:
SET profiling = 1;
SELECT * FROM customer WHERE updated_at >= '2018-02-16 00:00:00' ORDER BY updated_at ASC;
SHOW PROFILE;
SET profiling = 0;
+------------------------------+------------+
| Status | Duration |
+------------------------------+------------+
| Sending data | 0.036616 |
| Waiting for query cache lock | 0.000034 |
| Sending data | 0.088051 |
| Waiting for query cache lock | 0.000034 |
| Sending data | 0.064104 |
| Waiting for query cache lock | 0.000038 |
| Sending data | 0.045420 |
| Waiting for query cache lock | 0.000043 |
| Sending data | 0.035219 |
| Waiting for query cache lock | 0.000025 |
| Sending data | 0.049755 |
| Waiting for query cache lock | 0.000035 |
| Sending data | 0.069008 |
| Waiting for query cache lock | 0.000036 |
| Sending data | 0.068619 |
| Waiting for query cache lock | 0.000042 |
| Sending data | 0.067039 |
| Waiting for query cache lock | 0.000048 |
| Sending data | 0.051564 |
| Waiting for query cache lock | 0.000042 |
| Sending data | 0.052017 |
| Waiting for query cache lock | 0.000036 |
| Sending data | 0.049105 |
| Waiting for query cache lock | 0.000034 |
| Sending data | 0.060379 |
| Waiting for query cache lock | 0.000029 |
| Sending data | 0.118096 |
| Waiting for query cache lock | 0.000027 |
| Sending data | 0.046678 |
| Waiting for query cache lock | 0.000037 |
| Sending data | 0.058142 |
| Waiting for query cache lock | 0.000027 |
| Sending data | 0.137221 |
| Waiting for query cache lock | 0.000046 |
| Sending data | 0.061968 |
| Waiting for query cache lock | 0.000045 |
| Sending data | 0.054456 |
| Waiting for query cache lock | 0.000045 |
| Sending data | 0.054738 |
| Waiting for query cache lock | 0.000025 |
| Sending data | 0.073921 |
| Waiting for query cache lock | 0.000028 |
| Sending data | 0.040226 |
| Waiting for query cache lock | 0.000034 |
| Sending data | 0.036024 |
| Waiting for query cache lock | 0.000046 |
| Sending data | 0.063350 |
| Waiting for query cache lock | 0.000035 |
| Sending data | 0.044174 |
| Waiting for query cache lock | 0.000028 |
| Sending data | 0.057658 |
| Waiting for query cache lock | 0.000031 |
| Sending data | 0.081529 |
| Waiting for query cache lock | 0.000041 |
| Sending data | 0.074443 |
| Waiting for query cache lock | 0.000027 |
| Sending data | 0.058293 |
| Waiting for query cache lock | 0.000033 |
| Sending data | 0.061443 |
| Waiting for query cache lock | 0.000037 |
| Sending data | 0.075515 |
| Waiting for query cache lock | 0.000035 |
| Sending data | 0.075264 |
| Waiting for query cache lock | 0.000039 |
| Sending data | 0.049696 |
| Waiting for query cache lock | 0.000027 |
| Sending data | 0.052088 |
| Waiting for query cache lock | 0.000036 |
| Sending data | 0.049119 |
| Waiting for query cache lock | 0.000039 |
| Sending data | 0.053930 |
| Waiting for query cache lock | 0.000028 |
| Sending data | 0.064894 |
| Waiting for query cache lock | 0.000039 |
| Sending data | 0.046105 |
| Waiting for query cache lock | 0.000027 |
| Sending data | 0.061840 |
| Waiting for query cache lock | 0.000033 |
| Sending data | 0.065950 |
| Waiting for query cache lock | 0.000038 |
| Sending data | 0.053379 |
| Waiting for query cache lock | 0.000028 |
| Sending data | 0.034599 |
| Waiting for query cache lock | 0.000027 |
| Sending data | 0.046371 |
| Waiting for query cache lock | 0.000033 |
| Sending data | 0.039776 |
| Waiting for query cache lock | 0.000035 |
| Sending data | 0.046480 |
| Waiting for query cache lock | 0.000040 |
| Sending data | 0.039515 |
| Waiting for query cache lock | 0.000036 |
| Sending data | 405.206648 |
| end | 0.000040 |
| query end | 0.000011 |
| closing tables | 0.000017 |
| freeing items | 0.000036 |
| logging slow query | 0.000006 |
| logging slow query | 0.000011 |
| cleaning up | 0.000006 |
+------------------------------+------------+
100 rows in set (0.01 sec)
I would appreciate any thoughts on why the order by
impacts the performance so much? 15s to 5mins+ is a large jump.
EXPLAIN for the query of RolandoMySQLDBA
SELECT B.* FROM (SELECT id FROM customer WHERE updated_at >= '2018-02-16 00:00:00' ORDER BY updated_at ASC) A LEFT JOIN customer B USING (id);
932016 rows in set (6 min 6.39 sec)
+----+-------------+------------+--------+---------------+------------+---------+------+--------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------+------------+---------+------+--------+--------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 932025 | |
| 1 | PRIMARY | B | eq_ref | PRIMARY | PRIMARY | 4 | A.id | 1 | |
| 2 | DERIVED | customer | range | updated_at | updated_at | 4 | NULL | 949956 | Using where; Using index |
+----+-------------+------------+--------+---------------+------------+---------+------+--------+--------------------------+
3 Answers 3
Based on what you have in the question
1899746
rows in the table931817
rows>2018年02月16日 00:00:00
- Index on
updated_at
has 200 distinct values
The EXPLAIN plan shows a full table scan and a sort of a temp table which has all columns. This makes sense since the WHERE clause sees 49.0496% of the table rows. It is easier to do a full table than figure out from the key distribution that accessing the index also requires accessing the table. This explains why the EXPLAIN plan did not choose the updated_at
index.
The distribution of key values in the updated_at
index can be seen by running this
SELECT COUNT(1) rowcount,updated_at FROM customer GROUP BY updated_at;
You could use the index to collect the ids first and then join them later
SELECT B.* FROM
(SELECT id FROM customer
WHERE updated_at >= '2018-02-16 00:00:00'
ORDER BY updated_at ASC) A
LEFT JOIN customer B USING (id);
If the subquery collecting the ids chooses the index, then the whole query may be fast. Please run the EXPLAIN on this and see if this is the case.
-
The failing of this, and the original post, is the "bouncing back and forth between the index and the data.Rick James– Rick James2018年02月24日 01:21:01 +00:00Commented Feb 24, 2018 at 1:21
-
Newer versions of MySQL (and MariaDB) are permitted to ignore the
ORDER BY
in the derived table. The logic is that a derived table delivers a "set", which has no order.Rick James– Rick James2018年02月24日 01:25:56 +00:00Commented Feb 24, 2018 at 1:25
If this is a crucial part of your application, you could change the PRIMARY KEY
to start with updated_at
(and change the existing primary key to a UNIQUE
key/constraint):
-- table/index reconstruction
-- this will take some time
-- but only needs to be done once
ALTER TABLE customer
DROP PRIMARY KEY,
DROP INDEX updated_at,
ADD PRIMARY KEY (updated_at, id),
ADD UNIQUE INDEX id_uq (id) ;
The above change will make the clustering key of the table to be based on updated_at
, so there will be 2 benefits for the query:
- do only a partial table scan,
- skip the sorting part, as the rows will be already sorted the way you want.
-
For more efficiency, the index on
id
can be simplyINDEX
, notUNIQUE
.Rick James– Rick James2018年02月24日 01:19:50 +00:00Commented Feb 24, 2018 at 1:19
Some explanations...
SELECT * ... WHERE ...
The Optimizer sees that much of the table is needed, so it simply reads the data, ignoring the index.
SELECT * ... WHERE ... ORDER BY ...
Now the Optimizer has a touch choice -- (1) Scan the table, then sort the million rows. (2) Scan the index (to avoid the sort), but have to bounce between the index and the data. Either way is costly.
SELECT ... FROM ( SELECT id ... ) JOIN ...
This just makes things worse because it now has to build a temp table with the million ids. Nothing saved; extra stuff added.
PRIMARY KEY(updated_at, id)
By including id
, the PK is unique (a requirement). Assuming that the table is InnoDB, by starting the PK with updated_at
, the WHERE
and ORDER BY
are both satisfied by a "range" scan of the data. And only part (about half) of the table needs to be scanned.
Profiling is essentially useless. 99% of the time is spent in the meaningless "Sending data" or one or two other cryptic phases.
One thing that may help some formulations: TEXT
is overkill for country
; change it to a suitably short VARCHAR
. That will allow for a slightly faster way of dealing with any temp table.
Explore related questions
See similar questions with these tags.
SELECT *
necessary for your query, or can you get by with fewer columns? If so, which columns do you actually need?