I am working on a schema for an analytics system which tracks usage times, and there is a need to see total usage time in a certain date range.
To give a simple example, this type of query would be run often:
select sum(diff_ms) from writetest_table where time_on > ("2015-07-13 15:11:56");
This query typically takes around 7 seconds on a table that is heavily populated. It has ~35 million rows, MyISAM on MySQL running on Amazon RDS (db.m3.xlarge).
Getting rid of the WHERE clause makes the query take only 4 seconds, and adding a second clause (time_off> XXX) adds an additional 1.5 seconds, bringing the query time to 8.5 seconds.
Since I know these types of queries will be commonly done I would like to optimize things so they are faster, ideally below 5 seconds.
I started by adding an index on time_on, and though that drastically sped up a WHERE "=" query, it had no effect on the ">" query. Is there a way to create an index that would speed up the WHERE ">" or "<" queries?
Or if there are any other suggestions the performance of this type of query, please let me know.
Note: I am using the "diff_ms" field as a denormalization step (it equals time_off - time_on) which improves performance of the aggregation by around 30%-40%.
I am creating the index with this command:
ALTER TABLE writetest_table ADD INDEX time_on (time_on) USING BTREE;
Running "explain" on the original query (with "time_on>") says time_on is a "possible_key" and the select_type is "SIMPLE". The "extra" column says "Using where", and "type" is "ALL". After the index was added, the table says "time_on" is "MUL" key type, which seems correct since the same time can be present twice.
Here is the table schema:
CREATE TABLE `writetest_table` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`sessionID` int(11) DEFAULT NULL,
`time_on` timestamp NULL DEFAULT NULL,
`time_off` timestamp NULL DEFAULT NULL,
`diff_ms` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `time_on` (`time_on`)
) ENGINE=MyISAM AUTO_INCREMENT=50410902 DEFAULT CHARSET=latin1;
UPDATE: I created the following index based on ypercube's response, but this increases the query time for the first query to around 17 seconds!
ALTER TABLE writetest_table ADD INDEX time_on__diff_ms__ix (time_on, diff_ms) ;
UPDATE 2: EXPLAIN output
mysql> explain select sum(diff_ms) from writetest_table where time_on > '2015-07-13 15:11:56';
+----+-------------+---------------------+-------+----------------------+----------------------+---------+------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------+-------+----------------------+----------------------+---------+------+----------+--------------------------+
| 1 | SIMPLE | writetest_table_old | index | time_on__diff_ms__ix | time_on__diff_ms__ix | 10 | NULL | 35831102 | Using where; Using index |
+----+-------------+---------------------+-------+----------------------+----------------------+---------+------+----------+--------------------------+
1 row in set (0.00 sec)
Update 3: result of requested query
mysql> SELECT time_on FROM writetest_table ORDER BY time_on LIMIT 1;
+---------------------+
| time_on |
+---------------------+
| 2015年07月13日 15:11:56 |
+---------------------+
1 row in set (0.01 sec)
2 Answers 2
I think I am starting to understand.
When I asked you to run
SELECT time_on FROM writetest_table ORDER BY time_on LIMIT 1;
You said it was 2015年07月13日 15:11:56
which you have in your WHERE
clause
When you did the query
select sum(diff_ms) from writetest_table;
It performed a full table scan of 35.8 million rows.
When you did the query
select sum(diff_ms) from writetest_table where time_on > ("2015-07-13 15:11:56");
It performed a full index scan of 35.8 million rows.
It totally makes sense that the query without the WHERE clause is faster. Why ?
The table scan would read 35.8 million rows in one linear pass.
The EXPLAIN on the query with the WHERE also turned up 35.8 million rows. An index scan would behave a little different. While the BTREE keeps the order of the keys, it is horrible for doing range scans. In your particular case, you are performing the worst possible range scan, which would have the same number of BTREE entries as there are rows in the table. MySQL has to traverse the BTREE pages (at least across the leaf nodes) to read the values. In addition, the time_on
column has to be compared along the way in the order dictated by the index. Therefore, non-leaf BTREE nodes must be traversed as well.
Please see my posts on BTREEs
Aug 06, 2013
: In MySQL if column X has unique values what's the difference between UNIQUE index and B-Tree indexJun 28, 2012
: Benefits of BTREE in MySQL
If the query was as of midnight today
select sum(diff_ms) from writetest_table where time_on >= ("2015-07-14 00:00:00");
or even noon today
select sum(diff_ms) from writetest_table where time_on >= ("2015-07-14 12:00:00");
it should take less time.
MORAL OF THE STORY : Do not use a WHERE clause that does an ordered range scan equal to the number of rows in the target table.
-
My only issue is how to go from here. I did a query with a date that resulted in only 1 million rows filtered and the sum took only 1 second. But occasionally I may have to do aggregate sums across most of the data. Any suggestions how to handle this? I was hoping that MySQL would be smart enough to know when to use the index and when not too, but I guess it doesn't have enough information in this case.Locksleyu– Locksleyu2015年07月14日 20:04:01 +00:00Commented Jul 14, 2015 at 20:04
-
I really wish there was some sort of index that was organized to make WHERE clauses specifying date ranges fast, that seems like it would be technically possible to implement, but I guess it's not supported.Locksleyu– Locksleyu2015年07月14日 20:04:59 +00:00Commented Jul 14, 2015 at 20:04
-
You have way too much data in such a short range. No WHERE clause can ever be compensated. Why ? It's not the index that's the problem. It's the MySQL Query Optimizer's Opinion of the Index. When you start accumulating much more data (let's say about two weeks worth), the index stats should level off and you should see a performance improvement. Just don't do full index scans.RolandoMySQLDBA– RolandoMySQLDBA2015年07月14日 20:19:18 +00:00Commented Jul 14, 2015 at 20:19
For the specific query:
select sum(diff_ms)
from writetest_table
where time_on > '2015-07-13 15:11:56' ; -- use single quotes, not double
an index on (time_on, diff_ms)
would be the best option. So, if the query runs often enough or its efficiency is crucial to your application, add this index:
ALTER TABLE writetest_table
ADD INDEX time_on__diff_ms__ix -- pick a name for the index
(time_on, diff_ms) ;
(Not related to the question)
And really, change the table's engine to InnoDB. It's 2015 and MyISAM's funeral was some years ago.
(/rant)
-
I created the exact index you suggested and then ran the exact query you mentioned first in your response, but the time is now much worse, taking around 17 seconds consistently (I tried several times).Locksleyu– Locksleyu2015年07月14日 14:45:18 +00:00Commented Jul 14, 2015 at 14:45
-
I have no idea what is causing it. In case it matters, there are only 3671 distinct values of time_on in the table (this is due to how my test script is populating data).Locksleyu– Locksleyu2015年07月14日 14:56:31 +00:00Commented Jul 14, 2015 at 14:56
-
You should do three(3) things: 1. run
ALTER TABLE writetest_table DROP INDEX time_on;
, 2) runANALYZE TABLE writetest_table;
, and 3) rerun the query. Does the time go back to 7 seconds ?RolandoMySQLDBA– RolandoMySQLDBA2015年07月14日 15:05:39 +00:00Commented Jul 14, 2015 at 15:05 -
1You should also run
EXPLAIN select sum(diff_ms) from writetest_table where time_on > ("2015年07月13日 15:11:56");
. Is the new index being used ? If it is not being used I would say it is your key population, esepcially if your earliest time_on is only a few days ago.As the number of rows increase with more distinct days, the key distribution should level off and the EXPLAIN should be better.RolandoMySQLDBA– RolandoMySQLDBA2015年07月14日 15:09:31 +00:00Commented Jul 14, 2015 at 15:09 -
RolandoMySQLDBA - I tried your three steps, and yes the time goes back to 7 seconds. I did the explain and it says the index is being used. I am still clueless why adding an index like this could make performance over 2x as bad.Locksleyu– Locksleyu2015年07月14日 16:35:42 +00:00Commented Jul 14, 2015 at 16:35
Explore related questions
See similar questions with these tags.
time_on
anddiff_ms
)? What happens if you add in the queryWHERE ... AND diff_ms IS NOT NULL
?SELECT COUNT(*), COUNT(diff_ms) FROM writetest_table;
writetest_table_old
" while the query hasfrom writetest_table
. Is that a typo or you run the query in different table?