8

I am working on a schema for an analytics system which tracks usage times, and there is a need to see total usage time in a certain date range.

To give a simple example, this type of query would be run often:

select sum(diff_ms) from writetest_table where time_on > ("2015-07-13 15:11:56");

This query typically takes around 7 seconds on a table that is heavily populated. It has ~35 million rows, MyISAM on MySQL running on Amazon RDS (db.m3.xlarge).

Getting rid of the WHERE clause makes the query take only 4 seconds, and adding a second clause (time_off> XXX) adds an additional 1.5 seconds, bringing the query time to 8.5 seconds.

Since I know these types of queries will be commonly done I would like to optimize things so they are faster, ideally below 5 seconds.

I started by adding an index on time_on, and though that drastically sped up a WHERE "=" query, it had no effect on the ">" query. Is there a way to create an index that would speed up the WHERE ">" or "<" queries?

Or if there are any other suggestions the performance of this type of query, please let me know.

Note: I am using the "diff_ms" field as a denormalization step (it equals time_off - time_on) which improves performance of the aggregation by around 30%-40%.

I am creating the index with this command:

ALTER TABLE writetest_table ADD INDEX time_on (time_on) USING BTREE;

Running "explain" on the original query (with "time_on>") says time_on is a "possible_key" and the select_type is "SIMPLE". The "extra" column says "Using where", and "type" is "ALL". After the index was added, the table says "time_on" is "MUL" key type, which seems correct since the same time can be present twice.

Here is the table schema:

CREATE TABLE `writetest_table` (
 `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
 `sessionID` int(11) DEFAULT NULL,
 `time_on` timestamp NULL DEFAULT NULL,
 `time_off` timestamp NULL DEFAULT NULL,
 `diff_ms` int(11) DEFAULT NULL,
 PRIMARY KEY (`id`),
 KEY `time_on` (`time_on`)
) ENGINE=MyISAM AUTO_INCREMENT=50410902 DEFAULT CHARSET=latin1;

UPDATE: I created the following index based on ypercube's response, but this increases the query time for the first query to around 17 seconds!

ALTER TABLE writetest_table ADD INDEX time_on__diff_ms__ix (time_on, diff_ms) ;

UPDATE 2: EXPLAIN output

mysql> explain select sum(diff_ms) from writetest_table where time_on > '2015-07-13 15:11:56';
+----+-------------+---------------------+-------+----------------------+----------------------+---------+------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------+-------+----------------------+----------------------+---------+------+----------+--------------------------+
| 1 | SIMPLE | writetest_table_old | index | time_on__diff_ms__ix | time_on__diff_ms__ix | 10 | NULL | 35831102 | Using where; Using index |
+----+-------------+---------------------+-------+----------------------+----------------------+---------+------+----------+--------------------------+
1 row in set (0.00 sec)

Update 3: result of requested query

mysql> SELECT time_on FROM writetest_table ORDER BY time_on LIMIT 1;
+---------------------+
| time_on |
+---------------------+
| 2015年07月13日 15:11:56 |
+---------------------+
1 row in set (0.01 sec)
RolandoMySQLDBA
185k34 gold badges327 silver badges541 bronze badges
asked Jul 14, 2015 at 12:42
3
  • Do you actually have nulls in these 2 columns (time_on and diff_ms)? What happens if you add in the query WHERE ... AND diff_ms IS NOT NULL ? Commented Jul 14, 2015 at 18:40
  • Can you please show us the output of SELECT COUNT(*), COUNT(diff_ms) FROM writetest_table; Commented Jul 14, 2015 at 20:10
  • Also the explain in your "Update 2" shows "table: writetest_table_old" while the query has from writetest_table. Is that a typo or you run the query in different table? Commented Jul 14, 2015 at 20:12

2 Answers 2

3

I think I am starting to understand.

When I asked you to run

SELECT time_on FROM writetest_table ORDER BY time_on LIMIT 1;

You said it was 2015年07月13日 15:11:56 which you have in your WHERE clause

When you did the query

select sum(diff_ms) from writetest_table;

It performed a full table scan of 35.8 million rows.

When you did the query

select sum(diff_ms) from writetest_table where time_on > ("2015-07-13 15:11:56");

It performed a full index scan of 35.8 million rows.

It totally makes sense that the query without the WHERE clause is faster. Why ?

The table scan would read 35.8 million rows in one linear pass.

The EXPLAIN on the query with the WHERE also turned up 35.8 million rows. An index scan would behave a little different. While the BTREE keeps the order of the keys, it is horrible for doing range scans. In your particular case, you are performing the worst possible range scan, which would have the same number of BTREE entries as there are rows in the table. MySQL has to traverse the BTREE pages (at least across the leaf nodes) to read the values. In addition, the time_on column has to be compared along the way in the order dictated by the index. Therefore, non-leaf BTREE nodes must be traversed as well.

Please see my posts on BTREEs

If the query was as of midnight today

select sum(diff_ms) from writetest_table where time_on >= ("2015-07-14 00:00:00");

or even noon today

select sum(diff_ms) from writetest_table where time_on >= ("2015-07-14 12:00:00");

it should take less time.

MORAL OF THE STORY : Do not use a WHERE clause that does an ordered range scan equal to the number of rows in the target table.

answered Jul 14, 2015 at 18:22
3
  • My only issue is how to go from here. I did a query with a date that resulted in only 1 million rows filtered and the sum took only 1 second. But occasionally I may have to do aggregate sums across most of the data. Any suggestions how to handle this? I was hoping that MySQL would be smart enough to know when to use the index and when not too, but I guess it doesn't have enough information in this case. Commented Jul 14, 2015 at 20:04
  • I really wish there was some sort of index that was organized to make WHERE clauses specifying date ranges fast, that seems like it would be technically possible to implement, but I guess it's not supported. Commented Jul 14, 2015 at 20:04
  • You have way too much data in such a short range. No WHERE clause can ever be compensated. Why ? It's not the index that's the problem. It's the MySQL Query Optimizer's Opinion of the Index. When you start accumulating much more data (let's say about two weeks worth), the index stats should level off and you should see a performance improvement. Just don't do full index scans. Commented Jul 14, 2015 at 20:19
4

For the specific query:

select sum(diff_ms) 
from writetest_table 
where time_on > '2015-07-13 15:11:56' ; -- use single quotes, not double

an index on (time_on, diff_ms) would be the best option. So, if the query runs often enough or its efficiency is crucial to your application, add this index:

ALTER TABLE writetest_table 
 ADD INDEX time_on__diff_ms__ix -- pick a name for the index
 (time_on, diff_ms) ;

(Not related to the question)
And really, change the table's engine to InnoDB. It's 2015 and MyISAM's funeral was some years ago.
(/rant)

answered Jul 14, 2015 at 14:05
7
  • I created the exact index you suggested and then ran the exact query you mentioned first in your response, but the time is now much worse, taking around 17 seconds consistently (I tried several times). Commented Jul 14, 2015 at 14:45
  • I have no idea what is causing it. In case it matters, there are only 3671 distinct values of time_on in the table (this is due to how my test script is populating data). Commented Jul 14, 2015 at 14:56
  • You should do three(3) things: 1. run ALTER TABLE writetest_table DROP INDEX time_on;, 2) run ANALYZE TABLE writetest_table;, and 3) rerun the query. Does the time go back to 7 seconds ? Commented Jul 14, 2015 at 15:05
  • 1
    You should also run EXPLAIN select sum(diff_ms) from writetest_table where time_on > ("2015年07月13日 15:11:56");. Is the new index being used ? If it is not being used I would say it is your key population, esepcially if your earliest time_on is only a few days ago.As the number of rows increase with more distinct days, the key distribution should level off and the EXPLAIN should be better. Commented Jul 14, 2015 at 15:09
  • RolandoMySQLDBA - I tried your three steps, and yes the time goes back to 7 seconds. I did the explain and it says the index is being used. I am still clueless why adding an index like this could make performance over 2x as bad. Commented Jul 14, 2015 at 16:35

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.