Optimizing WHERE condition for TIMESTAMP field in MySQL SELECT statement

Question 1

I am working on a schema for an analytics system which tracks usage times, and there is a need to see total usage time in a certain date range.

To give a simple example, this type of query would be run often:

select sum(diff_ms) from writetest_table where time_on > ("2015-07-13 15:11:56");

This query typically takes around 7 seconds on a table that is heavily populated. It has ~35 million rows, MyISAM on MySQL running on Amazon RDS (db.m3.xlarge).

Getting rid of the WHERE clause makes the query take only 4 seconds, and adding a second clause (time_off> XXX) adds an additional 1.5 seconds, bringing the query time to 8.5 seconds.

Since I know these types of queries will be commonly done I would like to optimize things so they are faster, ideally below 5 seconds.

I started by adding an index on time_on, and though that drastically sped up a WHERE "=" query, it had no effect on the ">" query. Is there a way to create an index that would speed up the WHERE ">" or "<" queries?

Or if there are any other suggestions the performance of this type of query, please let me know.

Note: I am using the "diff_ms" field as a denormalization step (it equals time_off - time_on) which improves performance of the aggregation by around 30%-40%.

I am creating the index with this command:

ALTER TABLE writetest_table ADD INDEX time_on (time_on) USING BTREE;

Running "explain" on the original query (with "time_on>") says time_on is a "possible_key" and the select_type is "SIMPLE". The "extra" column says "Using where", and "type" is "ALL". After the index was added, the table says "time_on" is "MUL" key type, which seems correct since the same time can be present twice.

Here is the table schema:

CREATE TABLE `writetest_table` (
 `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
 `sessionID` int(11) DEFAULT NULL,
 `time_on` timestamp NULL DEFAULT NULL,
 `time_off` timestamp NULL DEFAULT NULL,
 `diff_ms` int(11) DEFAULT NULL,
 PRIMARY KEY (`id`),
 KEY `time_on` (`time_on`)
) ENGINE=MyISAM AUTO_INCREMENT=50410902 DEFAULT CHARSET=latin1;

UPDATE: I created the following index based on ypercube's response, but this increases the query time for the first query to around 17 seconds!

ALTER TABLE writetest_table ADD INDEX time_on__diff_ms__ix (time_on, diff_ms) ;

UPDATE 2: EXPLAIN output

mysql> explain select sum(diff_ms) from writetest_table where time_on > '2015-07-13 15:11:56';
+----+-------------+---------------------+-------+----------------------+----------------------+---------+------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------+-------+----------------------+----------------------+---------+------+----------+--------------------------+
| 1 | SIMPLE | writetest_table_old | index | time_on__diff_ms__ix | time_on__diff_ms__ix | 10 | NULL | 35831102 | Using where; Using index |
+----+-------------+---------------------+-------+----------------------+----------------------+---------+------+----------+--------------------------+
1 row in set (0.00 sec)

Update 3: result of requested query

mysql> SELECT time_on FROM writetest_table ORDER BY time_on LIMIT 1;
+---------------------+
| time_on |
+---------------------+
| 2015年07月13日 15:11:56 |
+---------------------+
1 row in set (0.01 sec)

Question 2

Do you actually have nulls in these 2 columns (time_on and diff_ms)? What happens if you add in the query WHERE ... AND diff_ms IS NOT NULL ?

Question 3

Can you please show us the output of SELECT COUNT(*), COUNT(diff_ms) FROM writetest_table;

Question 4

Also the explain in your "Update 2" shows "table: writetest_table_old" while the query has from writetest_table. Is that a typo or you run the query in different table?

Question 5

I think I am starting to understand.

When I asked you to run

SELECT time_on FROM writetest_table ORDER BY time_on LIMIT 1;

You said it was 2015年07月13日 15:11:56 which you have in your WHERE clause

When you did the query

select sum(diff_ms) from writetest_table;

It performed a full table scan of 35.8 million rows.

When you did the query

select sum(diff_ms) from writetest_table where time_on > ("2015-07-13 15:11:56");

It performed a full index scan of 35.8 million rows.

It totally makes sense that the query without the WHERE clause is faster. Why ?

The table scan would read 35.8 million rows in one linear pass.

The EXPLAIN on the query with the WHERE also turned up 35.8 million rows. An index scan would behave a little different. While the BTREE keeps the order of the keys, it is horrible for doing range scans. In your particular case, you are performing the worst possible range scan, which would have the same number of BTREE entries as there are rows in the table. MySQL has to traverse the BTREE pages (at least across the leaf nodes) to read the values. In addition, the time_on column has to be compared along the way in the order dictated by the index. Therefore, non-leaf BTREE nodes must be traversed as well.

Please see my posts on BTREEs

Aug 06, 2013 : In MySQL if column X has unique values what's the difference between UNIQUE index and B-Tree index
Jun 28, 2012 : Benefits of BTREE in MySQL

If the query was as of midnight today

select sum(diff_ms) from writetest_table where time_on >= ("2015-07-14 00:00:00");

or even noon today

select sum(diff_ms) from writetest_table where time_on >= ("2015-07-14 12:00:00");

it should take less time.

MORAL OF THE STORY : Do not use a WHERE clause that does an ordered range scan equal to the number of rows in the target table.

Question 6

My only issue is how to go from here. I did a query with a date that resulted in only 1 million rows filtered and the sum took only 1 second. But occasionally I may have to do aggregate sums across most of the data. Any suggestions how to handle this? I was hoping that MySQL would be smart enough to know when to use the index and when not too, but I guess it doesn't have enough information in this case.

Question 7

I really wish there was some sort of index that was organized to make WHERE clauses specifying date ranges fast, that seems like it would be technically possible to implement, but I guess it's not supported.

Question 8

You have way too much data in such a short range. No WHERE clause can ever be compensated. Why ? It's not the index that's the problem. It's the MySQL Query Optimizer's Opinion of the Index. When you start accumulating much more data (let's say about two weeks worth), the index stats should level off and you should see a performance improvement. Just don't do full index scans.

Question 9

For the specific query:

select sum(diff_ms) 
from writetest_table 
where time_on > '2015-07-13 15:11:56' ; -- use single quotes, not double

an index on (time_on, diff_ms) would be the best option. So, if the query runs often enough or its efficiency is crucial to your application, add this index:

ALTER TABLE writetest_table 
 ADD INDEX time_on__diff_ms__ix -- pick a name for the index
 (time_on, diff_ms) ;

_{(Not related to the question)

And really, change the table's engine to InnoDB. It's 2015 and MyISAM's funeral was some years ago.

(/rant)}

Question 10

I created the exact index you suggested and then ran the exact query you mentioned first in your response, but the time is now much worse, taking around 17 seconds consistently (I tried several times).

Question 11

I have no idea what is causing it. In case it matters, there are only 3671 distinct values of time_on in the table (this is due to how my test script is populating data).

Question 12

You should do three(3) things: 1. run ALTER TABLE writetest_table DROP INDEX time_on;, 2) run ANALYZE TABLE writetest_table;, and 3) rerun the query. Does the time go back to 7 seconds ?

Question 13

You should also run EXPLAIN select sum(diff_ms) from writetest_table where time_on > ("2015年07月13日 15:11:56");. Is the new index being used ? If it is not being used I would say it is your key population, esepcially if your earliest time_on is only a few days ago.As the number of rows increase with more distinct days, the key distribution should level off and the EXPLAIN should be better.

Question 14

RolandoMySQLDBA - I tried your three steps, and yes the time goes back to 7 seconds. I did the explain and it says the index is being used. I am still clueless why adding an index like this could make performance over 2x as bad.

score 3 · Accepted Answer · 2015-07-14 18:22:13Z

I think I am starting to understand.

When I asked you to run

SELECT time_on FROM writetest_table ORDER BY time_on LIMIT 1;

You said it was 2015年07月13日 15:11:56 which you have in your WHERE clause

When you did the query

select sum(diff_ms) from writetest_table;

It performed a full table scan of 35.8 million rows.

When you did the query

select sum(diff_ms) from writetest_table where time_on > ("2015-07-13 15:11:56");

It performed a full index scan of 35.8 million rows.

It totally makes sense that the query without the WHERE clause is faster. Why ?

The table scan would read 35.8 million rows in one linear pass.

The EXPLAIN on the query with the WHERE also turned up 35.8 million rows. An index scan would behave a little different. While the BTREE keeps the order of the keys, it is horrible for doing range scans. In your particular case, you are performing the worst possible range scan, which would have the same number of BTREE entries as there are rows in the table. MySQL has to traverse the BTREE pages (at least across the leaf nodes) to read the values. In addition, the time_on column has to be compared along the way in the order dictated by the index. Therefore, non-leaf BTREE nodes must be traversed as well.

Please see my posts on BTREEs

Aug 06, 2013 : In MySQL if column X has unique values what's the difference between UNIQUE index and B-Tree index
Jun 28, 2012 : Benefits of BTREE in MySQL

If the query was as of midnight today

select sum(diff_ms) from writetest_table where time_on >= ("2015-07-14 00:00:00");

or even noon today

select sum(diff_ms) from writetest_table where time_on >= ("2015-07-14 12:00:00");

it should take less time.

MORAL OF THE STORY : Do not use a WHERE clause that does an ordered range scan equal to the number of rows in the target table.

My only issue is how to go from here. I did a query with a date that resulted in only 1 million rows filtered and the sum took only 1 second. But occasionally I may have to do aggregate sums across most of the data. Any suggestions how to handle this? I was hoping that MySQL would be smart enough to know when to use the index and when not too, but I guess it doesn't have enough information in this case.
I really wish there was some sort of index that was organized to make WHERE clauses specifying date ranges fast, that seems like it would be technically possible to implement, but I guess it's not supported.
You have way too much data in such a short range. No WHERE clause can ever be compensated. Why ? It's not the index that's the problem. It's the MySQL Query Optimizer's Opinion of the Index. When you start accumulating much more data (let's say about two weeks worth), the index stats should level off and you should see a performance improvement. Just don't do full index scans.

Stack Exchange Network

Optimizing WHERE condition for TIMESTAMP field in MySQL SELECT statement

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Optimizing WHERE condition for TIMESTAMP field in MySQL SELECT statement

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions