DB: Amazon RDS MySQL (OS: Linux, 2 vCPU, Memory: 8GB)
I have a table with almost 14M rows of data.
CREATE TABLE `meterreadings` (
`Id` bigint(20) NOT NULL AUTO_INCREMENT,
`meterid` varchar(16) DEFAULT NULL,
`metervalue` int(11) DEFAULT NULL,
`date_time` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`Id`),
KEY `meterid` (`meterid`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;
As you can see, I use an index on meterid.
Another table stores device IDs (around 100 rows of data):
CREATE TABLE `devices` (
`Id` bigint(20) NOT NULL AUTO_INCREMENT,
`meterid` varchar(16) DEFAULT NULL,
`location` varchar(8) DEFAULT NULL,
PRIMARY KEY (`Id`),
UNIQUE KEY `meterid_UNIQUE` (`meterid`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;
To get 15-minute aggregated data, I use the query below:
SELECT AVG(metervalue) as value
, DATE_FORMAT(date_time, "%d %b %Y %H:%i") as label
FROM meterreadings
WHERE meterid IN (SELECT meterid from devices)
AND date_time BETWEEN '2018-07-23' AND '2018-07-24'
GROUP BY DATE(date_time), HOUR(date_time), MINUTE(date_time) DIV 15
ORDER BY date_time ASC;
Query performance is very bad - it takes around 12 seconds to execute, and causes a temporary spike in DB server usage as well.
EXPLAIN on this query returned this:
id  select_type  table          type   possible_keys   key             key_len  ref              rows  Extra
1   SIMPLE       devices        index  meterid_UNIQUE  meterid_UNIQUE  19       NULL             125   Using where; Using index; Using temporary; Using filesort
1   SIMPLE       meterreadings  ref    meterid         meterid         19       devices.meterid  322   Using where
I dropped the index on meterreadings and, surprisingly, the query performance is better - about 6 seconds now. I am still wondering why.
EXPLAIN on the query after dropping the index
id  select_type  table          type  possible_keys   key             key_len  ref                    rows      Extra
1   SIMPLE       meterreadings  ALL   NULL            NULL            NULL     NULL                   14580167  Using where; Using temporary; Using filesort
1   SIMPLE       devices        ref   meterid_UNIQUE  meterid_UNIQUE  19       meterreadings.meterid  1         Using index
I am currently running the query on the table without the index - is there a way I can optimize the table / query to make the operation faster (like a composite index on two columns)?
[The table is growing by around 40 rows per second.]
3 Answers
You should play with it a bit, because it might not be clear beforehand what solution will produce the best results.
A few points to consider:
- It is very likely that an index on the date_time column will serve you better, as it is more selective for this query.
- Composite indexes are usually a good idea, but please make sure to choose the column order correctly: (date_time, meterid) vs. (meterid, date_time). In most cases it makes more sense to leave columns with dense values (e.g. dates, floats) at the end, as any column in the index following them is unlikely to have much effect. (Try (meterid, date_time) for an index.)
- Subselects might force the optimizer to use a specific plan. Try converting it into a join if possible (see the sketch after this list).
- Why have WHERE meterid IN (SELECT meterid FROM devices) at all? Aren't all the devices represented in both tables?
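To illustrate the last two points, here is one way the subselect could be rewritten as a join (a sketch only; it assumes devices really does list every meterid you want included, and it keeps the rest of your query as-is apart from the date-range fix):

SELECT AVG(mr.metervalue) AS value,
       DATE_FORMAT(mr.date_time, "%d %b %Y %H:%i") AS label
FROM meterreadings mr
JOIN devices d ON d.meterid = mr.meterid        -- replaces the IN (SELECT ...) subselect
WHERE mr.date_time >= '2018-07-23'
  AND mr.date_time <  '2018-07-23' + INTERVAL 1 DAY
GROUP BY DATE(mr.date_time), HOUR(mr.date_time), MINUTE(mr.date_time) DIV 15
ORDER BY mr.date_time ASC;

Because devices.meterid is UNIQUE, the join cannot duplicate rows, so the AVG result is the same as with the subselect.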
In devices you have
PRIMARY KEY (`Id`),
UNIQUE KEY `meterid_UNIQUE` (`meterid`)
Get rid of Id and change to
PRIMARY KEY(meterid)
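A sketch of that change (it assumes nothing else references devices.Id, e.g. via a foreign key):

ALTER TABLE devices
    DROP COLUMN Id,                        -- also removes the old PRIMARY KEY
    DROP KEY meterid_UNIQUE,               -- redundant once meterid is the PK
    MODIFY meterid varchar(16) NOT NULL,   -- a PRIMARY KEY column must be NOT NULL
    ADD PRIMARY KEY (meterid);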
To be clean you should use 15-minute intervals everywhere, not
DATE_FORMAT(date_time, "%d %b %Y %H:%i") as label
Instead, consider
FLOOR(UNIX_TIMESTAMP(date_time) / (15*60))
which can be converted back via
FROM_UNIXTIME(... * (15*60))
and formatted. For example:
mysql> SELECT NOW(), DATE_FORMAT(
FROM_UNIXTIME(
FLOOR(UNIX_TIMESTAMP(now()) / (15*60))
*(15*60)
), "%d %b %Y %H:%i") as label;
+---------------------+-------------------+
| NOW()               | label             |
+---------------------+-------------------+
| 2018-08-21 13:43:42 | 21 Aug 2018 13:30 |
+---------------------+-------------------+
SELECT AVG(metervalue) as value,
DATE_FORMAT(
FROM_UNIXTIME(
FLOOR(UNIX_TIMESTAMP(date_time) / (15*60))
*(15*60)
), "%d %b %Y %H:%i") as label
FROM meterreadings
WHERE date_time >= '2018-07-23'
AND date_time < '2018-07-23' + INTERVAL 1 DAY -- bug fix
GROUP BY meterid, -- Don't you want this, too?
label
ORDER BY meterid,
label;
I fixed the case where you were including two midnights in one day.
For more efficiency, change
PRIMARY KEY (`Id`),
KEY `meterid` (`meterid`)
to this if your query is the main one
PRIMARY KEY (date_time, Id), -- to make it a range scan
INDEX(id) -- to keep AUTO_INCREMENT happy
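A sketch of applying that change (rebuilding a 14M-row table takes a while, and this assumes date_time never contains NULL):

ALTER TABLE meterreadings
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (date_time, Id),   -- clusters rows by time, so the date filter becomes a range scan
    ADD INDEX (Id);                    -- the AUTO_INCREMENT column must still lead some index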
If you can be sure that there are never two readings for a meter in the same second, get rid of Id
and have
PRIMARY KEY(date_time, meterid)
(Again, this may not be optimal, depending on what other queries you have.)
All of that will help some. If you want another 10x speedup, build and maintain Summary tables.
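For reference, a 15-minute summary table could look roughly like this (a sketch; the table name, column names, and the @from/@to window variables are made up for illustration, and a scheduled job would keep it filled):

CREATE TABLE meterreadings_15min (
    meterid    varchar(16) NOT NULL,
    slot_start timestamp   NOT NULL,        -- start of the 15-minute interval
    sum_value  bigint      NOT NULL,
    cnt        int         NOT NULL,
    PRIMARY KEY (meterid, slot_start)
) ENGINE=InnoDB;

-- Run periodically; @from/@to are placeholders for the window of rows not yet summarized.
INSERT INTO meterreadings_15min (meterid, slot_start, sum_value, cnt)
SELECT meterid,
       FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(date_time) / (15*60)) * (15*60)) AS slot_start,
       SUM(metervalue),
       COUNT(*)
FROM meterreadings
WHERE date_time >= @from AND date_time < @to
GROUP BY meterid, slot_start
ON DUPLICATE KEY UPDATE
    sum_value = sum_value + VALUES(sum_value),
    cnt       = cnt + VALUES(cnt);

Reports then read the small table, computing AVG as SUM(sum_value) / SUM(cnt).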
As @akuzminsky indicated, a generated column can help. The obvious expression would be UNIX_TIMESTAMP(date_time) DIV (15*60); however, UNIX_TIMESTAMP is one of the functions not allowed in a generated column. So, despite the ugliness of the expression below, it does result in interval15 being the timestamp rounded down to 15 minutes.
ALTER TABLE meterreadings ADD interval15 timestamp AS (
SUBTIME(date_time,
CONCAT("0:", MINUTE(date_time) MOD 15, ":", SECOND(date_time),".", MICROSECOND(date_time)))),
ADD INDEX interval15 (interval15);
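As a quick sanity check (the literal value here is just an example), the expression rounds a timestamp down to the quarter hour:

SELECT SUBTIME('2018-07-23 13:43:42',
               CONCAT("0:", MINUTE('2018-07-23 13:43:42') MOD 15, ":",
                      SECOND('2018-07-23 13:43:42'), ".",
                      MICROSECOND('2018-07-23 13:43:42'))) AS interval15;
-- rounds down to 2018-07-23 13:30:00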
The new query is:
SELECT AVG(metervalue) as value
, DATE_FORMAT(interval15, "%d %b %Y %H:%i") as label
FROM meterreadings
WHERE interval15 >= '2018-07-23'
AND interval15 < '2018-07-23' + INTERVAL 1 DAY
GROUP BY interval15
ORDER BY interval15 ASC;
If you were optimizing only this query, you could append metervalue to the interval15 index, and then the result would come from the index alone.
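For instance (a sketch; this trades extra index size for a covering read):

ALTER TABLE meterreadings
    DROP INDEX interval15,
    ADD INDEX interval15 (interval15, metervalue);   -- covering index: the query is answered from the index alone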
Comment: using TO_SECONDS in the generated expression might have been possible. Didn't test it. – danblack, Aug 22, 2018 at 3:14
Comment: consider a composite index on (date_time, meterid), or even a covering index (date_time, meterid, metervalue).