I have an InnoDB MySQL table of this structure:
| column   | type         |
|----------|--------------|
| service  | varchar(16)  |
| url      | varchar(255) |
| datetime | datetime     |
It contains logs of requests made to one of multiple services. It currently holds ~3 million rows and gains ~3 million more a month (but I will delete old data if need be).
I am trying to generate a report, and get the number of requests made to each service within a date range.
Here is my current query:
```sql
SELECT service, COUNT(*) AS lastMonthCount
FROM request_logs
WHERE datetime > '2021-02-16 10:51:05'
GROUP BY service;
```
This works, but is painfully slow (~28 seconds).
It outputs this:
| service | lastMonthCount |
|---------|----------------|
| API A   | 3056752        |
| API B   | 38451          |
I have indexes on `datetime` and `service`. I can see they are of type `BTREE`.
How can I radically speed up this query, or restructure my table/indexes so I can achieve the same use case another way?
3 Answers
For a relatively simple query like that there isn't a lot you can do to optimise the query text itself, as there is little room for change: there isn't a simpler way of asking for that data.
You can probably significantly reduce the number of pages touched as the query runs by having an index on both `datetime` and `service`. This way the data it needs to group by will already be available in what it has read to perform the filter on the date. This will increase the amount of data on disk, as the index will be larger, and slow down writes a touch for the same reason. You would probably want to replace the existing index on `datetime` with the new composite index, instead of just adding it, for those reasons.
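A minimal sketch of that change, assuming the existing index is named `idx_datetime` (adjust to your actual index name) and putting `datetime` first (the column order is debated in the comments below):

```sql
-- Drop the single-column index and add a composite one that covers
-- both the WHERE filter (datetime) and the GROUP BY column (service).
ALTER TABLE request_logs
 DROP INDEX idx_datetime,
 ADD INDEX idx_datetime_service (datetime, service);
```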
A little more detail on why this will be faster:
With the index just on `datetime`, it needs to read the base table data pages to get the `service` column for each matching row. As well as the extra page reads incurred simply by referencing the other structure, it will possibly be many more page reads because those pages contain fewer rows each, due to the larger data per row (they include the `url` column and any other properties that you add, or might already have, for the services).
If that still isn't fast enough...
You may need to look at some form of caching for the counts. There are several options here:
- a materialised view may work, though MySQL has no native support for them, so the behaviour has to be emulated
- a little denormalising, by including a counts table updated by trigger when the main table is updated (this is essentially a more "manual" version of a materialised view; see the sketch after this list)
- something in the application layer, if you can live with the result not necessarily being 100% up-to-date every time (the fastest option by far if done right, but obviously with that key disadvantage)
- if the data is big enough that the query is reading from storage rather than RAM each time and you don't want to use one or more of the three options above: throw hardware at the problem and buy oodles of RAM! (this is usually not a good solution, though sometimes it can be)
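For the trigger-maintained counts table mentioned above, here is a minimal sketch; the table and trigger names are illustrative assumptions, not part of the original schema:

```sql
-- Summary table: one row per service per day.
CREATE TABLE request_counts (
 service varchar(16) NOT NULL,
 day date NOT NULL,
 cnt bigint unsigned NOT NULL DEFAULT 0,
 PRIMARY KEY (service, day)
);

-- Keep the summary current as new log rows arrive.
CREATE TRIGGER request_logs_count_ai
AFTER INSERT ON request_logs
FOR EACH ROW
 INSERT INTO request_counts (service, day, cnt)
 VALUES (NEW.service, DATE(NEW.datetime), 1)
 ON DUPLICATE KEY UPDATE cnt = cnt + 1;
```

The report then sums `cnt` over the wanted date range, touching at most a few rows per service per day instead of millions of log rows.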
- Thanks @David, I added an index across two columns and the query went from 28 to 16 seconds. – S.., Mar 16, 2021 at 11:51
- Specifying the order of the columns in the index is important. I believe `(service, datetime)` would be best. (That is, `service` being the leftmost column in the index.) – Willem Renzema, Mar 16, 2021 at 11:54
- @WillemRenzema – no, `(datetime, service)` should be faster as it will filter by date before grouping. With `(datetime, service)` you should get a seek followed by a partial index scan (from the point the seek found); with `(service, datetime)` it'll likely perform a full index scan unless it realises a skip-scan can be used (but that will likely still be slower, I think). – David Spillett, Mar 16, 2021 at 11:59
- @DavidSpillett Perhaps. I personally would try both and see which is better, but you may be right about `datetime` first being better, so I won't argue that. However, I think any mention of a composite index should be clear about what column order you are suggesting. Your answer currently looks non-specific on the order, as it is just saying the index should be on both columns. – Willem Renzema, Mar 16, 2021 at 12:32
Unfortunately MySQL is sometimes a little more limited in its options compared to other modern RDBMSs. One common way to solve the problem you're facing in other systems is to use something called a Materialized View. While not officially a feature of MySQL, you can replicate the behavior with a bit of coding, as demonstrated in Speeding Up MySQL Using Materialized Views and MATERIALIZED VIEWS WITH MYSQL.
You may also find some useful information from this DBA.StackExchange answer which gives some alternatives such as creating summary tables. Of course that means you'll need to maintain the data in two places, but you can automate this with Triggers.
Finally, as I prompted in the comments, if your table currently has two separate indexes on the `datetime` field and the `service` field, only one of those indexes can be used at a time to serve your query. The most optimal index to improve your query would likely be a single index on `(datetime, service)`, so that after the `datetime` field filters down the results, the remaining rows already have the `service` field included in the index, ready to go for the grouping.
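If you make that change, `EXPLAIN` is the standard way to confirm the composite index is actually being used (the expected output described here assumes the `(datetime, service)` index exists):

```sql
EXPLAIN
SELECT service, COUNT(*) AS lastMonthCount
FROM request_logs
WHERE datetime > '2021-02-16 10:51:05'
GROUP BY service;
-- With a (datetime, service) index you should see that index in the
-- "key" column and "Using index" in "Extra", meaning the query is
-- answered from the index alone without reading base table rows.
```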
Also, to answer your question in the comments, `VARCHAR(16)` isn't a terrible data type to index, it's just 4x as big as an `INT`, for example. I doubt you'll see game-changing performance by changing the data type, but you can experiment with switching to an `INT` and having a reference table with the actual `service` names stored in it (with a foreign key relationship from your main table). You could also try the `ENUM` data type, but I'm not personally familiar with it and have heard general recommendations against it.
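A minimal sketch of that reference-table approach; every name here is an illustrative assumption:

```sql
-- Lookup table holding the human-readable service names.
CREATE TABLE services (
 id int unsigned NOT NULL AUTO_INCREMENT,
 name varchar(16) NOT NULL,
 PRIMARY KEY (id),
 UNIQUE KEY uq_services_name (name)
);

-- The log table then stores the small integer id instead of the name.
-- On an already-populated table, backfill service_id from the old
-- varchar column before adding the foreign key constraint.
ALTER TABLE request_logs ADD COLUMN service_id int unsigned NOT NULL DEFAULT 0;

ALTER TABLE request_logs
 ADD CONSTRAINT fk_request_logs_service
 FOREIGN KEY (service_id) REFERENCES services (id);
```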
- Thanks a lot for all that. I have added a composite `(datetime, service)` index which increased performance by 50%; I will experiment with changing `service` to an int/enum now. Long term I will just run this query on an hourly cron and save the result into another table from which my report can be generated (it doesn't have to be instantaneous). – S.., Mar 16, 2021 at 12:14
- @S.. No problem! Sounds good. Materializing the data to another table will be your best bet for performance (at the tradeoff of how real-time the data is kept up to date). – J.D., Mar 16, 2021 at 12:16
- If using a `CHAR(16)`, switching to `INT` means the `(date, service)` index goes from 24 bytes per row to 8, so there would be a lot less data read to produce the result. Starting from `VARCHAR` may reduce that effect depending on the length of those strings (though I assume most/all are longer than 4 bytes?) but adds a little CPU work to unpack the variable-length rows. tl;dr: switching to an integer identifier is likely to help noticeably and may help a lot. – David Spillett, Mar 16, 2021 at 14:02
- @DavidSpillett Eh, I'd be surprised if the difference would be measurable. Maybe in a SQLite database on a mobile device, but even for MySQL on a regular server, given the difference in bytes (and, to your own point, it's a `VARCHAR`, so it's unlikely all 16 bytes are consumed in every row), I think proper indexing on both fields is going to be magnitudes better than changing data types here. My guess is maybe there'll be a second or two of difference?... because on the whole, the difference of 16 bytes across 3 million rows is only an extra ~50 MB of data... – J.D., Mar 16, 2021 at 14:14
- ...that needs to be processed by the server and aggregated (not even returned to the client). Hopefully modern CPUs and RAM can handle an extra 50 MB in sub-second time. 🙂 – J.D., Mar 16, 2021 at 14:15
Building on great suggestions from @J.D. and @David Spillett, here's what I did. I managed to improve the query from ~28 seconds to ~2 seconds, which was enough for now. Later I will look to either delete data older than the range I query for, or run this query on a cron job that stores the result in another table to retrieve from in real time instead.
- I removed the index on `service` and the index on `datetime`, adding a composite index on `(datetime, service)` instead. Ordering of the columns in that index matters. This took the query from ~28s to ~14s.
- I replaced the `service` column (`varchar(16)`) with an unsigned `tinyint` and mapped it to a string in my app. This was possible for me since the set of services was known to me and would change very rarely. This took the query from ~14s to ~2s. (See the sketch after this list.)
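A minimal sketch of those two changes as DDL; the index names and the service-to-id mapping are assumptions based on the description above, and each `ALTER` rebuilds a large table, so a real migration would run during a quiet window:

```sql
-- Step 1: replace the two single-column indexes with one composite index
-- (index names here are assumptions).
ALTER TABLE request_logs
 DROP INDEX idx_service,
 DROP INDEX idx_datetime,
 ADD INDEX idx_datetime_service (datetime, service);

-- Step 2: swap the varchar name for a 1-byte id mapped in the application.
-- The 'API A' / 'API B' ids below are illustrative.
ALTER TABLE request_logs ADD COLUMN service_id tinyint unsigned NOT NULL DEFAULT 0;

UPDATE request_logs
SET service_id = CASE service WHEN 'API A' THEN 1 WHEN 'API B' THEN 2 ELSE 0 END;

ALTER TABLE request_logs
 DROP INDEX idx_datetime_service,
 DROP COLUMN service,
 ADD INDEX idx_datetime_service (datetime, service_id);
```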
- I'm honestly very surprised by the amount of performance improvement you saw from changing the data type (perhaps MySQL is more fussy about that stuff, though), but there's a reason I don't know everything, lol. Glad it all worked out so well for you! – J.D., Mar 16, 2021 at 21:50