I have an InnoDB MySQL table of this structure:
| column   | type         |
|----------|--------------|
| service  | varchar(16)  |
| url      | varchar(255) |
| datetime | datetime     |
It contains logs of requests made to one of multiple services. It currently holds ~3 million rows and gains ~3 million more a month (but I will delete old data if need be).
I am trying to generate a report, and get the number of requests made to each service within a date range.
Here is my current query:
```sql
SELECT service, COUNT(*) AS lastMonthCount
FROM request_logs
WHERE datetime > '2021-02-16 10:51:05'
GROUP BY service;
```
This works, but is painfully slow (~28 seconds).
It outputs this:
| service | lastMonthCount |
|---------|----------------|
| API A   | 3056752        |
| API B   | 38451          |
I have indexes on `datetime` and `service`. I can see they are of type `BTREE`.
How can I radically speed up this query, or restructure my table/indexes so I can achieve the same use case another way?
3 Answers
For a relatively simple query like that there isn't a lot you can do to optimise the query text itself, as there is little room for change: there isn't a simpler way of asking for that data.
You can probably significantly reduce the number of pages touched as the query runs by having an index on both `datetime` and `service`. This way the data it needs to group by will already be available in what it has read to perform the filter on the date. This will increase the amount of data on disk, as the index will be larger, and slow down writes a touch for the same reason. You would probably want to replace the existing index on `datetime` with the new composite index, instead of just adding it, for those reasons.
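A minimal sketch of that change, assuming the existing index is named `idx_datetime` (adjust to your actual index name) and putting `datetime` first (the column order is debated in the comments below):

```sql
-- Drop the single-column index and add a composite one that covers
-- both the WHERE filter (datetime) and the GROUP BY column (service).
ALTER TABLE request_logs
 DROP INDEX idx_datetime,
 ADD INDEX idx_datetime_service (datetime, service);
```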
A little more detail on why this will be faster:
With the index just on `datetime`, it needs to read the base table data pages to get the `service` column for each matching row. As well as the extra page reads incurred simply by referencing the other structure, it will possibly be many more page reads because those pages contain fewer rows each, due to the larger data per row (they include the `url` column and any other properties that you add, or might already have, for the services).
If that still isn't fast enough...
You may need to look at some form of caching for the counts. There are several options here:
- a materialised view may work, though MySQL has no native support for them, so the behaviour has to be emulated
- a little denormalising, by including a counts table updated by trigger when the main table is updated (this is essentially a more "manual" version of a materialised view; see the sketch after this list)
- something in the application layer, if you can live with the result not necessarily being 100% up-to-date every time (the fastest option by far if done right, but obviously with that key disadvantage)
- if the data is big enough that the query is reading from storage rather than RAM each time and you don't want to use one or more of the three options above: throw hardware at the problem and buy oodles of RAM! (this is usually not a good solution, though sometimes it can be)
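For the trigger-maintained counts table mentioned above, here is a minimal sketch; the table and trigger names are illustrative assumptions, not part of the original schema:

```sql
-- Summary table: one row per service per day.
CREATE TABLE request_counts (
 service varchar(16) NOT NULL,
 day date NOT NULL,
 cnt bigint unsigned NOT NULL DEFAULT 0,
 PRIMARY KEY (service, day)
);

-- Keep the summary current as new log rows arrive.
CREATE TRIGGER request_logs_count_ai
AFTER INSERT ON request_logs
FOR EACH ROW
 INSERT INTO request_counts (service, day, cnt)
 VALUES (NEW.service, DATE(NEW.datetime), 1)
 ON DUPLICATE KEY UPDATE cnt = cnt + 1;
```

The report then sums `cnt` over the wanted date range, touching at most a few rows per service per day instead of millions of log rows.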
- Thanks @David, I added an index across two columns and the query went from 28 to 16 seconds. – S.., Mar 16, 2021 at 11:51
- Specifying the order of the columns in the index is important. I believe `(service, datetime)` would be best. (That is, `service` being the leftmost column in the index.) – Willem Renzema, Mar 16, 2021 at 11:54
- @WillemRenzema – no, `(datetime, service)` should be faster as it will filter by date before grouping. With `(datetime, service)` you should get a seek followed by a partial index scan (from the point the seek found); with `(service, datetime)` it'll likely perform a full index scan unless it realises a skip-scan can be used (but that will likely still be slower, I think). – David Spillett, Mar 16, 2021 at 11:59
- @DavidSpillett Perhaps. I personally would try both and see which is better, but you may be right about `datetime` first being better, so I won't argue that. However, I think any mention of a composite index should be clear about what column order you are suggesting. Your answer currently looks non-specific on the order, as it is just saying the index should be on both columns. – Willem Renzema, Mar 16, 2021 at 12:32
Unfortunately MySQL is sometimes a little more limited in its options compared to other modern RDBMSs. One common way to solve the problem you're facing in other systems is to use something called a Materialized View. While not officially a feature of MySQL, you can replicate the behavior with a bit of coding, as demonstrated in Speeding Up MySQL Using Materialized Views and MATERIALIZED VIEWS WITH MYSQL.
You may also find some useful information from this DBA.StackExchange answer which gives some alternatives such as creating summary tables. Of course that means you'll need to maintain the data in two places, but you can automate this with Triggers.
Finally, as I prompted in the comments, if your table currently has two separate indexes on the `datetime` field and the `service` field, only one of those indexes can be used at a time to serve your query. The most optimal index to improve your query would likely be a single index on `(datetime, service)`, so that after the `datetime` field filters down the results, the remaining rows already have the `service` field included in the index, ready to go for the grouping.
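If you make that change, `EXPLAIN` is the standard way to confirm the composite index is actually being used (the expected output described here assumes the `(datetime, service)` index exists):

```sql
EXPLAIN
SELECT service, COUNT(*) AS lastMonthCount
FROM request_logs
WHERE datetime > '2021-02-16 10:51:05'
GROUP BY service;
-- With a (datetime, service) index you should see that index in the
-- "key" column and "Using index" in "Extra", meaning the query is
-- answered from the index alone without reading base table rows.
```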
Also, to answer your question in the comments, `VARCHAR(16)` isn't a terrible data type to index, it's just 4x as big as an `INT`, for example. I doubt you'll see game-changing performance by changing the data type, but you can experiment with switching to an `INT` and having a reference table with the actual `service` names stored in it (with a foreign key relationship from your main table). You could also try the `ENUM` data type, but I'm not personally familiar with it and have heard general recommendations against it.
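A minimal sketch of that reference-table approach; every name here is an illustrative assumption:

```sql
-- Lookup table holding the human-readable service names.
CREATE TABLE services (
 id int unsigned NOT NULL AUTO_INCREMENT,
 name varchar(16) NOT NULL,
 PRIMARY KEY (id),
 UNIQUE KEY uq_services_name (name)
);

-- The log table then stores the small integer id instead of the name.
-- On an already-populated table, backfill service_id from the old
-- varchar column before adding the foreign key constraint.
ALTER TABLE request_logs ADD COLUMN service_id int unsigned NOT NULL DEFAULT 0;

ALTER TABLE request_logs
 ADD CONSTRAINT fk_request_logs_service
 FOREIGN KEY (service_id) REFERENCES services (id);
```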
- Thanks a lot for all that. I have added a composite `(datetime, service)` index which increased performance by 50%; I will experiment with changing `service` to an int/enum now. Long term I will just run this query on an hourly cron and save the result into another table from which my report can be generated (it doesn't have to be instantaneous). – S.., Mar 16, 2021 at 12:14
- @S.. No problem! Sounds good. Materializing the data to another table will be your best bet for performance (at the tradeoff of how real-time the data is kept up to date). – J.D., Mar 16, 2021 at 12:16
- If using a `CHAR(16)`, switching to `INT` means the `(date, service)` index goes from 24 bytes per row to 8, so there would be a lot less data read to produce the result. Starting from `VARCHAR` may reduce that effect depending on the length of those strings (though I assume most/all are longer than 4 bytes?) but adds a little CPU work to unpack the variable-length rows. tl;dr: switching to an integer identifier is likely to help noticeably and may help a lot. – David Spillett, Mar 16, 2021 at 14:02
- @DavidSpillett Eh, I'd be surprised if the difference would be measurable. Maybe in a SQLite database on a mobile device, but even for MySQL on a regular server, given the difference in bytes (and, to your own point, it's a `VARCHAR`, so it's unlikely all 16 bytes are consumed in every row), I think proper indexing on both fields is going to be magnitudes better than changing data types here. My guess is maybe there'll be a second or two of difference?... because on the whole, the difference of 16 bytes across 3 million rows is only an extra ~50 MB of data... – J.D., Mar 16, 2021 at 14:14
- ...that needs to be processed by the server and aggregated (not even returned to the client). Hopefully modern CPUs and RAM can handle an extra 50 MB in sub-second time. 🙂 – J.D., Mar 16, 2021 at 14:15
Building on great suggestions from @J.D. and @David Spillett, here's what I did. I managed to improve the query from ~28 seconds to ~2 seconds, which was enough for now. Later I will look to either delete data older than the range I query for, or run this query on a cron job that stores the result in another table to retrieve from in real time instead.
- I removed the index on `service` and the index on `datetime`, adding a composite index on `(datetime, service)` instead. Ordering of the columns in that index matters. This took the query from ~28s to ~14s.
- I replaced the `service` column (`varchar(16)`) with an unsigned `tinyint` and mapped it to a string in my app. This was possible for me since the set of services was known to me and would change very rarely. This took the query from ~14s to ~2s. (See the sketch after this list.)
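A minimal sketch of those two changes as DDL; the index names and the service-to-id mapping are assumptions based on the description above, and each `ALTER` rebuilds a large table, so a real migration would run during a quiet window:

```sql
-- Step 1: replace the two single-column indexes with one composite index
-- (index names here are assumptions).
ALTER TABLE request_logs
 DROP INDEX idx_service,
 DROP INDEX idx_datetime,
 ADD INDEX idx_datetime_service (datetime, service);

-- Step 2: swap the varchar name for a 1-byte id mapped in the application.
-- The 'API A' / 'API B' ids below are illustrative.
ALTER TABLE request_logs ADD COLUMN service_id tinyint unsigned NOT NULL DEFAULT 0;

UPDATE request_logs
SET service_id = CASE service WHEN 'API A' THEN 1 WHEN 'API B' THEN 2 ELSE 0 END;

ALTER TABLE request_logs
 DROP INDEX idx_datetime_service,
 DROP COLUMN service,
 ADD INDEX idx_datetime_service (datetime, service_id);
```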
- I'm honestly very surprised by the amount of performance improvement you saw from changing the data type (perhaps MySQL is more fussy about that stuff, though), but there's a reason I don't know everything, lol. Glad it all worked out so well for you! – J.D., Mar 16, 2021 at 21:50