
I have a small web application that uses sqlite3 as its DB (the database is fairly small).

Right now, I am generating some content to display using the following query:

SELECT dbId,
       dlState,
       retreivalTime,
       seriesName,
       <snip irrelevant columns>
FROM DataItems
GROUP BY seriesName
ORDER BY retreivalTime DESC
LIMIT ?
OFFSET ?;

The limit is typically ~200 and the offset is 0 (they drive a pagination mechanism).

Anyway, right now this one query is completely killing my performance: it takes approximately 800 milliseconds to execute on a table with ~67K rows.

I have indexes on both seriesName and retreivalTime.

sqlite> SELECT name FROM sqlite_master WHERE type='index' ORDER BY name;
<snip irrelevant indexes>
DataItems_seriesName_index
DataItems_time_index   -- this is the index on retreivalTime; yeah, it's poorly named

However, EXPLAIN QUERY PLAN seems to indicate they're not being used:

sqlite> EXPLAIN QUERY PLAN SELECT dbId, dlState, retreivalTime, seriesName
        FROM DataItems
        GROUP BY seriesName
        ORDER BY retreivalTime DESC
        LIMIT 200 OFFSET 0;
0|0|0|SCAN TABLE DataItems
0|0|0|USE TEMP B-TREE FOR GROUP BY
0|0|0|USE TEMP B-TREE FOR ORDER BY

The index on seriesName is COLLATE NOCASE, if that's relevant.

If I drop the GROUP BY, it behaves as expected:

sqlite> EXPLAIN QUERY PLAN SELECT dbId, dlState, retreivalTime, seriesName FROM DataItems ORDER BY retreivalTime DESC LIMIT 200 OFFSET 0;
0|0|0|SCAN TABLE DataItems USING INDEX DataItems_time_index

Basically, my naive assumption would be that the best way to perform this query is to walk backwards from the latest value of retreivalTime and, every time a new value of seriesName is seen, append that row to a temporary list, finally returning that list. That would perform poorly when OFFSET is large, but that happens very rarely in this application.
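In SQL terms, I think what I'm describing amounts to the "latest row per seriesName" pattern, something like this (untested sketch, I have no idea whether it would actually be faster):

SELECT dbId,
       dlState,
       retreivalTime,
       seriesName
FROM DataItems AS d
WHERE retreivalTime = (SELECT MAX(retreivalTime)
                       FROM DataItems
                       WHERE seriesName = d.seriesName)
ORDER BY retreivalTime DESC
LIMIT 200 OFFSET 0;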

How can I optimize this query? I can provide the raw query operations if needed.

Insert performance is not critical here, so if I need to create an additional index or two, that's fine.


My current thought is a commit-hook (or trigger) that updates a separate table used to track only the unique items, but that seems like overkill.
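For illustration, a rough (hypothetical, untested) version of that idea using a trigger; the table and trigger names are made up, and it assumes rows are inserted in increasing retreivalTime order:

CREATE TABLE LatestPerSeries (
    seriesName    TEXT PRIMARY KEY,
    dbId          INTEGER,
    retreivalTime INTEGER
);

CREATE TRIGGER DataItems_latest_ai AFTER INSERT ON DataItems
BEGIN
    -- Keep only the newest row per series; assumes inserts arrive newest-last.
    INSERT OR REPLACE INTO LatestPerSeries (seriesName, dbId, retreivalTime)
    VALUES (NEW.seriesName, NEW.dbId, NEW.retreivalTime);
END;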

asked Jul 25, 2014 at 6:25
  • Before optimizing it, make it correct. Using GROUP BY seriesName and then SELECT various columns without aggregation, will give you indeterminate results. Commented Jul 25, 2014 at 11:09
  • @ypercube - I don't actually care if it's correct in that way. Any of the rows flattened by the GROUP BY are acceptable. Commented Jul 25, 2014 at 11:16
  • The problem is even worse. You can have rows in the result that include values from different rows (so rows that do not actually exist in the table! At least in MySQL's implementation, not sure about SQLite.) Commented Jul 25, 2014 at 11:18
  • And if you do want (just 1 row per seriesName), why have ORDER BY and OFFSET? Just use LIMIT 200. Commented Jul 25, 2014 at 11:20
  • It seems to work fine in sqlite. I think it just returns the first row it sees for the GROUP BY query. Commented Jul 25, 2014 at 11:20

2 Answers


Here is a suggestion: add an index on (seriesName, retreivalTime) and try this query. It won't be super fast, but it should be more efficient than what you have:

SELECT d.dbId,
       d.dlState,
       d.retreivalTime,
       d.seriesName,
       <snip irrelevant columns>
FROM DataItems AS d
JOIN
  ( SELECT seriesName,
           MAX(retreivalTime) AS max_retreivalTime
    FROM DataItems
    GROUP BY seriesName
    ORDER BY max_retreivalTime DESC
    LIMIT ?
    OFFSET ?
  ) AS di
  ON  di.seriesName = d.seriesName
  AND di.max_retreivalTime = d.retreivalTime
ORDER BY di.max_retreivalTime DESC ;
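For reference, the suggested index might be created like this (the index name is just illustrative):

CREATE INDEX DataItems_series_time_index
    ON DataItems (seriesName, retreivalTime);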

Or, as a variation, use the PK as well: with an index on (seriesName, retreivalTime, dbId), the query becomes:

SELECT d.dbId,
       d.dlState,
       d.retreivalTime,
       d.seriesName,
       <snip irrelevant columns>
FROM DataItems AS d
JOIN
  ( SELECT dbId
    FROM DataItems
    GROUP BY seriesName
    ORDER BY MAX(retreivalTime) DESC
    LIMIT ?
    OFFSET ?
  ) AS di
  ON di.dbId = d.dbId
ORDER BY d.retreivalTime DESC ;
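The covering index for this variant might be created like this (again, the name is illustrative):

CREATE INDEX DataItems_series_time_id_index
    ON DataItems (seriesName, retreivalTime, dbId);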

The logic behind both queries is to use only the index for the derived table's calculations: finding MAX(retreivalTime) for every seriesName, then ordering and applying the OFFSET/LIMIT.

The table itself is then touched only to fetch the 200 rows that are actually displayed.

answered Jul 25, 2014 at 11:27
  • dbId is a PRIMARY KEY, so I think using that to JOIN on might be a better option? Does looking up the row using a single key have sufficient performance improvement to counteract having to select a third column? Commented Jul 25, 2014 at 11:35
  • I can create a index on all three. Commented Jul 25, 2014 at 11:36
  • Anyways, I'll do some testing tomorrow. Commented Jul 25, 2014 at 11:37
  • @FakeName Will you need the last ORDER BY di.max_retreivalTime;? I can add a version with the PK. Commented Jul 25, 2014 at 11:39
  • Yeah, either way, I need to order by the time. Commented Jul 25, 2014 at 11:40

An index can be used to optimize the GROUP BY, but if the ORDER BY uses different columns, the sorting cannot use an index (an index helps only if the database can read the rows from the table in the sort order).

A COLLATE NOCASE index does not help if you use a different collation in the query. Add a 'normal' index, or use GROUP BY seriesName COLLATE NOCASE, if that is allowed; both options are sketched below.
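For illustration, the two options would look something like this (the index name is made up):

-- Option 1: an additional index with the default BINARY collation:
CREATE INDEX DataItems_seriesName_binary ON DataItems (seriesName);

-- Option 2: make the query's collation match the existing NOCASE index:
SELECT dbId, dlState, retreivalTime, seriesName
FROM DataItems
GROUP BY seriesName COLLATE NOCASE
ORDER BY retreivalTime DESC
LIMIT 200 OFFSET 0;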

Using the OFFSET clause for pagination is not very efficient, because the database still has to group and sort all rows before it can begin stepping over them. It is better to use a scrolling cursor, where each page continues from the last value seen on the previous page, as sketched below.
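A minimal sketch of such a cursor, ignoring the GROUP BY for clarity and assuming retreivalTime values are unique:

-- First page: no cursor value yet.
SELECT dbId, dlState, retreivalTime, seriesName
FROM DataItems
ORDER BY retreivalTime DESC
LIMIT 200;

-- Later pages: bind the smallest retreivalTime seen on the previous page.
SELECT dbId, dlState, retreivalTime, seriesName
FROM DataItems
WHERE retreivalTime < ?
ORDER BY retreivalTime DESC
LIMIT 200;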

Note: there is no guarantee that the dbId and dlState values come from any specific row; SQLite allows non-aggregated columns in an aggregated query only for bug compatibility with MySQL.

answered Jul 25, 2014 at 11:02
  • I like this phrase "bug compatibility". So, SQLite tries to be compatible with MySQL's bugs, nice! Commented Jul 25, 2014 at 11:10
  • The reason I'm using offset is because that way, my pagination can be completely stateless. I did some testing, and without any offset, my query time is ~0.158 seconds. With an offset of ~50000, it's ~0.163-0.167 seconds. Considering the complexity of reworking everything to provide persistent sessions to the clients, for the ~5 millisecond speedup, it's not worth it. Commented Jul 26, 2014 at 7:27
  • The current offset is state. Commented Jul 26, 2014 at 8:07
  • @CL. - I mean state on the server. Right now, the rendered page has links that include the offset value. There is no persistent state on the server. Commented Jul 28, 2014 at 0:51
  • With a scrolling cursor, the link would not have an offset value but the column value at which to continue. Commented Jul 28, 2014 at 7:07
