I need to query the latest entry from a table with more than 130 million entries using PostgreSQL 9.5. There are two columns of interest: row_id (integer, the primary key) and row_time (timestamp, not null, with a btree index).
This query works very fast:
EXPLAIN SELECT * FROM tbl ORDER BY row_time DESC LIMIT 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Limit (cost=0.57..0.88 rows=1 width=423)
-> Index Scan Backward using tbl_row_time on tbl (cost=0.57..40503877.59 rows=131390688 width=423)
In case there are several rows with the same row_time I would like to sort by row_id as a tie-breaker. Unfortunately, the query plan then regresses to a sequential scan:
EXPLAIN SELECT * FROM tbl ORDER BY row_time DESC, row_id DESC LIMIT 1;
QUERY PLAN
---------------------------------------------------------------------------------
Limit (cost=9602337.32..9602337.32 rows=1 width=423)
-> Sort (cost=9602337.32..9930814.04 rows=131390688 width=423)
Sort Key: row_time DESC, row_id DESC
-> Seq Scan on tbl (cost=0.00..8945383.88 rows=131390688 width=423)
I don't understand why this is happening. It would be easy to retrieve the rows with the maximum row_time using the index (typically less than 10 rows) and then sort just these. Instead, the full table is scanned, which takes minutes instead of the sub-millisecond time of the first query.
Creating a multicolumn index on (row_time, row_id) makes PostgreSQL choose an index scan again, but it increases the total index size considerably. A subquery to get MAX(row_time), then filtering by it and sorting the result by row_id, is also fast but seems overly complicated for such a simple goal. Am I missing something here?
1 Answer
If you create an index on (row_time DESC, row_id DESC), then the second query will behave just like the first one.
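A minimal sketch of such an index (the index name is illustrative):

```sql
-- Matches ORDER BY row_time DESC, row_id DESC exactly:
CREATE INDEX tbl_row_time_row_id ON tbl (row_time DESC, row_id DESC);
```

Note that a plain ascending index on (row_time, row_id) serves the same ORDER BY equally well, since PostgreSQL can scan a btree index backward, as the Index Scan Backward in the first query plan already shows.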
LIMIT always operates after sorting, so when a sort without an index is necessary, PostgreSQL sorts the entire record set before applying LIMIT.
What you seem to want is: apply LIMIT over the indexed order, include all rows matching the sorted field(s) in that limited result, then sort that small result set by the remaining fields and limit again. This is the way to go in this case:
SELECT *
FROM tbl
WHERE row_time = (
SELECT row_time
FROM tbl
ORDER BY row_time DESC
LIMIT 1)
ORDER BY row_id DESC
LIMIT 1;
Whereas for LIMIT n where n > 1:
SELECT *
FROM tbl
WHERE row_time IN (
SELECT row_time
FROM tbl
ORDER BY row_time DESC
LIMIT n)
ORDER BY row_id DESC
LIMIT n;
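As a quick sanity check of the tie-breaking semantics of the first query, here is a self-contained demo using SQLite in Python (purely to exercise the SQL; the planner behavior discussed above is PostgreSQL-specific, and the table contents are made up):

```python
import sqlite3

# In-memory toy table standing in for tbl (column names match the question;
# the data is illustrative).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (row_id INTEGER PRIMARY KEY, row_time TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO tbl (row_id, row_time) VALUES (?, ?)",
    [
        (1, "2016-06-14 10:00:00"),
        (2, "2016-06-14 12:00:00"),  # tied for the latest row_time...
        (3, "2016-06-14 12:00:00"),  # ...so row_id breaks the tie
    ],
)

# First narrow down to the latest row_time, then sort only those rows by row_id.
row = conn.execute(
    """
    SELECT *
    FROM tbl
    WHERE row_time = (
        SELECT row_time
        FROM tbl
        ORDER BY row_time DESC
        LIMIT 1)
    ORDER BY row_id DESC
    LIMIT 1
    """
).fetchone()

print(row)  # (3, '2016-06-14 12:00:00')
```

The inner subquery only touches the top of the index, and the outer sort then operates on the handful of tied rows rather than the whole table.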
- @ypercubeTM not if row_time is not unique (which in this case is expected because there's an extra ORDER BY element) – Ezequiel Tolnay, Jun 14, 2016 at 22:08
- @ypercubeTM ah, gotcha, I get what you mean. I fixed the code :) – Ezequiel Tolnay, Jun 15, 2016 at 7:07
- And the IN could be = ;) – ypercubeᵀᴹ, Jun 15, 2016 at 7:10
- rows=131390688 rows. Something does not add up here, or your table statistics are seriously off track. Is autovacuum on? Have you tried a manual ANALYZE tbl?