Summary
I have a table with event sources and a table with events. I need a query which gives me the N most recent events from each source (N is anywhere from 1 to 100). Currently, I'm doing this with a subquery that performs ROW_NUMBER() OVER (PARTITION BY "EventSourceId" ORDER BY ...) AS rankRecent and an outer query that filters WHERE rankRecent <= @N.
The EXPLAIN ANALYZE results say that it is using my index for the partition and order-by clauses, but it is still ranking the entire table, and it apparently expects to find 6 million qualifying rows when there are only 22 thousand. I am trying to find either: (1) a better way of getting the N most recent events for each event source, or (2) a way to hint to the query planner that it doesn't need to strictly rank most of the table, since only the first few entries per partition will be used.
Also, there is a second use case for the query that I don't even know how to begin indexing for. That is not the thrust of this question; I only mention it in the interest of including everything that may be relevant.
Details
Data Setup
CREATE TABLE "EventSources"
(
"Id" uuid NOT NULL,
"Name" character varying(100),
CONSTRAINT "PK_EventSources" PRIMARY KEY ("Id")
);
CREATE TABLE "Events"
(
"Id" uuid NOT NULL,
"EventSourceId" uuid NOT NULL,
"Time" timestamp with time zone,
"AltKey" character varying(100),
CONSTRAINT "PK_Events" PRIMARY KEY ("Id"),
CONSTRAINT "FK_Events_EventSources_EventSourceId" FOREIGN KEY ("EventSourceId") REFERENCES "EventSources" ("Id") MATCH SIMPLE ON UPDATE NO ACTION ON DELETE RESTRICT
);
CREATE INDEX "IX_Events_EventSourceId_Time_Desc_AltKey_Desc" ON "Events" USING btree
(
"EventSourceId" ASC,
"Time" DESC NULLS LAST,
"AltKey" DESC NULLS LAST
);
Some additional possibly-relevant info:
SELECT version(); --PostgreSQL 13.12, compiled by Visual C++ build 1914, 64-bit
SELECT COUNT(*) FROM "EventSources"; --29,000ish
SELECT COUNT(*) FROM "Events"; --20,000,000ish
SELECT COUNT(*) FROM (SELECT DISTINCT "EventSourceId" FROM "Events") sub; --5,000ish. Most of the "EventSources" don't have "Events" but are used for other things in the db
The Query
Here is the query I'm trying to optimize:
SELECT
*
FROM
(
SELECT
"Events".*,
ROW_NUMBER() OVER (
PARTITION BY "Events"."EventSourceId"
ORDER BY
"Events"."Time" DESC NULLS LAST,
"Events"."AltKey" DESC NULLS LAST
) as rankRecent
FROM
"Events"
--WHERE "Events"."Time" < @LimitTime
) sub
WHERE
rankRecent <= @N; -- @N is in the range 1 to 100.
Use Cases
Here are the Use Cases for the query:
- I'm loading a dashboard which displays aggregated calculations of recent data for each event source and based on those calculations chooses which event sources to display and in which order.
- I'm investigating an issue that happened yesterday at 3:14:15 a.m., and, for a given collection of relevant event sources, I need to see the 100 events leading up to that time so that I can see what might have been going wrong. In this case, the commented WHERE clause in the query will be uncommented, and it may also join to another table in order to filter for event sources that are related to a particular context. I have no idea how to index for this, but this isn't the thrust of the question.
Explain
Here is the result of EXPLAIN (ANALYZE, BUFFERS, SETTINGS) (in this case, @N was set to 5):
Subquery Scan on sub  (cost=0.56..2637123.43 rows=6664000 width=66) (actual time=0.156..90245.642 rows=22613 loops=1)
  Filter: (sub.rankrecent <= 5)
  Rows Removed by Filter: 19963368
  Buffers: shared hit=6738934 read=13332278
  ->  WindowAgg  (cost=0.56..2387223.43 rows=19992000 width=66) (actual time=0.155..89355.268 rows=19985981 loops=1)
        Buffers: shared hit=6738934 read=13332278
        ->  Index Scan using "IX_Events_EventSourceId_Time_Desc_AltKey_Desc" on "Events"  (cost=0.56..1987383.43 rows=19992000 width=58) (actual time=0.100..82274.745 rows=19985981 loops=1)
              Buffers: shared hit=6738934 read=13332278
Planning Time: 0.111 ms
Execution Time: 90247.357 ms
Alternate Approaches I've Considered
- I considered using a materialized view to do the heavy lifting beforehand. However, the "Events" table has data inserted into it about a dozen times per second, and each such transaction can contain anywhere from zero to a few dozen rows. Therefore, if the view uses the same query plan as the existing query, the data in the view will be entirely out of date before a command to refresh it can complete. So I concluded there was no point to this strategy.
- I also considered adding a new integer sequence column to the "Events" table, then filtering on that sequence. But that doesn't really solve the problem. Each transaction that inserts events may or may not have events for a given event source, so the 5 most recent events for a given event source may include data from, for example, a transaction from 5 seconds ago, one from an hour ago, one from yesterday, one from last week, and one from last year. So there's no easy way to synchronize such a sequence across all the thousands of different event sources. Further, it's possible for an event source to manually submit data that was missed at an arbitrary time in the past, so sorting on "Time" seems to be right.
- I considered having a table that only tracks, say, the 50 most recent events for each event source and drops anything older (a rough trigger sketch follows this list). That way, the query could go ahead and rank the entire table because there wouldn't be much data to rank. However, the query would no longer satisfy Use Case #2; and, while Use Case #2 isn't the thrust of this question, I can't just remove it from the query without providing a replacement.
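For concreteness, here is a rough sketch of how I imagined maintaining such a capped table (the "RecentEvents" table, function, and trigger names here are hypothetical):

CREATE OR REPLACE FUNCTION trim_recent_events() RETURNS trigger AS $$
BEGIN
    -- Keep only the 50 newest rows for the event source that was just inserted.
    DELETE FROM "RecentEvents"
    WHERE "EventSourceId" = NEW."EventSourceId"
      AND "Id" NOT IN (
          SELECT "Id"
          FROM "RecentEvents"
          WHERE "EventSourceId" = NEW."EventSourceId"
          ORDER BY "Time" DESC NULLS LAST, "AltKey" DESC NULLS LAST
          LIMIT 50
      );
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER "TR_RecentEvents_Trim"
AFTER INSERT ON "RecentEvents"
FOR EACH ROW EXECUTE FUNCTION trim_recent_events();

Even then, this only helps Use Case #1, which is why I set the idea aside.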
Clarification
The row_number-over-partition question is the main question here. The material about Use Case #2, and about not knowing how to index for it, is included in case it inspires someone to realize a better plan, query, or schema; I'm not expecting a solution for it, just wanting to include any information that may be relevant.
Answer
A new optimization was introduced in PostgreSQL 15 which allows the executor to stop computing the ranks once it has found the needed number of rows in each partition (you are on 13, so this would require an upgrade). This is not magic: it still needs to read all the rows, because it has no way of knowing where the next partition starts other than by reading rows until the partition key changes. But it should at least shave off some time. When this optimization is in effect, it shows up in the plan with a line like:
WindowAgg (cost=...
Run Condition: (row_number() OVER (?) <= 100)
You could probably improve on that by implementing an index skip scan. PostgreSQL doesn't implement those automatically, but you can emulate one with a recursive CTE, as shown on the project wiki (although it looks like maybe someone goobered it up a bit since the last time I visited it).
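Roughly, the recursive CTE hops through the index to find each distinct "EventSourceId", and a lateral subquery then grabs the top rows for each one. Here is a sketch of that pattern adapted to your schema; it's from memory rather than copied from the wiki, so test it before trusting it:

WITH RECURSIVE sources AS (
    -- Seed: the smallest "EventSourceId" in the index.
    (SELECT "EventSourceId" FROM "Events" ORDER BY "EventSourceId" LIMIT 1)
    UNION ALL
    -- Step: skip to the next distinct "EventSourceId"; yields NULL when done.
    SELECT (SELECT e."EventSourceId"
            FROM "Events" e
            WHERE e."EventSourceId" > s."EventSourceId"
            ORDER BY e."EventSourceId"
            LIMIT 1)
    FROM sources s
    WHERE s."EventSourceId" IS NOT NULL
)
SELECT t.*
FROM sources s
CROSS JOIN LATERAL (
    SELECT *
    FROM "Events"
    WHERE "EventSourceId" = s."EventSourceId"
    ORDER BY "Time" DESC NULLS LAST, "AltKey" DESC NULLS LAST
    LIMIT 100  -- @N
) t
WHERE s."EventSourceId" IS NOT NULL;

Each step of the recursion is a single index probe, so it visits roughly one index entry per distinct source instead of walking all 20 million rows.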
Since you already have a table of distinct event sources, you could make use of that with a lateral join rather than the CTE, like this (I'm not going to hop all over my keyboard with the capitals and quotes; I leave that to you):
select t.*
from eventsources
cross join lateral (
    select *
    from events
    where eventsourceid = eventsources.id
    order by time desc nulls last, altkey desc nulls last
    limit 100  -- @N
) t;
The unused rows in EventSources will slow this down a bit, but it should still probably be much faster than your current query.
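For reference, here is my best guess at the same query with your quoted identifiers put back (untested, so double-check it against your schema):

select t.*
from "EventSources"
cross join lateral (
    select *
    from "Events"
    where "Events"."EventSourceId" = "EventSources"."Id"
    order by "Events"."Time" desc nulls last, "Events"."AltKey" desc nulls last
    limit 100  -- @N
) t;

Each lateral iteration should be a single descent into "IX_Events_EventSourceId_Time_Desc_AltKey_Desc" followed by at most 100 index tuples, which is why this should beat ranking the whole table.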
For your related question, I don't know; I haven't read it carefully. Ask another question if you have another question, once you've verified that this approach works for you on this one.
Comment: Excellent. The lateral join took it from 90 seconds to under 1 second. Thank you so much. – kanders84152, Oct 12, 2023 at 22:52
WHERE "Events"."Time" > <some reasonable cut-off time>
?