Summary
I have a table with event sources and a table with events. I need a query which gives me the N most recent events from each source (N is anywhere from 1 to 100). Currently, I'm doing this with a subquery that performs ROW_NUMBER() OVER (PARTITION BY "EventSourceId" ORDER BY ...) AS rankRecent and an outer query that filters WHERE rankRecent <= @N.
The EXPLAIN ANALYZE results say that it is using my index for the partition and order-by clauses, but it is still ranking the entire table, and it apparently expects to find 6 million qualifying rows when there are only 22 thousand. I am trying to find either: (1) a better way of getting the N most recent events for each event source, or (2) a way to hint to the query planner that it doesn't need to strictly rank most of the table, since only the first few entries per partition will be used.
Also, there is a second use case for the query that I don't even know how to begin indexing for. That is not the thrust of this question; I only mention it in the interest of including everything that may be relevant.
Details
Data Setup
CREATE TABLE "EventSources"
(
"Id" uuid NOT NULL,
"Name" character varying(100),
CONSTRAINT "PK_EventSources" PRIMARY KEY ("Id")
);
CREATE TABLE "Events"
(
"Id" uuid NOT NULL,
"EventSourceId" uuid NOT NULL,
"Time" timestamp with time zone,
"AltKey" character varying(100),
CONSTRAINT "PK_Events" PRIMARY KEY ("Id"),
CONSTRAINT "FK_Events_EventSources_EventSourceId" FOREIGN KEY ("EventSourceId") REFERENCES "EventSources" ("Id") MATCH SIMPLE ON UPDATE NO ACTION ON DELETE RESTRICT
);
CREATE INDEX "IX_Events_EventSourceId_Time_Desc_AltKey_Desc" ON "Events" USING btree
(
"EventSourceId" ASC,
"Time" DESC NULLS LAST,
"AltKey" DESC NULLS LAST
);
Some additional possibly-relevant info:
SELECT version(); --PostgreSQL 13.12, compiled by Visual C++ build 1914, 64-bit
SELECT COUNT(*) FROM "EventSources"; --29,000ish
SELECT COUNT(*) FROM "Events"; --20,000,000ish
SELECT COUNT(*) FROM (SELECT DISTINCT "EventSourceId" FROM "Events") sub; --5,000ish. Most of the "EventSources" don't have "Events" but are used for other things in the db
The Query
Here is the query I'm trying to optimize:
SELECT
*
FROM
(
SELECT
"Events".*,
ROW_NUMBER() OVER (
PARTITION BY "Events"."EventSourceId"
ORDER BY
"Events"."Time" DESC NULLS LAST,
"Events"."AltKey" DESC NULLS LAST
) as rankRecent
FROM
"Events"
--WHERE "Events"."Time" < @LimitTime
) sub
WHERE
rankRecent <= @N; -- @N is in the range 1 to 100.
Use Cases
Here are the Use Cases for the query:
- I'm loading a dashboard which displays aggregated calculations of recent data for each event source and based on those calculations chooses which event sources to display and in which order.
- I'm investigating an issue that happened yesterday at 3:14:15 a.m., and, for a given collection of relevant event sources, I need to see the 100 events leading up to that time so that I can see what might have been going wrong. In this case, the commented WHERE clause in the query will be uncommented, and it may also join to another table in order to filter for event sources that are related to a particular context. I have no idea how to index for this, but this isn't the thrust of the question.
Explain
Here is the result of EXPLAIN (ANALYZE, BUFFERS, SETTINGS) (in this case, @N was set to 5):
Subquery Scan on sub  (cost=0.56..2637123.43 rows=6664000 width=66) (actual time=0.156..90245.642 rows=22613 loops=1)
  Filter: (sub.rankrecent <= 5)
  Rows Removed by Filter: 19963368
  Buffers: shared hit=6738934 read=13332278
  ->  WindowAgg  (cost=0.56..2387223.43 rows=19992000 width=66) (actual time=0.155..89355.268 rows=19985981 loops=1)
        Buffers: shared hit=6738934 read=13332278
        ->  Index Scan using "IX_Events_EventSourceId_Time_Desc_AltKey_Desc" on "Events"  (cost=0.56..1987383.43 rows=19992000 width=58) (actual time=0.100..82274.745 rows=19985981 loops=1)
              Buffers: shared hit=6738934 read=13332278
Planning Time: 0.111 ms
Execution Time: 90247.357 ms
Alternate Approaches I've Considered
- I considered using a materialized view to do the heavy lifting beforehand. However, the "Events" table has data inserted into it about a dozen times per second, and each such transaction can contain anywhere from zero to a few dozen rows. Therefore, if the view uses the same query plan as the existing query, the data in the view will be entirely out of date before a command to refresh it can complete. So I concluded there was no point to this strategy.
- I also considered adding a new integer sequence column to the "Events" table, then filtering on that sequence. But that doesn't really solve the problem. Each transaction that inserts events may or may not have events for a given event source, so the 5 most recent events for a given event source may include data from, for example, a transaction from 5 seconds ago, one from an hour ago, one from yesterday, one from last week, and one from last year. So there's no easy way to synchronize such a sequence across all the thousands of different event sources. Further, it's possible for an event source to manually submit data that was missed at an arbitrary time in the past, so sorting on "Time" seems to be right.
- I considered having a table that only tracks, say, the 50 most recent events for each event source and drops anything older (a rough trigger sketch follows this list). That way, the query could go ahead and rank the entire table because there wouldn't be much data to rank. However, the query would no longer satisfy Use Case #2; and, while Use Case #2 isn't the thrust of this question, I can't just remove it from the query without providing a replacement.
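For concreteness, here is a rough sketch of how I imagined maintaining such a capped table (the "RecentEvents" table, function, and trigger names here are hypothetical):

CREATE OR REPLACE FUNCTION trim_recent_events() RETURNS trigger AS $$
BEGIN
    -- Keep only the 50 newest rows for the event source that was just inserted.
    DELETE FROM "RecentEvents"
    WHERE "EventSourceId" = NEW."EventSourceId"
      AND "Id" NOT IN (
          SELECT "Id"
          FROM "RecentEvents"
          WHERE "EventSourceId" = NEW."EventSourceId"
          ORDER BY "Time" DESC NULLS LAST, "AltKey" DESC NULLS LAST
          LIMIT 50
      );
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER "TR_RecentEvents_Trim"
AFTER INSERT ON "RecentEvents"
FOR EACH ROW EXECUTE FUNCTION trim_recent_events();

Even then, this only helps Use Case #1, which is why I set the idea aside.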
Clarification
The row_number-over-partition question is the main question here. The material about Use Case #2, and about not knowing how to index for it, is included in case it inspires someone to realize a better plan, query, or schema; I'm not expecting a solution for it, just wanting to include any information that may be relevant.
Answer
A new optimization was introduced in PostgreSQL 15 which allows the executor to stop computing the ranks once it has found the needed number of rows in each partition (you are on 13, so this would require an upgrade). This is not magic: it still needs to read all the rows, because it has no way of knowing where the next partition starts other than by reading rows until the partition key changes. But it should at least shave off some time. When this optimization is in effect, it shows up in the plan with a line like:
WindowAgg (cost=...
Run Condition: (row_number() OVER (?) <= 100)
You could probably improve on that by implementing an index skip scan. PostgreSQL doesn't implement those automatically, but you can emulate one with a recursive CTE, as shown on the project wiki (although it looks like maybe someone goobered it up a bit since the last time I visited it).
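Roughly, the recursive CTE hops through the index to find each distinct "EventSourceId", and a lateral subquery then grabs the top rows for each one. Here is a sketch of that pattern adapted to your schema; it's from memory rather than copied from the wiki, so test it before trusting it:

WITH RECURSIVE sources AS (
    -- Seed: the smallest "EventSourceId" in the index.
    (SELECT "EventSourceId" FROM "Events" ORDER BY "EventSourceId" LIMIT 1)
    UNION ALL
    -- Step: skip to the next distinct "EventSourceId"; yields NULL when done.
    SELECT (SELECT e."EventSourceId"
            FROM "Events" e
            WHERE e."EventSourceId" > s."EventSourceId"
            ORDER BY e."EventSourceId"
            LIMIT 1)
    FROM sources s
    WHERE s."EventSourceId" IS NOT NULL
)
SELECT t.*
FROM sources s
CROSS JOIN LATERAL (
    SELECT *
    FROM "Events"
    WHERE "EventSourceId" = s."EventSourceId"
    ORDER BY "Time" DESC NULLS LAST, "AltKey" DESC NULLS LAST
    LIMIT 100  -- @N
) t
WHERE s."EventSourceId" IS NOT NULL;

Each step of the recursion is a single index probe, so it visits roughly one index entry per distinct source instead of walking all 20 million rows.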
Since you already have a table of distinct event sources, you could make use of that with a lateral join rather than the CTE, like this (I'm not going to hop all over my keyboard with the capitals and quotes; I leave that to you):
select t.*
from eventsources
cross join lateral (
    select *
    from events
    where eventsourceid = eventsources.id
    order by time desc nulls last, altkey desc nulls last
    limit 100  -- @N
) t;
The unused rows in EventSources will slow this down a bit, but it should still probably be much faster than your current query.
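For reference, here is my best guess at the same query with your quoted identifiers put back (untested, so double-check it against your schema):

select t.*
from "EventSources"
cross join lateral (
    select *
    from "Events"
    where "Events"."EventSourceId" = "EventSources"."Id"
    order by "Events"."Time" desc nulls last, "Events"."AltKey" desc nulls last
    limit 100  -- @N
) t;

Each lateral iteration should be a single descent into "IX_Events_EventSourceId_Time_Desc_AltKey_Desc" followed by at most 100 index tuples, which is why this should beat ranking the whole table.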
For your related question, I don't know; I haven't read it carefully. Ask another question if you have another question, once you've verified that this approach works for you on this one.
Comment: Excellent. The lateral join took it from 90 seconds to under 1 second. Thank you so much. – kanders84152, Oct 12, 2023 at 22:52
WHERE "Events"."Time" > <some reasonable cut-off time>
?