I have a table of data (omitted below), a table of state transitions for that data, and a table of snapshots of that data.
CREATE TABLE thing_event (thing int, created timestamp, old_state bool, new_state bool);
CREATE TABLE thing_snapshot (thing int, created timestamp, state bool);
CREATE INDEX thing_event_idx ON thing_event USING btree (thing, created);
CREATE INDEX thing_snapshot_idx ON thing_snapshot USING btree (thing, created);
INSERT INTO thing_event (thing, created, old_state, new_state)
SELECT generate_series, now() - interval '365' day * random(), random() > 0.5, random() > 0.5
FROM generate_series(1, 100000) JOIN (SELECT 1 FROM generate_series(1, 10)) g2 ON true;
INSERT INTO thing_snapshot (thing, created, state)
SELECT thing, created, new_state FROM thing_event;
I want to get the state of the data as of a historical state transition:
EXPLAIN ANALYZE WITH cte AS
(SELECT *, row_number() OVER (PARTITION BY sn.thing ORDER BY sn.created DESC)
FROM thing_snapshot sn
JOIN thing_event ev
ON sn.thing = ev.thing AND sn.created <= ev.created)
SELECT * FROM cte WHERE row_number = 1;
Subquery Scan on cte (cost=57554.25..512495.39 rows=16609 width=35) (actual time=105.736..2687.783 rows=100000 loops=1)
Filter: (cte.row_number = 1)
Rows Removed by Filter: 5400000
-> WindowAgg (cost=57554.25..470971.97 rows=3321873 width=35) (actual time=105.736..2565.230 rows=5500000 loops=1)
-> Merge Join (cost=57554.25..412839.20 rows=3321873 width=27) (actual time=105.724..1308.860 rows=5500000 loops=1)
Merge Cond: (sn.thing = ev.thing)
Join Filter: (sn.created <= ev.created)
Rows Removed by Join Filter: 4500000
-> Gather Merge (cost=57552.00..174018.48 rows=1000000 width=13) (actual time=105.673..228.029 rows=1000000 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=56551.98..57593.65 rows=416667 width=13) (actual time=97.910..122.637 rows=333333 loops=3)
Sort Key: sn.thing, sn.created DESC
Sort Method: external merge Disk: 9272kB
Worker 0: Sort Method: external merge Disk: 8520kB
Worker 1: Sort Method: external merge Disk: 8704kB
-> Parallel Seq Scan on thing_snapshot sn (cost=0.00..10536.67 rows=416667 width=13) (actual time=0.074..22.694 rows=333333 loops=3)
-> Materialize (cost=0.42..64424.99 rows=1000000 width=14) (actual time=0.022..370.762 rows=9999991 loops=1)
-> Index Scan using thing_event_idx on thing_event ev (cost=0.42..61924.99 rows=1000000 width=14) (actual time=0.017..147.555 rows=1000000 loops=1)
That all makes sense: 5.5MM is 1MM events times an average of half of each thing's ten snapshots matching the join condition. But is there a faster way to pull this off? The Rows Removed by Filter: 5400000 would be nice to bypass somehow.
Here's an alternative, simpler but slower:
EXPLAIN ANALYZE
SELECT *
FROM thing_event ev
INNER JOIN LATERAL
(SELECT *, row_number() OVER (PARTITION BY thing ORDER BY created DESC)
FROM thing_snapshot WHERE thing = ev.thing AND created <= ev.created) sn ON row_number = 1
Nested Loop (cost=0.42..16562910.43 rows=1000000 width=35) (actual time=0.045..2946.091 rows=1000000 loops=1)
-> Seq Scan on thing_event ev (cost=0.00..16370.00 rows=1000000 width=14) (actual time=0.011..39.413 rows=1000000 loops=1)
-> Subquery Scan on sn (cost=0.42..16.54 rows=1 width=21) (actual time=0.001..0.003 rows=1 loops=1000000)
Filter: (sn.row_number = 1)
Rows Removed by Filter: 4
-> WindowAgg (cost=0.42..16.50 rows=3 width=21) (actual time=0.001..0.003 rows=6 loops=1000000)
-> Index Scan Backward using thing_snapshot_idx on thing_snapshot (cost=0.42..16.45 rows=3 width=13) (actual time=0.001..0.001 rows=6 loops=1000000)
Index Cond: ((thing = ev.thing) AND (created <= ev.created))
EDIT
I may have simplified to the point of obfuscation. created is heterogeneous between event and snapshot: I want the snapshot's data as of the event's created date. Here's some data to illustrate:
event1 | 2021-09-01
event2 | 2021-09-11
event3 | 2021-09-21
snapshot1 | 2021-09-05
snapshot2 | 2021-09-15
If I filter by created < '2021-09-20', I want the most recent event before that date, event2, and the most recent snapshot before that event, snapshot1, not snapshot2.
To get the example data, change the last statement above to:
INSERT INTO thing_snapshot (thing, created, state)
SELECT thing, created - interval '30' day * random(), new_state
FROM thing_event;
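Expressed directly, the intent is something like this sketch (the cutoff date is only for illustration, and this naive form is probably slow; it is what I want, not how I want to get it):
-- For each thing: the latest event before the cutoff, then the latest
-- snapshot at or before that event's created timestamp.
SELECT ev.thing, ev.created AS event_created,
 sn.created AS snapshot_created, sn.state
FROM (SELECT DISTINCT ON (thing) *
 FROM thing_event
 WHERE created < '2021-09-20'
 ORDER BY thing, created DESC) ev
JOIN LATERAL
 (SELECT *
 FROM thing_snapshot s
 WHERE s.thing = ev.thing AND s.created <= ev.created
 ORDER BY s.created DESC
 LIMIT 1) sn ON true;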
1 Answer
I think you've over-simplified your example, but I have to work with what is provided.
Related: https://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group
Your query is asking for the most recent event for each thing, an example of the venerable greatest-n-per-group pattern. The join is superfluous. In that case, this query is the fastest I have found so far:
SELECT te2.thing AS thing, max(created) AS created
FROM thing_event te
JOIN thing_event te2 USING (created)
GROUP BY te2.thing
A PostgreSQL-specific version is this one (which can be faster):
SELECT DISTINCT ON (thing)
*
FROM thing_event ev
ORDER BY thing, created DESC
If you want to keep the join:
SELECT DISTINCT ON (thing)
*
FROM thing_snapshot sn
JOIN thing_event ev USING (thing)
ORDER BY thing, sn.created DESC
It seems, though, you wanted this, which gets the state of all things as of a timestamp:
SELECT ts2.thing, ts2.state, max(created) AS created
FROM thing_snapshot ts
JOIN thing_snapshot ts2 USING (created)
WHERE created <= '2021-04-26 15:02:13.720671'
GROUP BY ts2.thing, ts2.state
ORDER BY 3 desc
None of these are going to be particularly fast, since you want all the things and the most efficient access method is a table scan. I experimented with various indexes and they all reduced performance. In this query PostgreSQL can do a synchronized table scan and read thing_snapshot just once. Increasing work_mem improves the performance of the hash join.
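For example, scoped to the current session only (the value is illustrative; size it to the memory you can spare per connection):
-- Session-scoped; does not persist or affect other connections.
SET work_mem = '100MB';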
Limitations of the above:
- Since none of your columns are defined NOT NULL, things will likely get nutty if any NULLs show up. I did not consider them as part of these queries.
- If there are conflicting entries (same created time and different state for a given thing) then both will be returned, or a random one in the case of the SELECT DISTINCT ON queries. A UNIQUE constraint on (thing, created) is recommended to avoid this possibility; see the sketch after this list.
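A minimal sketch of such constraints (the constraint names are mine; pick your own):
-- Guarantees at most one row per (thing, created) pair in each table.
ALTER TABLE thing_event ADD CONSTRAINT thing_event_uq UNIQUE (thing, created);
ALTER TABLE thing_snapshot ADD CONSTRAINT thing_snapshot_uq UNIQUE (thing, created);
The unique index backing each constraint covers the same (thing, created) ordering as the plain btree indexes created earlier, so those could then be dropped.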
EDIT
As per your update I have built this query. It has some issues with the row estimates being wildly off, and it isn't very fast (takes about 2.5s on my server), but it returns the requested data. Out of time tonight to optimize more. Might be a candidate for the index skip scan hack; a sketch follows the query.
WITH cte_ts AS (
SELECT ts.thing, ts.created, ts2.state
FROM (
SELECT thing, max(created) AS created FROM thing_snapshot ts_s
WHERE created <= '2021-04-26'
GROUP BY thing
) ts
JOIN thing_snapshot ts2 ON (ts.thing=ts2.thing AND ts.created=ts2.created)
)
, cte_te AS (
SELECT te.thing, te.created, te2.old_state, te2.new_state
FROM (
SELECT thing, max(created) AS created FROM thing_event te_s
WHERE created <= '2021-04-26'
GROUP BY thing
) te
JOIN thing_event te2 ON (te.thing=te2.thing AND te.created=te2.created)
)
SELECT * FROM cte_ts
JOIN cte_te USING (thing)
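For reference, here is a sketch of that skip scan (loose index scan) hack, using a recursive CTE to enumerate the distinct things one index probe at a time; the cutoff date is illustrative and matches the one above:
-- Walk the (thing, created) index: each recursive step fetches the next
-- distinct thing with a single probe instead of scanning the whole table.
WITH RECURSIVE things AS (
 (SELECT thing FROM thing_snapshot ORDER BY thing LIMIT 1)
 UNION ALL
 SELECT (SELECT s.thing FROM thing_snapshot s
 WHERE s.thing > t.thing ORDER BY s.thing LIMIT 1)
 FROM things t
 WHERE t.thing IS NOT NULL
)
SELECT t.thing,
 (SELECT max(s.created) FROM thing_snapshot s
 WHERE s.thing = t.thing AND s.created <= '2021-04-26') AS created
FROM things t
WHERE t.thing IS NOT NULL;
This replaces the full scans inside the GROUP BY thing subqueries with roughly one index probe per distinct thing, which pays off when there are few things relative to the total row count.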
EDIT 2
I got it down to 1.2s with these settings:
SET join_collapse_limit=1;
SET work_mem='100MB';
- I've edited my question; you're right that my example data didn't include the essential feature that the snapshot's timestamp is not the same as the event's timestamp. So your suggestion that a full scan is necessary is probably accurate. – jstaab, Sep 8, 2021 at 15:15