I have a PostgreSQL table with ~millions of rows (timestamp, event, distinct_id, properties) where I want to

  • filter based on event (low cardinality) and timestamp,
  • and group by date_trunc('week', timestamp) and distinct_id (high cardinality)

I have already squeezed good performance out of my COUNT(*) ... GROUP BY week queries, specifically this one:

SELECT
 date_trunc('week', timestamp) AS "timestamp"
 , count(*) AS "count"
FROM telemetry_events
WHERE (
 (event = 'view load' OR event = 'view:loaded')
 AND timestamp >= timestamp with time zone '2022-01-02 00:00:00.000Z'
)
GROUP BY
 date_trunc('week', timestamp)
ORDER BY
 date_trunc('week', timestamp) ASC
;

by creating two indexes:

CREATE INDEX idx_timestamp ON public.telemetry_events USING btree ("timestamp");
CREATE INDEX telemetry_events_event_timestamp_idx ON public.telemetry_events USING btree (event, "timestamp");

The current execution time is under 7 seconds, which I find excellent (at least for now). This is the EXPLAIN ANALYZE:

GroupAggregate  (cost=83849.24..87478.16 rows=181418 width=16) (actual time=6971.824..6989.712 rows=21 loops=1)
  Group Key: (date_trunc('week'::text, "timestamp"))
  ->  Sort  (cost=83849.24..84302.97 rows=181493 width=8) (actual time=6971.735..6978.683 rows=116425 loops=1)
        Sort Key: (date_trunc('week'::text, "timestamp"))
        Sort Method: quicksort  Memory: 8189kB
        ->  Index Only Scan using telemetry_events_event_timestamp_idx on telemetry_events  (cost=0.43..67996.23 rows=181493 width=8) (actual time=6507.580..6906.571 rows=116425 loops=1)
              Index Cond: ("timestamp" >= '2022-01-02 00:00:00+00'::timestamp with time zone)
              Filter: (((event)::text = 'view load'::text) OR ((event)::text = 'view:loaded'::text))
              Rows Removed by Filter: 639794
              Heap Fetches: 0
Planning Time: 1.873 ms
Execution Time: 6989.778 ms

However, if I add my high-cardinality distinct_id to the GROUP BY, the queries are much slower (45 to 70 seconds):

SELECT
 date_trunc('week', timestamp) AS "timestamp"
 , distinct_id
 , count(*) AS "count"
FROM telemetry_events
WHERE (
 (event = 'view load' OR event = 'view:loaded')
 AND timestamp >= timestamp with time zone '2022-01-02 00:00:00.000Z'
)
GROUP BY (
 date_trunc('week', timestamp)
 , distinct_id
)
ORDER BY
 date_trunc('week', timestamp) ASC
;

This is the EXPLAIN ANALYZE:

Finalize GroupAggregate  (cost=88115.68..110676.54 rows=181418 width=52) (actual time=71956.298..72028.293 rows=1326 loops=1)
  Group Key: (date_trunc('week'::text, "timestamp")), distinct_id
  ->  Gather Merge  (cost=88115.68..107274.48 rows=151244 width=52) (actual time=71956.254..72027.304 rows=2526 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial GroupAggregate  (cost=87115.65..88817.15 rows=75622 width=52) (actual time=71740.225..71758.172 rows=842 loops=3)
              Group Key: (date_trunc('week'::text, "timestamp")), distinct_id
              ->  Sort  (cost=87115.65..87304.71 rows=75622 width=44) (actual time=71736.479..71743.974 rows=38808 loops=3)
                    Sort Key: (date_trunc('week'::text, "timestamp")), distinct_id
                    Sort Method: quicksort  Memory: 4520kB
                    Worker 0:  Sort Method: quicksort  Memory: 4507kB
                    Worker 1:  Sort Method: quicksort  Memory: 4679kB
                    ->  Parallel Index Scan using idx_timestamp on telemetry_events  (cost=0.43..80987.81 rows=75622 width=44) (actual time=402.899..71576.001 rows=38808 loops=3)
                          Index Cond: ("timestamp" >= '2022-01-02 00:00:00+00'::timestamp with time zone)
                          Filter: (((event)::text = 'view load'::text) OR ((event)::text = 'view:loaded'::text))
                          Rows Removed by Filter: 213265
Planning Time: 1.942 ms
Execution Time: 72029.136 ms

What would be the best way to try to optimize these queries?

And some sub-questions: is my expectation of keeping this query under ~10 seconds reasonable? How fast can I make PostgreSQL for this kind of OLAP workload?

asked May 16, 2022 at 13:04

1 Answer

Why do you have (event = 'view load' OR event = 'view:loaded')? Do those actually mean two different things, or do you just have dirty data with two different spellings for the same meaning?

Your first query is just using the index as a skinny version of the table. Note that the Index Cond doesn't include the leading column of the index, which for a btree index means it is scanning the whole index and just applying the other column as an "in-index filter". This can be faster than scanning the table because the index might be much smaller than the table.

In your second query, this doesn't work well because one necessary column (distinct_id) is not in the index, so it needs to visit the table anyway. A good index for the 2nd query would be ("timestamp", event, distinct_id). This index should also work for the first query, even better than the current index.
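As a minimal sketch of that suggestion (the index name below is just a placeholder; the table and column names come from the question):

CREATE INDEX telemetry_events_timestamp_event_distinct_id_idx
 ON public.telemetry_events USING btree ("timestamp", event, distinct_id);

With all three referenced columns in the index, the second query can in principle be satisfied by an index-only scan (provided the visibility map is reasonably up to date), so the per-row heap visits of the current Parallel Index Scan should go away.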

But better yet would probably be cleaning your data so you don't need the OR. Or making a partial index that is filtered on the OR condition.
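A rough sketch of both options, assuming the two spellings really do mean the same event (which spelling to keep and the index name are illustrative choices):

-- Option 1: normalize the data once so only a single spelling remains
UPDATE telemetry_events
SET event = 'view load'
WHERE event = 'view:loaded';

-- Option 2: a partial index restricted to the rows these queries touch
CREATE INDEX telemetry_events_view_load_idx
 ON public.telemetry_events USING btree ("timestamp", distinct_id)
 WHERE event IN ('view load', 'view:loaded');

The partial index is only considered when the planner can prove the query's WHERE clause implies the index predicate, so the queries would need to keep an equivalent event condition (the IN list or the original OR) for it to be used.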

answered May 16, 2022 at 14:37
  • "do you just have dirty data with two difference spellings for the same meaning?" -- that's correct Commented May 16, 2022 at 16:00
  • "But better yet would probably be cleaning your data so you don't need the OR" -- I acknowledge this is a separate question, but given that this is a warehouse and I don't care so much about denormalized data, in your opinion what would be the best practice here? Commented May 16, 2022 at 16:03
  • By the way, I confirm that the OR clause is affecting performance quite a lot, thanks for the insight 👍 Commented May 16, 2022 at 16:08
