I have a PostgreSQL table with ~millions of rows (timestamp, event, distinct_id, properties) where I want to

  • filter based on event (low cardinality) and timestamp,
  • and group by date_trunc('week', timestamp) and distinct_id (high cardinality)

I have already squeezed good performance out of my COUNT(*) ... GROUP BY week queries, specifically this one:

SELECT
 date_trunc('week', timestamp) AS "timestamp"
 , count(*) AS "count"
FROM telemetry_events
WHERE (
 (event = 'view load' OR event = 'view:loaded')
 AND timestamp >= timestamp with time zone '2022-01-02 00:00:00.000Z'
)
GROUP BY
 date_trunc('week', timestamp)
ORDER BY
 date_trunc('week', timestamp) ASC
;

by creating two indexes:

CREATE INDEX idx_timestamp ON public.telemetry_events USING btree ("timestamp");
CREATE INDEX telemetry_events_event_timestamp_idx ON public.telemetry_events USING btree (event, "timestamp");

The current execution time is under 7 seconds, which I find excellent (at least for now). This is the EXPLAIN ANALYZE:

GroupAggregate  (cost=83849.24..87478.16 rows=181418 width=16) (actual time=6971.824..6989.712 rows=21 loops=1)
  Group Key: (date_trunc('week'::text, "timestamp"))
  ->  Sort  (cost=83849.24..84302.97 rows=181493 width=8) (actual time=6971.735..6978.683 rows=116425 loops=1)
        Sort Key: (date_trunc('week'::text, "timestamp"))
        Sort Method: quicksort  Memory: 8189kB
        ->  Index Only Scan using telemetry_events_event_timestamp_idx on telemetry_events  (cost=0.43..67996.23 rows=181493 width=8) (actual time=6507.580..6906.571 rows=116425 loops=1)
              Index Cond: ("timestamp" >= '2022-01-02 00:00:00+00'::timestamp with time zone)
              Filter: (((event)::text = 'view load'::text) OR ((event)::text = 'view:loaded'::text))
              Rows Removed by Filter: 639794
              Heap Fetches: 0
Planning Time: 1.873 ms
Execution Time: 6989.778 ms

However, if I add my high-cardinality distinct_id to the GROUP BY, the queries are much slower (45 to 70 seconds):

SELECT
 date_trunc('week', timestamp) AS "timestamp"
 , distinct_id
 , count(*) AS "count"
FROM telemetry_events
WHERE (
 (event = 'view load' OR event = 'view:loaded')
 AND timestamp >= timestamp with time zone '2022-01-02 00:00:00.000Z'
)
GROUP BY (
 date_trunc('week', timestamp)
 , distinct_id
)
ORDER BY
 date_trunc('week', timestamp) ASC
;

This is the EXPLAIN ANALYZE:

Finalize GroupAggregate  (cost=88115.68..110676.54 rows=181418 width=52) (actual time=71956.298..72028.293 rows=1326 loops=1)
  Group Key: (date_trunc('week'::text, "timestamp")), distinct_id
  ->  Gather Merge  (cost=88115.68..107274.48 rows=151244 width=52) (actual time=71956.254..72027.304 rows=2526 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial GroupAggregate  (cost=87115.65..88817.15 rows=75622 width=52) (actual time=71740.225..71758.172 rows=842 loops=3)
              Group Key: (date_trunc('week'::text, "timestamp")), distinct_id
              ->  Sort  (cost=87115.65..87304.71 rows=75622 width=44) (actual time=71736.479..71743.974 rows=38808 loops=3)
                    Sort Key: (date_trunc('week'::text, "timestamp")), distinct_id
                    Sort Method: quicksort  Memory: 4520kB
                    Worker 0:  Sort Method: quicksort  Memory: 4507kB
                    Worker 1:  Sort Method: quicksort  Memory: 4679kB
                    ->  Parallel Index Scan using idx_timestamp on telemetry_events  (cost=0.43..80987.81 rows=75622 width=44) (actual time=402.899..71576.001 rows=38808 loops=3)
                          Index Cond: ("timestamp" >= '2022-01-02 00:00:00+00'::timestamp with time zone)
                          Filter: (((event)::text = 'view load'::text) OR ((event)::text = 'view:loaded'::text))
                          Rows Removed by Filter: 213265
Planning Time: 1.942 ms
Execution Time: 72029.136 ms

What would be the best way to try to optimize these queries?

And some sub-questions: is my expectation of keeping this query under ~10 seconds reasonable? How fast can I make PostgreSQL for this kind of OLAP workload?

asked May 16, 2022 at 13:04

1 Answer

Why do you have (event = 'view load' OR event = 'view:loaded')? Do those actually mean two different things, or do you just have dirty data with two different spellings for the same meaning?

Your first query is just using the index as a skinny version of the table. Note that the Index Cond doesn't include the leading column of the index, which for a btree index means it is scanning the whole index and just applying the other column as an "in-index filter". This can be faster than scanning the table because the index might be much smaller than the table.

In your second query, this doesn't work well because one necessary column (distinct_id) is not in the index, so it needs to visit the table anyway. A good index for the 2nd query would be ("timestamp", event, distinct_id). This index should also work for the first query, even better than the current index.
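As a minimal sketch of that suggestion (the index name below is just a placeholder; the table and column names come from the question):

CREATE INDEX telemetry_events_timestamp_event_distinct_id_idx
 ON public.telemetry_events USING btree ("timestamp", event, distinct_id);

With all three referenced columns in the index, the second query can in principle be satisfied by an index-only scan (provided the visibility map is reasonably up to date), so the per-row heap visits of the current Parallel Index Scan should go away.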

But better yet would probably be cleaning your data so you don't need the OR. Or making a partial index that is filtered on the OR condition.
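A rough sketch of both options, assuming the two spellings really do mean the same event (which spelling to keep and the index name are illustrative choices):

-- Option 1: normalize the data once so only a single spelling remains
UPDATE telemetry_events
SET event = 'view load'
WHERE event = 'view:loaded';

-- Option 2: a partial index restricted to the rows these queries touch
CREATE INDEX telemetry_events_view_load_idx
 ON public.telemetry_events USING btree ("timestamp", distinct_id)
 WHERE event IN ('view load', 'view:loaded');

The partial index is only considered when the planner can prove the query's WHERE clause implies the index predicate, so the queries would need to keep an equivalent event condition (the IN list or the original OR) for it to be used.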

answered May 16, 2022 at 14:37
  • "do you just have dirty data with two difference spellings for the same meaning?" -- that's correct Commented May 16, 2022 at 16:00
  • "But better yet would probably be cleaning your data so you don't need the OR" -- I acknowledge this is a separate question, but given that this is a warehouse and I don't care so much about denormalized data, in your opinion what would be the best practice here? Commented May 16, 2022 at 16:03
  • By the way, I confirm that the OR clause is affecting performance quite a lot, thanks for the insight 👍 Commented May 16, 2022 at 16:08
