
I need to calculate rolling monthly active users (MAU) for each of the last 100 days. The output should have one column with the dates of the last 100 days counted back from today, and, for every one of those days, the number of distinct users that appear in the activity table within the preceding 30 days.

The query I have written returns the desired output, but it is too slow; in fact, it does not finish at all for 100 days (the query and plans below use a 10-day series for testing).

I need help with optimization!

Actual PostgreSQL Code

Select A.Daily, count(distinct da.user) as Users
from
 (
  select date(d) as Daily
  from generate_series(current_date - Interval '10' DAY, current_date - Interval '1' DAY, '1 day') d
  group by 1
  order by 1
 ) as A
join public.downloads_activity da
  on date_trunc('day', da.created_at) between A.Daily - Interval '30' DAY and A.Daily - Interval '1' DAY
group by 1
order by 1

Expected and Returned Result

Daily                        Users
January 12, 2020, 12:00 AM   35,654
January 13, 2020, 12:00 AM   35,465
January 14, 2020, 12:00 AM   35,777
January 15, 2020, 12:00 AM   35,980
January 16, 2020, 12:00 AM   35,953
January 17, 2020, 12:00 AM   35,899
January 18, 2020, 12:00 AM   35,829
January 19, 2020, 12:00 AM   35,627
January 20, 2020, 12:00 AM   35,348
January 21, 2020, 12:00 AM   35,564

Query Plan

GroupAggregate  (cost=25.16..70992118.48 rows=200 width=26)
  Group Key: (date(d.d))
  ->  Nested Loop  (cost=25.16..69869352.15 rows=224552867 width=26)
        Join Filter: ((date_trunc('day'::text, ast_d.created_at) >= ((date(d.d)) - '30 days'::interval day)) AND (date_trunc('day'::text, ast_d.created_at) <= ((date(d.d)) - '1 day'::interval day)))
        ->  Sort  (cost=25.16..25.66 rows=200 width=8)
              Sort Key: (date(d.d))
              ->  HashAggregate  (cost=15.02..17.52 rows=200 width=8)
                    Group Key: date(d.d)
                    ->  Function Scan on generate_series d  (cost=0.02..12.52 rows=1000 width=8)
        ->  Materialize  (cost=0.00..571429.19 rows=10104879 width=30)
              ->  Seq Scan on asset_downloads ast_d  (cost=0.00..451827.79 rows=10104879 width=30)

Query Plan with ANALYZE, BUFFERS

GroupAggregate  (cost=25.16..71011891.94 rows=200 width=26) (actual time=24608.996..182045.961 rows=10 loops=1)
  Group Key: (date(d.d))
  Buffers: shared hit=325977 read=24911 dirtied=28, temp read=552011 written=118170
  ->  Nested Loop  (cost=25.16..69888811.94 rows=224615600 width=26) (actual time=2213.459..93364.594 rows=16514694 loops=1)
        Join Filter: ((date_trunc('day'::text, ast_d.created_at) >= ((date(d.d)) - '30 days'::interval day)) AND (date_trunc('day'::text, ast_d.created_at) <= ((date(d.d)) - '1 day'::interval day)))
        Rows Removed by Join Filter: 84818776
        Buffers: shared hit=325969 read=24911 dirtied=28, temp read=488070 written=54229
        ->  Sort  (cost=25.16..25.66 rows=200 width=8) (actual time=0.081..0.093 rows=10 loops=1)
              Sort Key: (date(d.d))
              Sort Method: quicksort  Memory: 25kB
              Buffers: shared hit=3
              ->  HashAggregate  (cost=15.02..17.52 rows=200 width=8) (actual time=0.044..0.052 rows=10 loops=1)
                    Group Key: date(d.d)
                    ->  Function Scan on generate_series d  (cost=0.02..12.52 rows=1000 width=8) (actual time=0.028..0.033 rows=10 loops=1)
        ->  Materialize  (cost=0.00..571588.53 rows=10107702 width=30) (actual time=0.006..3572.397 rows=10133347 loops=10)
              Buffers: shared hit=325966 read=24911 dirtied=28, temp read=488070 written=54229
              ->  Seq Scan on asset_downloads ast_d  (cost=0.00..451954.02 rows=10107702 width=30) (actual time=0.016..3797.284 rows=10133347 loops=1)
                    Buffers: shared hit=325966 read=24911 dirtied=28
Planning time: 8.678 ms
Execution time: 182102.426 ms
asked Jan 22, 2020 at 8:35
  • Please provide EXPLAIN (ANALYZE, BUFFERS) output. Commented Jan 22, 2020 at 8:52
  • @LaurenzAlbe provided. Please check Commented Jan 22, 2020 at 9:25

2 Answers

You need an index:

CREATE INDEX ON downloads_activity (date_trunc('day',created_at));

Depending on the data type, both the index and the query might need time zone adjustment.
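For example, if created_at is a timestamptz, then date_trunc('day', created_at) is not immutable (it depends on the session time zone) and cannot be used in an index expression as written. A minimal sketch of the adjustment, assuming UTC is the zone the report should be based on (the question does not say), would be:

CREATE INDEX ON downloads_activity (date_trunc('day', created_at AT TIME ZONE 'UTC'));

The query would then have to filter on exactly the same expression for this index to be considered.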

answered Jan 22, 2020 at 9:45
  • The index is created on the created_at column but is not being used. It is used in cases like (downloads_activity.created_at > '20191201T00:00:00.000Z'). How can I make the query use it? Commented Jan 22, 2020 at 10:58
  • 1
    If the index is not used even after an ANALYZE of the table, then the query selects a big part of the table, and it wouldn't be faster. In that case your query cannot be made faster - at least I cannot do it. Commented Jan 22, 2020 at 11:32
Select A.Daily,
       count(distinct da.user) as Users
from
 (
  -- calculate all constants needed once
  select date(d) as Daily,
         date(d) - Interval '30' DAY as DailyStart,
         date(d) as DailyEnd -- exclusive upper bound, so the whole of the day Daily - 1 is still included
  from generate_series(current_date - Interval '10' DAY, current_date - Interval '1' DAY, '1 day') d
  -- GROUP BY on a unique value makes no sense
  -- ORDER BY in a subquery is ignored
 ) as A
-- avoid fn(table.field) so an index on the raw created_at column can be used
join public.downloads_activity da
  on da.created_at >= A.DailyStart
 and da.created_at < A.DailyEnd
group by 1
order by 1
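Because this rewrite compares the raw created_at column rather than a function of it, the index it could use is a plain b-tree on that column; the question does not show one, so the following is an assumption about the schema:

CREATE INDEX ON downloads_activity (created_at);

Whether the planner actually picks it depends on how large a share of the table each 30-day window covers, as discussed in the comments on the other answer.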
answered Jan 22, 2020 at 8:45
  • No luck, it is still taking too long. Commented Jan 22, 2020 at 9:26
  • @FridgeBot I do not see any other way to increase the performance, except to pre-calculate these counts on a per-day basis (or at whatever other granularity you need, an hour for example) into an additional table and read the data from that table. Commented Jan 22, 2020 at 9:35
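As a rough sketch of the pre-calculation idea from the comment above: the helper table daily_active_users and its refresh strategy are made up for illustration, only downloads_activity, created_at and user come from the question, and the distinct count still has to be done per window because distinct users cannot simply be summed across days.

-- Hypothetical helper table: one row per (day, user) pair,
-- shown here as a one-off CREATE TABLE AS; it would need periodic refreshing.
CREATE TABLE daily_active_users AS
SELECT date_trunc('day', created_at)::date AS day,
       "user"
FROM   public.downloads_activity
GROUP  BY 1, 2;

CREATE INDEX ON daily_active_users (day);

-- Rolling 30-day distinct users for each of the last 100 days,
-- computed against the much smaller helper table.
SELECT d::date AS Daily,
       (SELECT count(DISTINCT dau."user")
        FROM   daily_active_users dau
        WHERE  dau.day >= d::date - 30
        AND    dau.day <  d::date) AS Users
FROM   generate_series(current_date - 100, current_date - 1, interval '1 day') d
ORDER  BY 1;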
