
I need to calculate rolling monthly active users (MAU) for each of the last 100 days. The output should have one column with the dates of the last 100 days counted back from today, and, for every one of those days, the number of distinct users that appear in the activity table within the preceding 30 days.

The query I have written returns the desired output, but it is too slow; in fact, it does not finish at all for 100 days (the query and plans below use a 10-day series for testing).

I need help with optimization!

Actual PostgreSQL Code

Select A.Daily, count(distinct da.user) as Users
from
 (
  select date(d) as Daily
  from generate_series(current_date - Interval '10' DAY, current_date - Interval '1' DAY, '1 day') d
  group by 1
  order by 1
 ) as A
join public.downloads_activity da
  on date_trunc('day', da.created_at) between A.Daily - Interval '30' DAY and A.Daily - Interval '1' DAY
group by 1
order by 1

Expected and Returned Result

Daily                        Users
January 12, 2020, 12:00 AM   35,654
January 13, 2020, 12:00 AM   35,465
January 14, 2020, 12:00 AM   35,777
January 15, 2020, 12:00 AM   35,980
January 16, 2020, 12:00 AM   35,953
January 17, 2020, 12:00 AM   35,899
January 18, 2020, 12:00 AM   35,829
January 19, 2020, 12:00 AM   35,627
January 20, 2020, 12:00 AM   35,348
January 21, 2020, 12:00 AM   35,564

Query Plan

GroupAggregate  (cost=25.16..70992118.48 rows=200 width=26)
  Group Key: (date(d.d))
  ->  Nested Loop  (cost=25.16..69869352.15 rows=224552867 width=26)
        Join Filter: ((date_trunc('day'::text, ast_d.created_at) >= ((date(d.d)) - '30 days'::interval day)) AND (date_trunc('day'::text, ast_d.created_at) <= ((date(d.d)) - '1 day'::interval day)))
        ->  Sort  (cost=25.16..25.66 rows=200 width=8)
              Sort Key: (date(d.d))
              ->  HashAggregate  (cost=15.02..17.52 rows=200 width=8)
                    Group Key: date(d.d)
                    ->  Function Scan on generate_series d  (cost=0.02..12.52 rows=1000 width=8)
        ->  Materialize  (cost=0.00..571429.19 rows=10104879 width=30)
              ->  Seq Scan on asset_downloads ast_d  (cost=0.00..451827.79 rows=10104879 width=30)

Query Plan with ANALYZE, BUFFERS

GroupAggregate  (cost=25.16..71011891.94 rows=200 width=26) (actual time=24608.996..182045.961 rows=10 loops=1)
  Group Key: (date(d.d))
  Buffers: shared hit=325977 read=24911 dirtied=28, temp read=552011 written=118170
  ->  Nested Loop  (cost=25.16..69888811.94 rows=224615600 width=26) (actual time=2213.459..93364.594 rows=16514694 loops=1)
        Join Filter: ((date_trunc('day'::text, ast_d.created_at) >= ((date(d.d)) - '30 days'::interval day)) AND (date_trunc('day'::text, ast_d.created_at) <= ((date(d.d)) - '1 day'::interval day)))
        Rows Removed by Join Filter: 84818776
        Buffers: shared hit=325969 read=24911 dirtied=28, temp read=488070 written=54229
        ->  Sort  (cost=25.16..25.66 rows=200 width=8) (actual time=0.081..0.093 rows=10 loops=1)
              Sort Key: (date(d.d))
              Sort Method: quicksort  Memory: 25kB
              Buffers: shared hit=3
              ->  HashAggregate  (cost=15.02..17.52 rows=200 width=8) (actual time=0.044..0.052 rows=10 loops=1)
                    Group Key: date(d.d)
                    ->  Function Scan on generate_series d  (cost=0.02..12.52 rows=1000 width=8) (actual time=0.028..0.033 rows=10 loops=1)
        ->  Materialize  (cost=0.00..571588.53 rows=10107702 width=30) (actual time=0.006..3572.397 rows=10133347 loops=10)
              Buffers: shared hit=325966 read=24911 dirtied=28, temp read=488070 written=54229
              ->  Seq Scan on asset_downloads ast_d  (cost=0.00..451954.02 rows=10107702 width=30) (actual time=0.016..3797.284 rows=10133347 loops=1)
                    Buffers: shared hit=325966 read=24911 dirtied=28
Planning time: 8.678 ms
Execution time: 182102.426 ms
asked Jan 22, 2020 at 8:35
  • Please provide EXPLAIN (ANALYZE, BUFFERS) output. Commented Jan 22, 2020 at 8:52
  • @LaurenzAlbe provided. Please check Commented Jan 22, 2020 at 9:25

2 Answers

You need an index:

CREATE INDEX ON downloads_activity (date_trunc('day',created_at));

Depending on the data type, both the index and the query might need time zone adjustment.
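For example, if created_at is a timestamptz, then date_trunc('day', created_at) is not immutable (it depends on the session time zone) and cannot be used in an index expression as written. A minimal sketch of the adjustment, assuming UTC is the zone the report should be based on (the question does not say), would be:

CREATE INDEX ON downloads_activity (date_trunc('day', created_at AT TIME ZONE 'UTC'));

The query would then have to filter on exactly the same expression for this index to be considered.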

answered Jan 22, 2020 at 9:45
  • The index is created on the created_at column but is not being used. It is used in cases like (downloads_activity.created_at > '20191201T00:00:00.000Z'). How can I make the query use it? Commented Jan 22, 2020 at 10:58
  • 1
    If the index is not used even after an ANALYZE of the table, then the query selects a big part of the table, and it wouldn't be faster. In that case your query cannot be made faster - at least I cannot do it. Commented Jan 22, 2020 at 11:32
Select A.Daily,
       count(distinct da.user) as Users
from
 (
  -- calculate all constants needed once
  select date(d) as Daily,
         date(d) - Interval '30' DAY as DailyStart,
         date(d) as DailyEnd -- exclusive upper bound, so the whole of the day Daily - 1 is still included
  from generate_series(current_date - Interval '10' DAY, current_date - Interval '1' DAY, '1 day') d
  -- GROUP BY on a unique value makes no sense
  -- ORDER BY in a subquery is ignored
 ) as A
-- avoid fn(table.field) so an index on the raw created_at column can be used
join public.downloads_activity da
  on da.created_at >= A.DailyStart
 and da.created_at < A.DailyEnd
group by 1
order by 1
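Because this rewrite compares the raw created_at column rather than a function of it, the index it could use is a plain b-tree on that column; the question does not show one, so the following is an assumption about the schema:

CREATE INDEX ON downloads_activity (created_at);

Whether the planner actually picks it depends on how large a share of the table each 30-day window covers, as discussed in the comments on the other answer.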
answered Jan 22, 2020 at 8:45
  • No luck, it is still taking too long. Commented Jan 22, 2020 at 9:26
  • @FridgeBot I do not see any other way to increase the performance, except to pre-calculate these counts on a per-day basis (or at whatever other granularity you need, an hour for example) into an additional table and read the data from that table. Commented Jan 22, 2020 at 9:35
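As a rough sketch of the pre-calculation idea from the comment above: the helper table daily_active_users and its refresh strategy are made up for illustration, only downloads_activity, created_at and user come from the question, and the distinct count still has to be done per window because distinct users cannot simply be summed across days.

-- Hypothetical helper table: one row per (day, user) pair,
-- shown here as a one-off CREATE TABLE AS; it would need periodic refreshing.
CREATE TABLE daily_active_users AS
SELECT date_trunc('day', created_at)::date AS day,
       "user"
FROM   public.downloads_activity
GROUP  BY 1, 2;

CREATE INDEX ON daily_active_users (day);

-- Rolling 30-day distinct users for each of the last 100 days,
-- computed against the much smaller helper table.
SELECT d::date AS Daily,
       (SELECT count(DISTINCT dau."user")
        FROM   daily_active_users dau
        WHERE  dau.day >= d::date - 30
        AND    dau.day <  d::date) AS Users
FROM   generate_series(current_date - 100, current_date - 1, interval '1 day') d
ORDER  BY 1;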
