I have a slow Postgres (v15.3) query and I can't seem to get to the bottom of why. I've considered partitioning, but articles suggest that with a 10GB database partitioning isn't required and it seems like a real pain to set up on a production database.
The tables are 'large' but small in the grand scheme of things compared to other PGSQL databases.
student_tracks has 37m rows, which is one month's worth of data, so it grows at a reasonable rate. However, we'll only retain the last 3 months' worth, so roughly 100m rows on average.
students has 10m rows and, like the above, holds one month's worth of data, capped at 3 months.
The problem is producing monthly reports on these tables (ideally I'd also like to run 3- and 6-month reports, but 1 month is fine for now). The current query is as follows:
select
"students"."id",
"students"."student_type_id",
"students"."user_agent",
"students"."user_agent_is_mobile",
(
select
TO_CHAR(
MAX(student_tracks.enrolled_at) :: DATE,
'Mon yy'
)
from
"student_tracks"
where
student_tracks.student_id = students.id
and "student_tracks"."enrolled_at" between '2023-02-01'
and '2023-05-30'
) as "last_enrolled",
(
select
1
from
"chats"
where
chats.student_id = students.id
and "chats"."created_at" between '2023-02-01'
and '2023-05-30'
limit
1
) as "has_chat",
(
select
1
from
"chat_reports"
inner join "chats" on "chats"."id" = "chat_reports"."chat_id"
where
chats.student_id = students.id
and "chat_reports"."created_at" between '2023-02-01'
and '2023-05-30'
limit
1
) as "has_chat_report"
from
"students"
where
exists (
select
1
from
"student_tracks"
where
student_tracks.student_id = students.id
and "student_tracks"."enrolled_at" between '2023-02-01'
and '2023-05-30'
)
This query takes 352 seconds and returns 7.8m records, running against a read replica that is idle apart from these test queries.
The query plan for the above is as follows:
Gather  (cost=1049084.91..31608500.92 rows=702040 width=215) (actual time=15461.166..352171.314 rows=7132605 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=80066569 read=434196
  I/O Timings: shared/local read=774055.958
  ->  Parallel Hash Semi Join  (cost=1048084.91..1693374.65 rows=292517 width=175) (actual time=15436.019..22445.654 rows=2377535 loops=3)
        Hash Cond: ((students.id)::text = (student_tracks.student_id)::text)
        Buffers: shared hit=688458 read=354409
        I/O Timings: shared/local read=658645.994
        ->  Parallel Seq Scan on students  (cost=0.00..426288.80 rows=4323980 width=175) (actual time=0.008..3101.324 rows=3517892 loops=3)
              Buffers: shared hit=28393 read=354409
              I/O Timings: shared/local read=658645.994
        ->  Parallel Hash  (cost=894892.34..894892.34 rows=12255405 width=32) (actual time=15361.488..15361.489 rows=9815722 loops=3)
              Buckets: 33554432  Batches: 1  Memory Usage: 2337216kB
              Buffers: shared hit=659943
              ->  Parallel Seq Scan on student_tracks  (cost=0.00..894892.34 rows=12255405 width=32) (actual time=0.019..6883.347 rows=9815722 loops=3)
                    Filter: ((connected_at >= '2023-02-01 00:00:00'::timestamp without time zone) AND (connected_at <= '2023-05-30 00:00:00'::timestamp without time zone))
                    Rows Removed by Filter: 2982354
                    Buffers: shared hit=659943
  SubPlan 2
    ->  Result  (cost=4.69..4.70 rows=1 width=32) (actual time=0.026..0.026 rows=1 loops=7132605)
          Buffers: shared hit=35596058 read=66967
          I/O Timings: shared/local read=86084.242
          InitPlan 1 (returns 1ドル)
            ->  Limit  (cost=0.56..4.69 rows=1 width=8) (actual time=0.025..0.025 rows=1 loops=7132605)
                  Buffers: shared hit=35596058 read=66967
                  I/O Timings: shared/local read=86084.242
                  ->  Index Scan Backward using "~student_tracks-a1b7f7a8" on student_tracks student_tracks_1  (cost=0.56..173.61 rows=42 width=8) (actual time=0.023..0.023 rows=1 loops=7132605)
                        Index Cond: (((student_id)::text = (students.id)::text) AND (connected_at IS NOT NULL) AND (connected_at >= '2023-02-01 00:00:00'::timestamp without time zone) AND (connected_at <= '2023-05-30 00:00:00'::timestamp without time zone))
                        Buffers: shared hit=35596058 read=66967
                        I/O Timings: shared/local read=86084.242
  SubPlan 3
    ->  Limit  (cost=0.43..8.45 rows=1 width=4) (actual time=0.010..0.010 rows=0 loops=7132605)
          Buffers: shared hit=21556886 read=8996
          I/O Timings: shared/local read=10695.264
          ->  Index Scan using "~chats-5ca320c0" on chats  (cost=0.43..8.45 rows=1 width=4) (actual time=0.009..0.009 rows=0 loops=7132605)
                Index Cond: (((student_id)::text = (students.id)::text) AND (created_at >= '2023-02-01 00:00:00'::timestamp without time zone) AND (created_at <= '2023-05-30 00:00:00'::timestamp without time zone))
                Buffers: shared hit=21556886 read=8996
                I/O Timings: shared/local read=10695.264
  SubPlan 4
    ->  Limit  (cost=0.85..29.36 rows=1 width=4) (actual time=0.009..0.009 rows=0 loops=7132605)
          Buffers: shared hit=22225167 read=3824
          I/O Timings: shared/local read=18630.459
          ->  Nested Loop  (cost=0.85..29.36 rows=1 width=4) (actual time=0.009..0.009 rows=0 loops=7132605)
                Buffers: shared hit=22225167 read=3824
                I/O Timings: shared/local read=18630.459
                ->  Index Scan using "~chats-4f9644a3" on chats chats_1  (cost=0.43..12.46 rows=2 width=33) (actual time=0.007..0.007 rows=0 loops=7132605)
                      Index Cond: ((student_id)::text = (students.id)::text)
                      Buffers: shared hit=21581799 read=2191
                      I/O Timings: shared/local read=16709.616
                ->  Index Scan using chat_leads_pkey on chat_leads  (cost=0.42..8.45 rows=1 width=33) (actual time=0.019..0.019 rows=0 loops=184067)
                      Index Cond: ((chat_id)::text = (chats_1.id)::text)
                      Filter: ((created_at >= '2023-02-01 00:00:00'::timestamp without time zone) AND (created_at <= '2023-05-30 00:00:00'::timestamp without time zone))
                      Rows Removed by Filter: 0
                      Buffers: shared hit=643368 read=1633
                      I/O Timings: shared/local read=1920.842
Planning:
  Buffers: shared hit=266
Planning Time: 0.876 ms
Execution Time: 353064.689 ms
I would really welcome any suggestions on how to improve this query. I've tried all sorts of online Postgres query-plan analyzers and just can't get it down to a reasonable timeframe.
Thanks!
- Please add your relevant table definitions and index definitions to your post. – J.D., Jun 5, 2023
- Why is 6 minutes too long to wait for a monthly report? – jjanes, Jun 5, 2023
1 Answer
I would start by converting the subqueries into joins:
SELECT
"students"."id",
"students"."student_type_id",
"students"."user_agent",
"students"."user_agent_is_mobile",
TO_CHAR(
MAX(student_tracks.enrolled_at) :: DATE,
'Mon yy'
) as "last_enrolled",
( COUNT("chats"."id") > 0 ) as "has_chat",
( COUNT("chat_reports"."chat_id") > 0 ) as "has_chat_report"
FROM "students"
JOIN "student_tracks" ON student_tracks.student_id = students.id
LEFT JOIN "chats" ON (chats.student_id = students.id and "chats"."created_at" between '2023-02-01' and '2023-05-30')
LEFT JOIN "chat_reports" ON ("chats"."id" = "chat_reports"."chat_id" and "chat_reports"."created_at" between '2023-02-01' and '2023-05-30' )
WHERE
"student_tracks"."enrolled_at" between '2023-02-01' and '2023-05-30'
GROUP BY
"students"."id",
"students"."student_type_id",
"students"."user_agent",
"students"."user_agent_is_mobile"
;
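One note on the GROUP BY: if "students"."id" is the primary key (which seems likely but isn't confirmed in the question), Postgres's functional-dependency rule lets you group by "students"."id" alone, since the other three columns are then implied. Listing all four is harmless, just more verbose.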
Some obvious indexes would be (see the DDL sketch after this list):
students (id)
student_tracks (student_id)
chats (student_id, id)
chat_reports (chat_id)
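As a sketch of the corresponding DDL (the index names here are made up, and students (id) is almost certainly covered by the primary key already, so only the other three are likely needed):
-- Hypothetical index names; adjust to your own conventions.
-- CONCURRENTLY avoids blocking writes on a production table,
-- but note it cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY IF NOT EXISTS student_tracks_student_id_idx
    ON student_tracks (student_id);
CREATE INDEX CONCURRENTLY IF NOT EXISTS chats_student_id_id_idx
    ON chats (student_id, id);
CREATE INDEX CONCURRENTLY IF NOT EXISTS chat_reports_chat_id_idx
    ON chat_reports (chat_id);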
Try this and, if the result is not yet fast enough, post all your index definitions and an updated query plan.