How to optimize a recursive query in Postgresql?

Question 1

I have a recursive query that takes too long - 30+ ms where doing the individual queries to extract the same data manually takes < 0.12 ms. So we're talking 250x as long.

I have the following database structure allowing a DAG of group membership (db-fiddle here):

create table subjects
(
 subject_id bigint not null
 constraint pk_subjects
 primary key
);
create table subject_group_members
(
 subject_group_id bigint not null
 constraint fk_subject_group_members_subject_group_id_subjects_subject_id
 references subjects(subject_id)
 on delete cascade,
 subject_id bigint not null
 constraint fk_subject_group_members_subject_id_subjects_subject_id
 references subjects(subject_id)
 on delete cascade,
 constraint pk_subject_group_members
 primary key (subject_group_id, subject_id)
);
create index idx_subject_group_members_subject_id
 on subject_group_members (subject_id);
create index idx_subject_group_members_subject_group_id
 on subject_group_members (subject_group_id);

Data might look like this:

subject_group_id	subject_id
1	2
1	3
1	4
2	5
3	5

I want to know all the groups that 5 is a member of (1 by inheritance, 2 & 3 directly, not 4 or any other subject ids).

This query works as expected:

with recursive flat_members(subject_group_id, subject_id) as (
 select subject_group_id, subject_id
 from subject_group_members gm
 union
 select
 flat_members.subject_group_id as subject_group_id,
 subject_group_members.subject_id as subject_id
 from subject_group_members
 join flat_members on flat_members.subject_id = subject_group_members.subject_group_id
 )
 select * from flat_members where subject_id = 5

But run with real data I get this query plan:

CTE Scan on flat_members (cost=36759729.47..59962757.76 rows=5156229 width=16) (actual time=26.526..55.166 rows=3 loops=1)
 Filter: (subject_id = 30459)
 Rows Removed by Filter: 48984
 CTE flat_members
 -> Recursive Union (cost=0.00..36759729.47 rows=1031245702 width=16) (actual time=0.022..47.638 rows=48987 loops=1)
 -> Seq Scan on subject_group_members gm (cost=0.00..745.82 rows=48382 width=16) (actual time=0.019..4.286 rows=48382 loops=1)
 -> Merge Join (cost=63629.74..1613406.96 rows=103119732 width=16) (actual time=10.897..11.038 rows=320 loops=2)
 Merge Cond: (subject_group_members.subject_group_id = flat_members_1.subject_id)
 -> Index Scan using idx_subject_group_members_subject_group_id on subject_group_members (cost=0.29..1651.02 rows=48382 width=16) (actual time=0.009..1.987 rows=24192 loops=2)
 -> Materialize (cost=63629.45..66048.55 rows=483820 width=16) (actual time=4.124..6.592 rows=24668 loops=2)
 -> Sort (cost=63629.45..64839.00 rows=483820 width=16) (actual time=4.120..5.034 rows=24494 loops=2)
 Sort Key: flat_members_1.subject_id
 Sort Method: quicksort Memory: 53kB
 -> WorkTable Scan on flat_members flat_members_1 (cost=0.00..9676.40 rows=483820 width=16) (actual time=0.001..0.916 rows=24494 loops=2)
Planning Time: 0.296 ms
Execution Time: 56.735 ms

Now if I do it manually, querying select subject_group_id from subject_group_members where subject_id = 30459 and following the tree up, it's 4 queries each taking about 0.02ms.

Is there a way where I can make the recursive query approach the speed of doing the recursion manually?

Question 2

I feel like the problem is that I merge join the whole lot up front, then filter by the subject_id I actually care about. Instead of finding the two rows that actually contain my desired subject_id, then using their subject_group_id as the subject_id in subsequent recursions.

Question 3

Looks like you inverted the join condition by accident.

WITH RECURSIVE flat_members AS (
 SELECT subject_group_id
 FROM subject_group_members gm
 WHERE subject_id = 5
 UNION
 SELECT gm.subject_group_id
 FROM flat_members fm
 JOIN subject_group_members gm ON gm.subject_id = fm.subject_group_id
 )
TABLE flat_members;

fiddle

Plus, move the filter WHERE subject_id = 5 up to the initial SELECT to filter irrelevant rows early - and allow for an optimized query plan, typically using an index. Speaking of which, this multicolumn index would serve much better, allowing index-only scans:

CREATE INDEX subject_group_members_subject_id_subject_group_id
 ON subject_group_members (subject_id, subject_group_id);

(Might as well be UNIQUE.) In addition to your PK on (subject_group_id, subject_id). Or invert the columns in the PK definition, either might be useful.

About index-only scans:

Can Postgres use an index-only scan for this query with joined tables?

It's typically best to just have a PK on (subject_id, subject_group_id), another multicolumn index on (subject_group_id, subject_id), and drop the two indexes on only (subject_id) and (subject_group_id). See:

Is a composite index also good for queries on the first field?

Question 4

Thanks! I assumed the PK on subject_group_members would provide that index, was I wrong?

Question 5

@RobertElliot: Note the update. Wasn't all right, yet. An, no, the PK has column in the opposite order. Also added notes on that.

Question 6

It's massively faster, but oddly it was faster still without the additional index. Details here: gist.github.com/Mahoney/0482757addcd40928923fd726ae0493e

Question 7

Updated - yup, you're right, cheaper with the index.

Question 8

I can't thank you enough for this - I was in big trouble unless I could sort out this performance problem.

score 5 · Accepted Answer · 2022-09-17 11:54:28Z

Looks like you inverted the join condition by accident.

WITH RECURSIVE flat_members AS (
 SELECT subject_group_id
 FROM subject_group_members gm
 WHERE subject_id = 5
 UNION
 SELECT gm.subject_group_id
 FROM flat_members fm
 JOIN subject_group_members gm ON gm.subject_id = fm.subject_group_id
 )
TABLE flat_members;

fiddle

Plus, move the filter WHERE subject_id = 5 up to the initial SELECT to filter irrelevant rows early - and allow for an optimized query plan, typically using an index. Speaking of which, this multicolumn index would serve much better, allowing index-only scans:

CREATE INDEX subject_group_members_subject_id_subject_group_id
 ON subject_group_members (subject_id, subject_group_id);

(Might as well be UNIQUE.) In addition to your PK on (subject_group_id, subject_id). Or invert the columns in the PK definition, either might be useful.

About index-only scans:

Can Postgres use an index-only scan for this query with joined tables?

It's typically best to just have a PK on (subject_id, subject_group_id), another multicolumn index on (subject_group_id, subject_id), and drop the two indexes on only (subject_id) and (subject_group_id). See:

Is a composite index also good for queries on the first field?

Thanks! I assumed the PK on subject_group_members would provide that index, was I wrong?
@RobertElliot: Note the update. Wasn't all right, yet. An, no, the PK has column in the opposite order. Also added notes on that.
It's massively faster, but oddly it was faster still without the additional index. Details here: gist.github.com/Mahoney/0482757addcd40928923fd726ae0493e
I can't thank you enough for this - I was in big trouble unless I could sort out this performance problem.

Stack Exchange Network

How to optimize a recursive query in Postgresql?

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

How to optimize a recursive query in Postgresql?

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions