I have a recursive query that takes too long - 30+ ms where doing the individual queries to extract the same data manually takes < 0.12 ms. So we're talking 250x as long.
I have the following database structure allowing a DAG of group membership (db-fiddle here):
create table subjects
(
subject_id bigint not null
constraint pk_subjects
primary key
);
create table subject_group_members
(
subject_group_id bigint not null
constraint fk_subject_group_members_subject_group_id_subjects_subject_id
references subjects(subject_id)
on delete cascade,
subject_id bigint not null
constraint fk_subject_group_members_subject_id_subjects_subject_id
references subjects(subject_id)
on delete cascade,
constraint pk_subject_group_members
primary key (subject_group_id, subject_id)
);
create index idx_subject_group_members_subject_id
on subject_group_members (subject_id);
create index idx_subject_group_members_subject_group_id
on subject_group_members (subject_group_id);
Data might look like this:
subject_group_id | subject_id |
---|---|
1 | 2 |
1 | 3 |
1 | 4 |
2 | 5 |
3 | 5 |
I want to know all the groups that 5 is a member of (1 by inheritance, 2 & 3 directly, not 4 or any other subject ids).
This query works as expected:
with recursive flat_members(subject_group_id, subject_id) as (
select subject_group_id, subject_id
from subject_group_members gm
union
select
flat_members.subject_group_id as subject_group_id,
subject_group_members.subject_id as subject_id
from subject_group_members
join flat_members on flat_members.subject_id = subject_group_members.subject_group_id
)
select * from flat_members where subject_id = 5
But run with real data I get this query plan:
CTE Scan on flat_members (cost=36759729.47..59962757.76 rows=5156229 width=16) (actual time=26.526..55.166 rows=3 loops=1)
Filter: (subject_id = 30459)
Rows Removed by Filter: 48984
CTE flat_members
-> Recursive Union (cost=0.00..36759729.47 rows=1031245702 width=16) (actual time=0.022..47.638 rows=48987 loops=1)
-> Seq Scan on subject_group_members gm (cost=0.00..745.82 rows=48382 width=16) (actual time=0.019..4.286 rows=48382 loops=1)
-> Merge Join (cost=63629.74..1613406.96 rows=103119732 width=16) (actual time=10.897..11.038 rows=320 loops=2)
Merge Cond: (subject_group_members.subject_group_id = flat_members_1.subject_id)
-> Index Scan using idx_subject_group_members_subject_group_id on subject_group_members (cost=0.29..1651.02 rows=48382 width=16) (actual time=0.009..1.987 rows=24192 loops=2)
-> Materialize (cost=63629.45..66048.55 rows=483820 width=16) (actual time=4.124..6.592 rows=24668 loops=2)
-> Sort (cost=63629.45..64839.00 rows=483820 width=16) (actual time=4.120..5.034 rows=24494 loops=2)
Sort Key: flat_members_1.subject_id
Sort Method: quicksort Memory: 53kB
-> WorkTable Scan on flat_members flat_members_1 (cost=0.00..9676.40 rows=483820 width=16) (actual time=0.001..0.916 rows=24494 loops=2)
Planning Time: 0.296 ms
Execution Time: 56.735 ms
Now if I do it manually, querying select subject_group_id from subject_group_members where subject_id = 30459
and following the tree up, it's 4 queries each taking about 0.02ms.
Is there a way where I can make the recursive query approach the speed of doing the recursion manually?
1 Answer 1
Looks like you inverted the join condition by accident.
WITH RECURSIVE flat_members AS (
SELECT subject_group_id
FROM subject_group_members gm
WHERE subject_id = 5
UNION
SELECT gm.subject_group_id
FROM flat_members fm
JOIN subject_group_members gm ON gm.subject_id = fm.subject_group_id
)
TABLE flat_members;
Plus, move the filter WHERE subject_id = 5
up to the initial SELECT
to filter irrelevant rows early - and allow for an optimized query plan, typically using an index. Speaking of which, this multicolumn index would serve much better, allowing index-only scans:
CREATE INDEX subject_group_members_subject_id_subject_group_id
ON subject_group_members (subject_id, subject_group_id);
(Might as well be UNIQUE
.) In addition to your PK on (subject_group_id, subject_id)
. Or invert the columns in the PK definition, either might be useful.
About index-only scans:
It's typically best to just have a PK on (subject_id, subject_group_id)
, another multicolumn index on (subject_group_id, subject_id)
, and drop the two indexes on only (subject_id)
and (subject_group_id)
. See:
-
Thanks! I assumed the PK on subject_group_members would provide that index, was I wrong?Robert Elliot– Robert Elliot2022年09月17日 12:02:46 +00:00Commented Sep 17, 2022 at 12:02
-
@RobertElliot: Note the update. Wasn't all right, yet. An, no, the PK has column in the opposite order. Also added notes on that.Erwin Brandstetter– Erwin Brandstetter2022年09月17日 12:08:57 +00:00Commented Sep 17, 2022 at 12:08
-
1It's massively faster, but oddly it was faster still without the additional index. Details here: gist.github.com/Mahoney/0482757addcd40928923fd726ae0493eRobert Elliot– Robert Elliot2022年09月17日 12:36:10 +00:00Commented Sep 17, 2022 at 12:36
-
1Updated - yup, you're right, cheaper with the index.Robert Elliot– Robert Elliot2022年09月17日 12:57:53 +00:00Commented Sep 17, 2022 at 12:57
-
1I can't thank you enough for this - I was in big trouble unless I could sort out this performance problem.Robert Elliot– Robert Elliot2022年09月17日 13:06:05 +00:00Commented Sep 17, 2022 at 13:06
Explore related questions
See similar questions with these tags.
subject_id
I actually care about. Instead of finding the two rows that actually contain my desiredsubject_id
, then using theirsubject_group_id
as thesubject_id
in subsequent recursions.