I have a database about books and book sales. The tables look like this:
CREATE TABLE purchase (
person_id integer,
book_id integer,
CONSTRAINT purchase_pk PRIMARY KEY (person_id, book_id)
);
CREATE TABLE person (
person_id integer PRIMARY KEY
);
CREATE TABLE book (
book_id integer PRIMARY KEY
);
CREATE INDEX purchase_idx ON purchase (person_id, book_id);
I would need to know the number of people who have brought both of the books for each pair of books, so basically:
book_1_id | book_2_id | people_count_who_bought_both_books |
---|---|---|
1 | 2 | 45 |
1 | 3 | 789 |
I have defined the following query for this:
CREATE MATERIALIZED VIEW book_sales_stat AS
SELECT b1.book_id AS book_1_id,
b2.book_id AS book_2_id,
(SELECT count(*)
FROM person AS per
WHERE EXISTS (SELECT FROM purchase pur
WHERE pur.person_id = per.person_id
AND pur.book_id = b1.book_id)
AND EXISTS (SELECT FROM purchase pur
WHERE pur.person_id = per.person_id
AND pur.book_id = b2.book_id)
) AS person_count
FROM book AS b1, book AS b2
WHERE b1.book_id < b2.book_id;
But unfortunately it is incredibly slow.
The execution plan looks like this:
Nested Loop (cost=0.29..1564886083010.59 rows=90420300 width=16)
-> Seq Scan on book b1 (cost=0.00..237.70 rows=16470 width=4)
-> Index Only Scan using book_pk on book b2 (cost=0.29..96.38 rows=5490 width=4)
Index Cond: (book_id > b1.book_id)
SubPlan 1
-> Aggregate (cost=17306.76..17306.77 rows=1 width=8)
-> Nested Loop (cost=1.14..17306.76 rows=1 width=0)
-> Nested Loop (cost=0.72..17238.11 rows=123 width=8)
-> Index Only Scan using purchase_idx on purchase pur_1 (cost=0.42..16791.97 rows=123 width=4)
Index Cond: (book_id = b2.book_id)
-> Index Only Scan using person_pk on person per (cost=0.29..3.63 rows=1 width=4)
Index Cond: (person_id = pur_1.person_id)
-> Index Only Scan using purchase_idx on purchase pur (cost=0.42..0.56 rows=1 width=4)
Index Cond: ((person_id = per.person_id) AND (book_id = b1.book_id))
JIT:
Functions: 13
Options: Inlining true, Optimization true, Expressions true, Deforming true
Seems pretty good in so far that the purchase_idx
index is used for pretty much everything inside the loop.
But why does it not use a parallel execution plan? I tried to go over all the requirements for parallel execution in the Postgres docs, but did not find anything that I would not be satisfied.
What can I do to speed the query up and get it to execute in parallel?
-
You have an index which duplicates the primary key. Clearly you instead need an index on purchase which starts with book_id.jjanes– jjanes2022年05月05日 17:34:58 +00:00Commented May 5, 2022 at 17:34
-
for each pair of books - I assume you mean for each pair of books bought by at least one person. We wouldn't want to list the presumably huge list of books that where never bought by the same person. (Or more than one?) Also please always disclose your version of Postgres. And cardinalities for our tables.Erwin Brandstetter– Erwin Brandstetter2022年05月06日 01:39:48 +00:00Commented May 6, 2022 at 1:39
3 Answers 3
This lists all pairs of books that were actually bought by at least one person. (All other pairs were never bought together, no point in diluting the result):
SELECT p1.book_id AS book1_id
, p2.book_id AS book2_id
, count(*) AS count_of_people_who_bought_both_books
FROM purchase p1
JOIN purchase p2 USING (person_id)
WHERE p1.book_id < p2.book_id
GROUP BY 1, 2
ORDER BY 1, 2;
Should be substantially faster.
All it needs is the existing PK index on (person_id, book_id)
- with person_id first, just like you have it. See:
Postgres gets it done with two index-only scans in my hands, provided the table purchase
has been vacuum'ed enough.
Assuming a proper many-to-many implementation with referential integrity enforced by FKs, there is no need to involve the tables person
and book
. That would just add cost. Related:
As opposed to your correlated subqueries (see Jeff's answer), this query plan can be parallelized, too.
Your query falls under this condition:
The following operations are always parallel restricted:
- Plan nodes that reference a correlated SubPlan.
The top nested loop could be parallelized, but it would have to be gathered before dispatching to the SubPlan. There is just no point in doing it that way.
-
Would there we a way to restructure the query so that it could use a parallel execution plan?Eerik Sven Puudist– Eerik Sven Puudist2022年05月05日 18:13:28 +00:00Commented May 5, 2022 at 18:13
You can achieve the same with something like
select book_1_id, book_2_id, sum(cnt) person_count
from (
select b1.book_id book_1_id, b2.book_id book_2_id, 0 cnt
from book b1
join book b2
on b1.book_id < b2.book_id
union all
select p1.book_id, p2.book_id, count(*) cnt
from purchase p1
join purchase p2
on p1.person_id = p2.person_id
and p1.book_id < p2.book_id
group by p1.book_id, p2.book_id
) Sq
group by book_1_id, book_2_id
This assumes that there's no rows in purchase
that don't correspond to a row in person
or book
(there's no foreign key at the moment). It also requires person_id,book_id
to be unique in purchase
- it is right now but if that's just for demo purposes you will need to distinct
in the query, like:
with unique_purchase as (select distinct person_id, book_id from purchase)
select book_1_id, book_2_id, sum(cnt) person_count
from (
select b1.book_id book_1_id, b2.book_id book_2_id, 0 cnt
from book b1
join book b2
on b1.book_id < b2.book_id
union all
select p1.book_id, p2.book_id, count(*) cnt
from unique_purchase p1
join unique_purchase p2
on p1.person_id = p2.person_id
and p1.book_id < p2.book_id
group by p1.book_id, p2.book_id
) sq
group by book_1_id, book_2_id
This performs faster for me, but it's still not going to be instant if you have lots of data due to the nature of the joins involved.
Explore related questions
See similar questions with these tags.