Optimizing a PostgreSQL query with many small lookups

Question 1

I have a database about books and book sales. The tables look like this:

CREATE TABLE purchase (
 person_id integer,
 book_id integer,
 CONSTRAINT purchase_pk PRIMARY KEY (person_id, book_id)
);
CREATE TABLE person (
 person_id integer PRIMARY KEY
);
CREATE TABLE book (
 book_id integer PRIMARY KEY
);
CREATE INDEX purchase_idx ON purchase (person_id, book_id);

I would need to know the number of people who have brought both of the books for each pair of books, so basically:

book_1_id	book_2_id	people_count_who_bought_both_books
1	2	45
1	3	789

I have defined the following query for this:

CREATE MATERIALIZED VIEW book_sales_stat AS
SELECT b1.book_id AS book_1_id,
 b2.book_id AS book_2_id,
 (SELECT count(*)
 FROM person AS per
 WHERE EXISTS (SELECT FROM purchase pur
 WHERE pur.person_id = per.person_id
 AND pur.book_id = b1.book_id)
 AND EXISTS (SELECT FROM purchase pur
 WHERE pur.person_id = per.person_id
 AND pur.book_id = b2.book_id)
 ) AS person_count
FROM book AS b1, book AS b2
WHERE b1.book_id < b2.book_id;

But unfortunately it is incredibly slow.

The execution plan looks like this:

 Nested Loop (cost=0.29..1564886083010.59 rows=90420300 width=16)
 -> Seq Scan on book b1 (cost=0.00..237.70 rows=16470 width=4)
 -> Index Only Scan using book_pk on book b2 (cost=0.29..96.38 rows=5490 width=4)
 Index Cond: (book_id > b1.book_id)
 SubPlan 1
 -> Aggregate (cost=17306.76..17306.77 rows=1 width=8)
 -> Nested Loop (cost=1.14..17306.76 rows=1 width=0)
 -> Nested Loop (cost=0.72..17238.11 rows=123 width=8)
 -> Index Only Scan using purchase_idx on purchase pur_1 (cost=0.42..16791.97 rows=123 width=4)
 Index Cond: (book_id = b2.book_id)
 -> Index Only Scan using person_pk on person per (cost=0.29..3.63 rows=1 width=4)
 Index Cond: (person_id = pur_1.person_id)
 -> Index Only Scan using purchase_idx on purchase pur (cost=0.42..0.56 rows=1 width=4)
 Index Cond: ((person_id = per.person_id) AND (book_id = b1.book_id))
 JIT:
 Functions: 13
 Options: Inlining true, Optimization true, Expressions true, Deforming true

Seems pretty good in so far that the purchase_idx index is used for pretty much everything inside the loop.

But why does it not use a parallel execution plan? I tried to go over all the requirements for parallel execution in the Postgres docs, but did not find anything that I would not be satisfied.

What can I do to speed the query up and get it to execute in parallel?

Question 2

You have an index which duplicates the primary key. Clearly you instead need an index on purchase which starts with book_id.

Question 3

for each pair of books - I assume you mean for each pair of books bought by at least one person. We wouldn't want to list the presumably huge list of books that where never bought by the same person. (Or more than one?) Also please always disclose your version of Postgres. And cardinalities for our tables.

Question 4

This lists all pairs of books that were actually bought by at least one person. (All other pairs were never bought together, no point in diluting the result):

SELECT p1.book_id AS book1_id
 , p2.book_id AS book2_id
 , count(*) AS count_of_people_who_bought_both_books
FROM purchase p1
JOIN purchase p2 USING (person_id)
WHERE p1.book_id < p2.book_id
GROUP BY 1, 2
ORDER BY 1, 2;

Should be substantially faster.

All it needs is the existing PK index on (person_id, book_id) - with person_id first, just like you have it. See:

Is a composite index also good for queries on the first field?

Postgres gets it done with two index-only scans in my hands, provided the table purchase has been vacuum'ed enough.

Assuming a proper many-to-many implementation with referential integrity enforced by FKs, there is no need to involve the tables person and book. That would just add cost. Related:

How to implement a many-to-many relationship in PostgreSQL?

As opposed to your correlated subqueries (see Jeff's answer), this query plan can be parallelized, too.

Question 5

Your query falls under this condition:

The following operations are always parallel restricted:

Plan nodes that reference a correlated SubPlan.

The top nested loop could be parallelized, but it would have to be gathered before dispatching to the SubPlan. There is just no point in doing it that way.

Question 6

Would there we a way to restructure the query so that it could use a parallel execution plan?

Question 7

You can achieve the same with something like

select book_1_id, book_2_id, sum(cnt) person_count
from (
select b1.book_id book_1_id, b2.book_id book_2_id, 0 cnt
from book b1
join book b2
 on b1.book_id < b2.book_id
union all
select p1.book_id, p2.book_id, count(*) cnt
from purchase p1
join purchase p2
 on p1.person_id = p2.person_id
 and p1.book_id < p2.book_id
group by p1.book_id, p2.book_id
) Sq
group by book_1_id, book_2_id

This assumes that there's no rows in purchase that don't correspond to a row in person or book (there's no foreign key at the moment). It also requires person_id,book_id to be unique in purchase - it is right now but if that's just for demo purposes you will need to distinct in the query, like:

with unique_purchase as (select distinct person_id, book_id from purchase)
select book_1_id, book_2_id, sum(cnt) person_count
from (
select b1.book_id book_1_id, b2.book_id book_2_id, 0 cnt
from book b1
join book b2
 on b1.book_id < b2.book_id
union all
select p1.book_id, p2.book_id, count(*) cnt
from unique_purchase p1
join unique_purchase p2
 on p1.person_id = p2.person_id
 and p1.book_id < p2.book_id
group by p1.book_id, p2.book_id
) sq
group by book_1_id, book_2_id

This performs faster for me, but it's still not going to be instant if you have lots of data due to the nature of the joins involved.

score 1 · Accepted Answer · 2022-05-06 02:04:07Z

This lists all pairs of books that were actually bought by at least one person. (All other pairs were never bought together, no point in diluting the result):

SELECT p1.book_id AS book1_id
 , p2.book_id AS book2_id
 , count(*) AS count_of_people_who_bought_both_books
FROM purchase p1
JOIN purchase p2 USING (person_id)
WHERE p1.book_id < p2.book_id
GROUP BY 1, 2
ORDER BY 1, 2;

Should be substantially faster.

All it needs is the existing PK index on (person_id, book_id) - with person_id first, just like you have it. See:

Is a composite index also good for queries on the first field?

Postgres gets it done with two index-only scans in my hands, provided the table purchase has been vacuum'ed enough.

Assuming a proper many-to-many implementation with referential integrity enforced by FKs, there is no need to involve the tables person and book. That would just add cost. Related:

How to implement a many-to-many relationship in PostgreSQL?

As opposed to your correlated subqueries (see Jeff's answer), this query plan can be parallelized, too.

Stack Exchange Network

Optimizing a PostgreSQL query with many small lookups

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions