Help designing more efficient Postgres data schema and queries

Question 1

I need help designing a more efficient storage schema in Postgres for a large dataset and/or the subsequent queries that can be used to retrieve the data.

Here's the situation. I have a huge list (5M+) of products along with associated sub-products.

For example, here's a few notional product and sub-product groups. As you can see, sub-products may be associated with one or more main product.

Product1 Product2 Product3 Product4
 -SubProA -SubProD -SubProC -SubProA
 -SubProB -SubProE -SubProF -SubProD
 -SubProC -SubProG

What I need to do is given a particular sub-product, efficiently identify every other unique sub-products where the search sub-product is present. In the example, if I select SubProC, it should retrieve SubProA, SubProB and SubProF.

My initial attempt involved creating a product table, a sub-product table and then just a simple a relationship table between the two.

Products
------------------
| ID | Name |
| 1 | Product 1 |
| 2 | Product 2 |
| 3 | Product 3 |
| 4 | Product 4 |
Sub-Products
----------------
| ID | Name |
| 1 | SubProA |
| 2 | SubProB |
| 3 | SubProC |
| 4 | SubProD |
| 5 | SubProE |
Relationships
---------------------------------
| ID | ProductID | SubProductID |
| 1 | 1 | 1 |
| 2 | 1 | 2 |
| 3 | 1 | 3 |
| 4 | 2 | 4 |
| 5 | 2 | 5 |

Basically what I'm doing is a simple lookup on the sub-product to get the ID then querying for its corresponding ProductID from the relationship table then looping through all of those products to capture the other SubProductID while joined with the sub-product table to get the sub-product name.

select ID from sub_products where name = 1ドル;
result = select ProductID from relationships where SubProductID = 1ドル;
for each ProductID in result:
 select s.Name from sub_products s join relationships r on (s.ID = r.SubProductID) where r.ProductID = 1ドル;

While this would work well enough for small data sets, it is painfully slow for millions of records because of all of the individual lookups and loops. Some sub-products may be associated with 10's of thousands of products but the corresponding list of related sub-products is much smaller and my simple relationship table isn't capturing that efficiently.

So, is there a better data schema or even a set of queries I can use?

Question 2

I think that your data model is basically good, but you should use a single query rather than loop explicitly:

SELECT DISTINCT s2.name
FROM subproduct AS s1
 JOIN relationship AS r1
 ON r1.subproductid = s1.id
 JOIN relationship AS r2
 ON r2.productid = r2.productid
 JOIN subproduct AS s2
 ON r2.subproductid = s2.id
WHERE s1.name = 'SubProdA';

Unless those queries select a big part of the respective tables, you need indexes on some join conditions to get an efficient nested loop join.

The relationship table doesn't need an artificial primary key; use the combination of the foreign keys as primary key.

Question 3

thanks! I added hash indexes and they definitely speed up my looping queries (I also added some limit statements). Still pretty slow though with single query you provided. Would the relationship table foreign key combination add additional performance or is that just a good design practice?

Question 4

Foreign keys provide integrity, not performance. Indexes on the columns with the foreign keys can boost performance.

Laurenz Albe Laurenz Albe 62k4 gold badges57 silver badges93 bronze badges · Accepted Answer · 2023-04-06 14:12:04Z

I think that your data model is basically good, but you should use a single query rather than loop explicitly:

SELECT DISTINCT s2.name
FROM subproduct AS s1
 JOIN relationship AS r1
 ON r1.subproductid = s1.id
 JOIN relationship AS r2
 ON r2.productid = r2.productid
 JOIN subproduct AS s2
 ON r2.subproductid = s2.id
WHERE s1.name = 'SubProdA';

Unless those queries select a big part of the respective tables, you need indexes on some join conditions to get an efficient nested loop join.

The relationship table doesn't need an artificial primary key; use the combination of the foreign keys as primary key.

thanks! I added hash indexes and they definitely speed up my looping queries (I also added some limit statements). Still pretty slow though with single query you provided. Would the relationship table foreign key combination add additional performance or is that just a good design practice?
Foreign keys provide integrity, not performance. Indexes on the columns with the foreign keys can boost performance.

Stack Exchange Network

Help designing more efficient Postgres data schema and queries

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Help designing more efficient Postgres data schema and queries

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions