0

I need help designing a more efficient storage schema in Postgres for a large dataset and/or the subsequent queries that can be used to retrieve the data.

Here's the situation. I have a huge list (5M+) of products along with associated sub-products.

For example, here's a few notional product and sub-product groups. As you can see, sub-products may be associated with one or more main product.

Product1 Product2 Product3 Product4
 -SubProA -SubProD -SubProC -SubProA
 -SubProB -SubProE -SubProF -SubProD
 -SubProC -SubProG

What I need to do is given a particular sub-product, efficiently identify every other unique sub-products where the search sub-product is present. In the example, if I select SubProC, it should retrieve SubProA, SubProB and SubProF.

My initial attempt involved creating a product table, a sub-product table and then just a simple a relationship table between the two.

Products
------------------
| ID | Name |
| 1 | Product 1 |
| 2 | Product 2 |
| 3 | Product 3 |
| 4 | Product 4 |
Sub-Products
----------------
| ID | Name |
| 1 | SubProA |
| 2 | SubProB |
| 3 | SubProC |
| 4 | SubProD |
| 5 | SubProE |
Relationships
---------------------------------
| ID | ProductID | SubProductID |
| 1 | 1 | 1 |
| 2 | 1 | 2 |
| 3 | 1 | 3 |
| 4 | 2 | 4 |
| 5 | 2 | 5 |

Basically what I'm doing is a simple lookup on the sub-product to get the ID then querying for its corresponding ProductID from the relationship table then looping through all of those products to capture the other SubProductID while joined with the sub-product table to get the sub-product name.

select ID from sub_products where name = 1ドル;
result = select ProductID from relationships where SubProductID = 1ドル;
for each ProductID in result:
 select s.Name from sub_products s join relationships r on (s.ID = r.SubProductID) where r.ProductID = 1ドル;

While this would work well enough for small data sets, it is painfully slow for millions of records because of all of the individual lookups and loops. Some sub-products may be associated with 10's of thousands of products but the corresponding list of related sub-products is much smaller and my simple relationship table isn't capturing that efficiently.

So, is there a better data schema or even a set of queries I can use?

asked Apr 6, 2023 at 13:39

1 Answer 1

3

I think that your data model is basically good, but you should use a single query rather than loop explicitly:

SELECT DISTINCT s2.name
FROM subproduct AS s1
 JOIN relationship AS r1
 ON r1.subproductid = s1.id
 JOIN relationship AS r2
 ON r2.productid = r2.productid
 JOIN subproduct AS s2
 ON r2.subproductid = s2.id
WHERE s1.name = 'SubProdA';

Unless those queries select a big part of the respective tables, you need indexes on some join conditions to get an efficient nested loop join.

The relationship table doesn't need an artificial primary key; use the combination of the foreign keys as primary key.

answered Apr 6, 2023 at 14:12
2
  • thanks! I added hash indexes and they definitely speed up my looping queries (I also added some limit statements). Still pretty slow though with single query you provided. Would the relationship table foreign key combination add additional performance or is that just a good design practice? Commented Apr 6, 2023 at 16:36
  • Foreign keys provide integrity, not performance. Indexes on the columns with the foreign keys can boost performance. Commented Apr 7, 2023 at 3:01

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.