0

In SQL 2017, I have a 'food' table as follows:

ID item Ingredients
---- ----- -----------
1 item1 flour,sugar,cocoa,butter
2 item2 flour,sugar,butter,water
3 item3 sugar,cocoa,water
4 item4 sugar,cocoa,butter
5 item5 flour,water

Although the "Ingredients" are a CSV in a single field, I have no problem moving them to a related table:

ID Ingredient
--- ----------
 1 flour
 1 sugar
 1 cocoa
 1 butter
 2 flour

etc or just select them with a CROSS APPLY string_split

select ID, item, ingredient
from food
CROSS APPLY string_split(food.ingredients,',') as ing

Now here is what I am trying to attempt:

For any given row ex:(where id = 1) I want to return the top n rows (other than id = 1) that have the most common ingredients with id = 1.

So the top 2 with the most common ingredients with id = 1 should be:

ID item Ingredients
---- ----- -----------
2 item2 flour,sugar,butter,water ( 3 in common)
3 item3 sugar,cocoa,water ( 2 in common)
4 item4 sugar,cocoa,butter ( 3 in common)
5 item5 flour,water ( 1 in common)
 

The result should return rows with id = 2 and id = 4 as both have 3 ingredients in common with row id = 1

This is my first question on StackExchange, and I hope it is clear. I can write this as a C# or Python batch program to create a table of 'closest ingredients', but I'm hoping I can do this in SQL.

Thanks !

asked Jul 24, 2020 at 19:36

1 Answer 1

1

Please store the ingredients in a table not a list. So let's call that ItemIngredient with two columns ItemId and Ingredient and leave it at that. The primary key will be (ItemId, Ingredient). Because searches by Ingredient will probably happen a lot, let's put a non-unique index on Ingredient (which is necessary for below to work the most efficiently).

SELECT
 ItemId
 ,MatchItemId
 ,IngredientMatchCount
 ,MatchRank
FROM
 (
 SELECT
 Base.ItemId
 ,Match.ItemId AS MatchItemId
 ,COUNT(*) AS IngredientMatchCount
 ,RANK() OVER (PARTITION BY Base.ItemId ORDER BY COUNT(*) DESC, Match.ItemId) AS MatchRank
 FROM
 ItemIngredient Base
 INNER JOIN
 ItemIngredient Match
 ON Match.Ingredient = Base.Ingredient 
 AND Match.ItemId <> Base.ItemId
 -- Put your WHERE clause here if you want to only fine matches for certain Ids
 GROUP BY
 Base.ItemId
 ,Match.ItemId
 ) matched
WHERE
 MatchRank <= 1 --Or whatever
ORDER BY
 ItemId
 

You can experiment with RANK vs DENSE_RANK at some point to see what provides the best results for you. If you need details about the Item you can join back in the outer query.

answered Jul 24, 2020 at 20:14

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.