How to find the most common strings between rows

Question 1

In SQL 2017, I have a 'food' table as follows:

ID item Ingredients
---- ----- -----------
1 item1 flour,sugar,cocoa,butter
2 item2 flour,sugar,butter,water
3 item3 sugar,cocoa,water
4 item4 sugar,cocoa,butter
5 item5 flour,water

Although the "Ingredients" are a CSV in a single field, I have no problem moving them to a related table:

ID Ingredient
--- ----------
 1 flour
 1 sugar
 1 cocoa
 1 butter
 2 flour

etc or just select them with a CROSS APPLY string_split

select ID, item, ingredient
from food
CROSS APPLY string_split(food.ingredients,',') as ing

Now here is what I am trying to attempt:

For any given row ex:(where id = 1) I want to return the top n rows (other than id = 1) that have the most common ingredients with id = 1.

So the top 2 with the most common ingredients with id = 1 should be:

ID item Ingredients
---- ----- -----------
2 item2 flour,sugar,butter,water ( 3 in common)
3 item3 sugar,cocoa,water ( 2 in common)
4 item4 sugar,cocoa,butter ( 3 in common)
5 item5 flour,water ( 1 in common)

The result should return rows with id = 2 and id = 4 as both have 3 ingredients in common with row id = 1

This is my first question on StackExchange, and I hope it is clear. I can write this as a C# or Python batch program to create a table of 'closest ingredients', but I'm hoping I can do this in SQL.

Thanks !

Question 2

Please store the ingredients in a table not a list. So let's call that ItemIngredient with two columns ItemId and Ingredient and leave it at that. The primary key will be (ItemId, Ingredient). Because searches by Ingredient will probably happen a lot, let's put a non-unique index on Ingredient (which is necessary for below to work the most efficiently).

SELECT
 ItemId
 ,MatchItemId
 ,IngredientMatchCount
 ,MatchRank
FROM
 (
 SELECT
 Base.ItemId
 ,Match.ItemId AS MatchItemId
 ,COUNT(*) AS IngredientMatchCount
 ,RANK() OVER (PARTITION BY Base.ItemId ORDER BY COUNT(*) DESC, Match.ItemId) AS MatchRank
 FROM
 ItemIngredient Base
 INNER JOIN
 ItemIngredient Match
 ON Match.Ingredient = Base.Ingredient 
 AND Match.ItemId <> Base.ItemId
 -- Put your WHERE clause here if you want to only fine matches for certain Ids
 GROUP BY
 Base.ItemId
 ,Match.ItemId
 ) matched
WHERE
 MatchRank <= 1 --Or whatever
ORDER BY
 ItemId

You can experiment with RANK vs DENSE_RANK at some point to see what provides the best results for you. If you need details about the Item you can join back in the outer query.

user212533user212533 · Accepted Answer · 2020-07-24 20:14:53Z

Please store the ingredients in a table not a list. So let's call that ItemIngredient with two columns ItemId and Ingredient and leave it at that. The primary key will be (ItemId, Ingredient). Because searches by Ingredient will probably happen a lot, let's put a non-unique index on Ingredient (which is necessary for below to work the most efficiently).

SELECT
 ItemId
 ,MatchItemId
 ,IngredientMatchCount
 ,MatchRank
FROM
 (
 SELECT
 Base.ItemId
 ,Match.ItemId AS MatchItemId
 ,COUNT(*) AS IngredientMatchCount
 ,RANK() OVER (PARTITION BY Base.ItemId ORDER BY COUNT(*) DESC, Match.ItemId) AS MatchRank
 FROM
 ItemIngredient Base
 INNER JOIN
 ItemIngredient Match
 ON Match.Ingredient = Base.Ingredient 
 AND Match.ItemId <> Base.ItemId
 -- Put your WHERE clause here if you want to only fine matches for certain Ids
 GROUP BY
 Base.ItemId
 ,Match.ItemId
 ) matched
WHERE
 MatchRank <= 1 --Or whatever
ORDER BY
 ItemId

You can experiment with RANK vs DENSE_RANK at some point to see what provides the best results for you. If you need details about the Item you can join back in the outer query.

Stack Exchange Network

How to find the most common strings between rows

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

How to find the most common strings between rows

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions