mysql - find duplicate sets of data grouped by foreign key

Question 1

I have a large mysql table (500,000 records). I need to find sets of data that have all the same attributes (KEY, NAME AND VALUE). Some sets have 20 attributes others have 397 per each KEY. The table looks like this. KEY is a foreign key.

ID, KEY, NAME, VALUE
1 87 Color Red
2 87 Size Big
3 87 Weight 6
4 85 Color Red
5 85 Size Big
6 85 Weight 6
7 96 Color Red
8 96 Size Small
8 96 Weight 7

I'm trying to write a query where if KEY=87 it finds KEY 85 matches all attributes and values like KEY 87. There are 397 different attributes rather than just the 3 I show here. I can do this in php but it's sloppy and I want to learn more about mysql but I can't get my head around it. Appreciate any help in advance.

Question 2

I agree about EAV but if I was to abandon it I would end up with over 397 columns in my table to account for all the attributes and many would be NULL.

Question 3

Does it have to be done using a single query or would you accept a solution using a Stored Procedure?

Question 4

In a subquery, fetch all the attributes (I cringe at EAV, too).
GROUP_CONCAT(... ORDER BY ...) as foo to combine them into a single string.
In the outer query, SELECT GROUP_CONCAT(ky) ... GROUP BY foo HAVING COUNT(*) > 1 to get the duplicate kys. (Please don't use the keyword "KEY".)

Give that a try. If you get in trouble, show us SHOW CREATE TABLE and your failed attempt. (Or abandon EAV.)

Some code (not quite mimicking the above prose):

SELECT `key`, CONCAT(`name`, '=', `value`) AS kv FROM tbl

should give you

key kv
87 Color=Red
87 Size=Big
...

Then...

SELECT `key`,
 GROUP_CONCAT(kv SEPARATOR '|' ORDER BY kv) AS kvs
 FROM ( -- the above query -- ) AS t1

should give you

key kvs
87 Color=Red|Size=Big|...
85 ...

Now...

SELECT kvs,
 GROUP_CONCAT(`key`) AS keys
 FROM ( -- second select -- ) AS t2
 GROUP BY kvs
 HAVING COUNT(*) > 1

should give you

kvs keys
Color=Red|Size=Big|Weight=6 87,85
... (perhaps not including 96)

I suggest you debug my code the way I wrote it -- test the first query to see if it gives you the first set of output, then second, then third.

Question 5

I am not sure why someone posted a bounty on this question, This looks like a reasonable answer.

Question 6

@Imran - You have a similar task?

Question 7

For those interested in this solution I have made up two fiddles of it (MySQL and just for laughs, PostgreSQL). @RickJames - very nice +1! I learnt a lot doing the fiddles - thanks. GROUP_CONCAT/STRING_AGG are really nifty and powerful in the right circumstances - real eye-opener for me! Care to critique my solution?

Question 8

@Vérace - "in the right circumstances" -- GROUP_CONCAT has a default limit of 1024 bytes, but that can be tuned. Still, it is limited to RAM.

Question 9

Yes, I was aware of that, thanks. And thanks for your input to my answer - I'll add a bit more data and see how that affects the no. of records resulting from the CROSS JOIN - it's an interesting CS problem.

Question 10

TL;DR The problem can be solved using SQL's `LEAST` and `GREATEST` functions. A quick look here will tell you what's wrong with the EAV data model!

It can be done this way (tested in 5.6, 5.7 and 8) using (relatively) standard SQL with no requirement for CTEs, Window functions or GROUP_CONCATs or anything fancy like that. I have decided to outline my thought processes rather than simply provide an answer. The fiddles using the OP's sample data shown below is to be found here and the fiddle for extra test data, notably with matching singleton records (discussed at the end) is available here.

Create your table (I have removed any SQL keywords as fieldnames):

CREATE TABLE tab
(
 t_id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
 t_key INTEGER NOT NULL,
 t_name VARCHAR(10) NOT NULL,
 t_value VARCHAR(10) NOT NULL
);

Populate the table:

(85, 'Color', 'Red'),
(85, 'Size', 'Big'),
(85, 'Weight', '6'),
(86, 'Color', 'Red'),
(86, 'Size', 'Big'),
(86, 'Weight', '6'),
(87, 'Color', 'Red'),
(87, 'Size', 'Small'),
(87, 'Weight', '7');

Initially run this query:

SELECT t1.t_key AS "t1key", t1.t_name AS "t1name", t1.t_value AS "t1value",
 t2.t_key AS "t2key", t2.t_name AS "t2name", t2.t_value AS "t2value"
FROM tab t1
JOIN tab t2
ON t1.t_key != t2.t_key
 AND t1.t_name = t2.t_name
 AND t1.t_value = t2.t_value
ORDER BY t1.t_id, t2.t_id;

Result (first 4 rows only of 10 for brevity):

t1key t1name t1value t2key t2name t2value
 85 Color Red 86 Color Red
 85 Color Red 87 Color Red
 85 Size Big 86 Size Big
 85 Weight 6 86 Weight 6

Then, I run this:

SELECT 
 DISTINCT LEAST(t1key, t2key) AS "lst", 
 GREATEST(t1key, t2key) AS "gst",
 COUNT(LEAST(t1key, t2key)) AS "mn_c",
 COUNT(GREATEST(t1key, t2key)) AS "mx_c"
FROM
(
 SELECT t1.t_key AS "t1key", t1.t_name AS "t1name", t1.t_value AS "t1value",
 t2.t_key AS "t2key", t2.t_name AS "t2name", t2.t_value AS "t2value"
 FROM tab t1
 JOIN tab t2
 ON t1.t_key != t2.t_key
 AND t1.t_name = t2.t_name
 AND t1.t_value = t2.t_value
 ORDER BY t1.t_id, t2.t_id
)
AS t1
GROUP BY t1.t1key, t1.t2key
ORDER BY 1, 2, 3;

Result:

lst gst mn_c mx_c
85 86 3 3
85 87 1 1
86 87 1 1

We want a unique combination of the JOIN of t_keys and a COUNT of same.

Now, we know thatt_keys 85 and 86 are perfect matches, but that 85 and 86 will match 87 on only 2 out of 3 t_names and t_values, so it's no surprise that they appear in the query, but we need a way to determine if they match on all the variables. This is determined by the mn_c, mx_c COUNTs.

If the records match on all t_names and t_values, then the COUNT will be the same as the no. of records with those keys - in this case 3. We can see this from inspection - 85 and 86 match on 3 criteria, no more and no less than the record count for both.

So, adding a:

HAVING
COUNT(LEAST(t1key, t2key)) 
 = (SELECT COUNT(tab.t_key) FROM tab WHERE t_key = t1.t1key)
AND
COUNT(GREATEST(t1key, t2key))
 = (SELECT COUNT(tab.t_key) FROM tab WHERE t_key = t1.t2key)

Which checks the COUNT of LEAST and GREATEST against the count of records with that t_keys COUNT in the original tab.

Result:

lst gst mn_c
 85 86 3

Which we know to be the matching count for the records on all values for t_name and t_value.

I added some records to test edge cases (see end of post (and fiddle) for all data used in the final tests):

(92, 'Couleur', 'Rouge'),
(92, 'Poids', 'Leger'),
(93, 'Couleur', 'Rouge'),
(93, 'Poids', 'Leger'),
(94, 'Couleur', 'Rouge'),
(95, 'Couleur', 'Rouge');

Gives the result:

lst gst mn_c
 75 76 4 -- You can explore these matching records in the fiddle.
 75 78 4 -- This was another test of edge cases to see if
 76 78 4 -- multiple matching (i.e. > 2) `t_key`s show up correctly. They do
 85 86 3
 92 93 2
 94 95 1

This is correct - 92 and 93 match, but I wanted to test a singleton as an edge case and indeed singletons which don't match are not (data not shown) - you can add a record like:

(1001, 'Poids', 'Lourd')

or any other singleton and there will be no match.

However, the absolutely CRITICAL edge case was when t_keys with only 1 record matched. These really tested my various solutions and helped me arrive at the correct one.

Finally, the orginal question asks: I'm trying to write a query where if KEY=87 it finds KEY 85 matches all attributes and values like KEY 87. This can be done by adding a simple WHERE clause thus

WHERE LEAST(t1key, t2key) = 75 OR GREATEST(t1key, t2key) = 75 -- actual no. not important. It works - see fiddle.

Which gives:

lst gst mn_c
 75 76 4
 75 78 4

This is the correct answer. Fiddle here.

A note of interest. I ran my first solution (which worked on MySQL) though a PostgreSQL fiddle and it failed. This led me to find a solution which works with PostgreSQL and this also allowed me to improve my original MySQL answer to the point that I now believe it is very solid. PostgreSQL fiddle available here. If working with MySQL, I would recommend that you also run your solutions through a PostgreSQL fiddle.

Two posts really helped with this answer. The first was this one by @KennethFisher - it's far from identical, but Kenneth's elegant solution gave me a start - using COUNTs and DISTINCTs for example. At the risk of seeming to blow my own trumpet, the inspiration for using LEAST and GREATEST came from here, my answer to an also related question concerning matching sets of numbers.

Full test data:

INSERT INTO tab (t_key, t_name, t_value)
VALUES
(75, 'Couleur', 'Bleu'), 
(75, 'Taille', 'Grand'),
(75, 'Poids', '20'),
(75, 'Teint', 'Y'),
(76, 'Couleur', 'Bleu'),
(76, 'Taille', 'Grand'),
(76, 'Poids', '20'),
(76, 'Teint', 'Y'),
(77, 'Couleur', 'Bleu'),
(77, 'Taille', 'Grand'),
(77, 'Poids', '20'),
(77, 'Teint', 'N'),
(78, 'Couleur', 'Bleu'),
(78, 'Taille', 'Grand'),
(78, 'Poids', '20'),
(78, 'Teint', 'Y'),
(85, 'Color', 'Red'),
(85, 'Size', 'Big'),
(85, 'Weight', '6'),
(86, 'Color', 'Red'),
(86, 'Size', 'Big'),
(86, 'Weight', '6'),
(87, 'Color', 'Red'),
(87, 'Size', 'Small'),
(87, 'Weight', '7'),
(92, 'Couleur', 'Rouge'),
(92, 'Poids', 'Leger'),
(93, 'Couleur', 'Rouge'),
(93, 'Poids', 'Leger'),
(94, 'Couleur', 'Rouge'),
(95, 'Couleur', 'Rouge');

Question 11

Well, for my sample data of 31 rows, there's only 66 rows in the CROSS JOIN, so ~ O(2N). The OP says the range of matching attributes is 20 - 397 which should severely restrict the number of records resulting from the CROSS JOIN AFAICS.

Question 12

I stand corrected.

Question 13

When the values are in multiple rows in a single column, WHERE MIN(col)=MAX(col) says "they are all the same. I wonder if that would be useful to do (instead of moving the values into multiple columns in a single row).

Question 14

@RickJames I added a further 60 records (2 sets of 30) with different attributes giving a total of 93 and the CROSS JOIN produces 160 records (< O(2N) - fiddle here). This will vary according to the nature of the attributes, but with good indexing I think performance should be OK. I've reread your comment above (When the values are in multiple rows in a single column,...) and am still unclear. Could you possibly clarify a bit more?

Question 15

Think of it as another way to discover that COUNT(DISTINCT col) = 1

Rick James Rick James 80.7k5 gold badges52 silver badges119 bronze badges · Answer 1 · 2017-09-26 04:11:58Z

In a subquery, fetch all the attributes (I cringe at EAV, too).
GROUP_CONCAT(... ORDER BY ...) as foo to combine them into a single string.
In the outer query, SELECT GROUP_CONCAT(ky) ... GROUP BY foo HAVING COUNT(*) > 1 to get the duplicate kys. (Please don't use the keyword "KEY".)

Give that a try. If you get in trouble, show us SHOW CREATE TABLE and your failed attempt. (Or abandon EAV.)

Some code (not quite mimicking the above prose):

SELECT `key`, CONCAT(`name`, '=', `value`) AS kv FROM tbl

should give you

key kv
87 Color=Red
87 Size=Big
...

Then...

SELECT `key`,
 GROUP_CONCAT(kv SEPARATOR '|' ORDER BY kv) AS kvs
 FROM ( -- the above query -- ) AS t1

should give you

key kvs
87 Color=Red|Size=Big|...
85 ...

Now...

SELECT kvs,
 GROUP_CONCAT(`key`) AS keys
 FROM ( -- second select -- ) AS t2
 GROUP BY kvs
 HAVING COUNT(*) > 1

should give you

kvs keys
Color=Red|Size=Big|Weight=6 87,85
... (perhaps not including 96)

I suggest you debug my code the way I wrote it -- test the first query to see if it gives you the first set of output, then second, then third.

I am not sure why someone posted a bounty on this question, This looks like a reasonable answer.
For those interested in this solution I have made up two fiddles of it (MySQL and just for laughs, PostgreSQL). @RickJames - very nice +1! I learnt a lot doing the fiddles - thanks. GROUP_CONCAT/STRING_AGG are really nifty and powerful in the right circumstances - real eye-opener for me! Care to critique my solution?
@Vérace - "in the right circumstances" -- GROUP_CONCAT has a default limit of 1024 bytes, but that can be tuned. Still, it is limited to RAM.
Yes, I was aware of that, thanks. And thanks for your input to my answer - I'll add a bit more data and see how that affects the no. of records resulting from the CROSS JOIN - it's an interesting CS problem.

Vérace Vérace 31k9 gold badges73 silver badges86 bronze badges · Answer 2 · 2019-09-07 17:32:55Z

TL;DR The problem can be solved using SQL's `LEAST` and `GREATEST` functions. A quick look here will tell you what's wrong with the EAV data model!

It can be done this way (tested in 5.6, 5.7 and 8) using (relatively) standard SQL with no requirement for CTEs, Window functions or GROUP_CONCATs or anything fancy like that. I have decided to outline my thought processes rather than simply provide an answer. The fiddles using the OP's sample data shown below is to be found here and the fiddle for extra test data, notably with matching singleton records (discussed at the end) is available here.

Create your table (I have removed any SQL keywords as fieldnames):

CREATE TABLE tab
(
 t_id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
 t_key INTEGER NOT NULL,
 t_name VARCHAR(10) NOT NULL,
 t_value VARCHAR(10) NOT NULL
);

Populate the table:

(85, 'Color', 'Red'),
(85, 'Size', 'Big'),
(85, 'Weight', '6'),
(86, 'Color', 'Red'),
(86, 'Size', 'Big'),
(86, 'Weight', '6'),
(87, 'Color', 'Red'),
(87, 'Size', 'Small'),
(87, 'Weight', '7');

Initially run this query:

SELECT t1.t_key AS "t1key", t1.t_name AS "t1name", t1.t_value AS "t1value",
 t2.t_key AS "t2key", t2.t_name AS "t2name", t2.t_value AS "t2value"
FROM tab t1
JOIN tab t2
ON t1.t_key != t2.t_key
 AND t1.t_name = t2.t_name
 AND t1.t_value = t2.t_value
ORDER BY t1.t_id, t2.t_id;

Result (first 4 rows only of 10 for brevity):

t1key t1name t1value t2key t2name t2value
 85 Color Red 86 Color Red
 85 Color Red 87 Color Red
 85 Size Big 86 Size Big
 85 Weight 6 86 Weight 6

Then, I run this:

SELECT 
 DISTINCT LEAST(t1key, t2key) AS "lst", 
 GREATEST(t1key, t2key) AS "gst",
 COUNT(LEAST(t1key, t2key)) AS "mn_c",
 COUNT(GREATEST(t1key, t2key)) AS "mx_c"
FROM
(
 SELECT t1.t_key AS "t1key", t1.t_name AS "t1name", t1.t_value AS "t1value",
 t2.t_key AS "t2key", t2.t_name AS "t2name", t2.t_value AS "t2value"
 FROM tab t1
 JOIN tab t2
 ON t1.t_key != t2.t_key
 AND t1.t_name = t2.t_name
 AND t1.t_value = t2.t_value
 ORDER BY t1.t_id, t2.t_id
)
AS t1
GROUP BY t1.t1key, t1.t2key
ORDER BY 1, 2, 3;

Result:

lst gst mn_c mx_c
85 86 3 3
85 87 1 1
86 87 1 1

We want a unique combination of the JOIN of t_keys and a COUNT of same.

Now, we know thatt_keys 85 and 86 are perfect matches, but that 85 and 86 will match 87 on only 2 out of 3 t_names and t_values, so it's no surprise that they appear in the query, but we need a way to determine if they match on all the variables. This is determined by the mn_c, mx_c COUNTs.

If the records match on all t_names and t_values, then the COUNT will be the same as the no. of records with those keys - in this case 3. We can see this from inspection - 85 and 86 match on 3 criteria, no more and no less than the record count for both.

So, adding a:

HAVING
COUNT(LEAST(t1key, t2key)) 
 = (SELECT COUNT(tab.t_key) FROM tab WHERE t_key = t1.t1key)
AND
COUNT(GREATEST(t1key, t2key))
 = (SELECT COUNT(tab.t_key) FROM tab WHERE t_key = t1.t2key)

Which checks the COUNT of LEAST and GREATEST against the count of records with that t_keys COUNT in the original tab.

Result:

lst gst mn_c
 85 86 3

Which we know to be the matching count for the records on all values for t_name and t_value.

I added some records to test edge cases (see end of post (and fiddle) for all data used in the final tests):

(92, 'Couleur', 'Rouge'),
(92, 'Poids', 'Leger'),
(93, 'Couleur', 'Rouge'),
(93, 'Poids', 'Leger'),
(94, 'Couleur', 'Rouge'),
(95, 'Couleur', 'Rouge');

Gives the result:

lst gst mn_c
 75 76 4 -- You can explore these matching records in the fiddle.
 75 78 4 -- This was another test of edge cases to see if
 76 78 4 -- multiple matching (i.e. > 2) `t_key`s show up correctly. They do
 85 86 3
 92 93 2
 94 95 1

This is correct - 92 and 93 match, but I wanted to test a singleton as an edge case and indeed singletons which don't match are not (data not shown) - you can add a record like:

(1001, 'Poids', 'Lourd')

or any other singleton and there will be no match.

However, the absolutely CRITICAL edge case was when t_keys with only 1 record matched. These really tested my various solutions and helped me arrive at the correct one.

Finally, the orginal question asks: I'm trying to write a query where if KEY=87 it finds KEY 85 matches all attributes and values like KEY 87. This can be done by adding a simple WHERE clause thus

WHERE LEAST(t1key, t2key) = 75 OR GREATEST(t1key, t2key) = 75 -- actual no. not important. It works - see fiddle.

Which gives:

lst gst mn_c
 75 76 4
 75 78 4

This is the correct answer. Fiddle here.

A note of interest. I ran my first solution (which worked on MySQL) though a PostgreSQL fiddle and it failed. This led me to find a solution which works with PostgreSQL and this also allowed me to improve my original MySQL answer to the point that I now believe it is very solid. PostgreSQL fiddle available here. If working with MySQL, I would recommend that you also run your solutions through a PostgreSQL fiddle.

Two posts really helped with this answer. The first was this one by @KennethFisher - it's far from identical, but Kenneth's elegant solution gave me a start - using COUNTs and DISTINCTs for example. At the risk of seeming to blow my own trumpet, the inspiration for using LEAST and GREATEST came from here, my answer to an also related question concerning matching sets of numbers.

Full test data:

INSERT INTO tab (t_key, t_name, t_value)
VALUES
(75, 'Couleur', 'Bleu'), 
(75, 'Taille', 'Grand'),
(75, 'Poids', '20'),
(75, 'Teint', 'Y'),
(76, 'Couleur', 'Bleu'),
(76, 'Taille', 'Grand'),
(76, 'Poids', '20'),
(76, 'Teint', 'Y'),
(77, 'Couleur', 'Bleu'),
(77, 'Taille', 'Grand'),
(77, 'Poids', '20'),
(77, 'Teint', 'N'),
(78, 'Couleur', 'Bleu'),
(78, 'Taille', 'Grand'),
(78, 'Poids', '20'),
(78, 'Teint', 'Y'),
(85, 'Color', 'Red'),
(85, 'Size', 'Big'),
(85, 'Weight', '6'),
(86, 'Color', 'Red'),
(86, 'Size', 'Big'),
(86, 'Weight', '6'),
(87, 'Color', 'Red'),
(87, 'Size', 'Small'),
(87, 'Weight', '7'),
(92, 'Couleur', 'Rouge'),
(92, 'Poids', 'Leger'),
(93, 'Couleur', 'Rouge'),
(93, 'Poids', 'Leger'),
(94, 'Couleur', 'Rouge'),
(95, 'Couleur', 'Rouge');

Well, for my sample data of 31 rows, there's only 66 rows in the CROSS JOIN, so ~ O(2N). The OP says the range of matching attributes is 20 - 397 which should severely restrict the number of records resulting from the CROSS JOIN AFAICS.
When the values are in multiple rows in a single column, WHERE MIN(col)=MAX(col) says "they are all the same. I wonder if that would be useful to do (instead of moving the values into multiple columns in a single row).
@RickJames I added a further 60 records (2 sets of 30) with different attributes giving a total of 93 and the CROSS JOIN produces 160 records (< O(2N) - fiddle here). This will vary according to the nature of the attributes, but with good indexing I think performance should be OK. I've reread your comment above (When the values are in multiple rows in a single column,...) and am still unclear. Could you possibly clarify a bit more?
Think of it as another way to discover that COUNT(DISTINCT col) = 1

Stack Exchange Network

mysql - find duplicate sets of data grouped by foreign key

2 Answers 2

TL;DR The problem can be solved using SQL's `LEAST` and `GREATEST` functions. A quick look here will tell you what's wrong with the EAV data model!

Full test data:

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

mysql - find duplicate sets of data grouped by foreign key

2 Answers 2

TL;DR The problem can be solved using SQL's LEAST and GREATEST functions. A quick look here will tell you what's wrong with the EAV data model!

Full test data:

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions

TL;DR The problem can be solved using SQL's `LEAST` and `GREATEST` functions. A quick look here will tell you what's wrong with the EAV data model!