I have a large mysql table (500,000 records). I need to find sets of data that have all the same attributes (KEY, NAME AND VALUE). Some sets have 20 attributes others have 397 per each KEY. The table looks like this. KEY is a foreign key.
ID, KEY, NAME, VALUE 1 87 Color Red 2 87 Size Big 3 87 Weight 6 4 85 Color Red 5 85 Size Big 6 85 Weight 6 7 96 Color Red 8 96 Size Small 8 96 Weight 7
I'm trying to write a query where if KEY=87 it finds KEY 85 matches all attributes and values like KEY 87. There are 397 different attributes rather than just the 3 I show here. I can do this in php but it's sloppy and I want to learn more about mysql but I can't get my head around it. Appreciate any help in advance.
-
I agree about EAV but if I was to abandon it I would end up with over 397 columns in my table to account for all the attributes and many would be NULL.abide– abide2017年09月27日 14:57:32 +00:00Commented Sep 27, 2017 at 14:57
-
Does it have to be done using a single query or would you accept a solution using a Stored Procedure?KevH– KevH2019年09月02日 12:33:15 +00:00Commented Sep 2, 2019 at 12:33
2 Answers 2
- In a subquery, fetch all the attributes (I cringe at EAV, too).
GROUP_CONCAT(... ORDER BY ...) as foo
to combine them into a single string.- In the outer query,
SELECT GROUP_CONCAT(ky) ... GROUP BY foo HAVING COUNT(*) > 1
to get the duplicate kys. (Please don't use the keyword "KEY".)
Give that a try. If you get in trouble, show us SHOW CREATE TABLE
and your failed attempt. (Or abandon EAV.)
Some code (not quite mimicking the above prose):
SELECT `key`, CONCAT(`name`, '=', `value`) AS kv FROM tbl
should give you
key kv
87 Color=Red
87 Size=Big
...
Then...
SELECT `key`,
GROUP_CONCAT(kv SEPARATOR '|' ORDER BY kv) AS kvs
FROM ( -- the above query -- ) AS t1
should give you
key kvs
87 Color=Red|Size=Big|...
85 ...
Now...
SELECT kvs,
GROUP_CONCAT(`key`) AS keys
FROM ( -- second select -- ) AS t2
GROUP BY kvs
HAVING COUNT(*) > 1
should give you
kvs keys
Color=Red|Size=Big|Weight=6 87,85
... (perhaps not including 96)
I suggest you debug my code the way I wrote it -- test the first query to see if it gives you the first set of output, then second, then third.
-
I am not sure why someone posted a bounty on this question, This looks like a reasonable answer.James Jenkins– James Jenkins2019年09月06日 17:21:36 +00:00Commented Sep 6, 2019 at 17:21
-
@Imran - You have a similar task?Rick James– Rick James2019年09月06日 17:23:50 +00:00Commented Sep 6, 2019 at 17:23
-
For those interested in this solution I have made up two fiddles of it (MySQL and just for laughs, PostgreSQL). @RickJames - very nice +1! I learnt a lot doing the fiddles - thanks.
GROUP_CONCAT/STRING_AGG
are really nifty and powerful in the right circumstances - real eye-opener for me! Care to critique my solution?Vérace– Vérace2019年09月08日 20:51:52 +00:00Commented Sep 8, 2019 at 20:51 -
@Vérace - "in the right circumstances" --
GROUP_CONCAT
has a default limit of 1024 bytes, but that can be tuned. Still, it is limited to RAM.Rick James– Rick James2019年09月08日 21:42:51 +00:00Commented Sep 8, 2019 at 21:42 -
Yes, I was aware of that, thanks. And thanks for your input to my answer - I'll add a bit more data and see how that affects the no. of records resulting from the
CROSS JOIN
- it's an interesting CS problem.Vérace– Vérace2019年09月08日 21:47:12 +00:00Commented Sep 8, 2019 at 21:47
TL;DR The problem can be solved using SQL's LEAST
and GREATEST
functions. A quick look here will tell you what's wrong with the EAV data model!
It can be done this way (tested in 5.6, 5.7 and 8) using (relatively) standard SQL with no requirement for CTE
s, Window functions or GROUP_CONCAT
s or anything fancy like that. I have decided to outline my thought processes rather than simply provide an answer. The fiddles using the OP's sample data shown below is to be found here and the fiddle for extra test data, notably with matching singleton records (discussed at the end) is available here.
Create your table (I have removed any SQL keywords as fieldnames):
CREATE TABLE tab
(
t_id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
t_key INTEGER NOT NULL,
t_name VARCHAR(10) NOT NULL,
t_value VARCHAR(10) NOT NULL
);
Populate the table:
(85, 'Color', 'Red'),
(85, 'Size', 'Big'),
(85, 'Weight', '6'),
(86, 'Color', 'Red'),
(86, 'Size', 'Big'),
(86, 'Weight', '6'),
(87, 'Color', 'Red'),
(87, 'Size', 'Small'),
(87, 'Weight', '7');
Initially run this query:
SELECT t1.t_key AS "t1key", t1.t_name AS "t1name", t1.t_value AS "t1value",
t2.t_key AS "t2key", t2.t_name AS "t2name", t2.t_value AS "t2value"
FROM tab t1
JOIN tab t2
ON t1.t_key != t2.t_key
AND t1.t_name = t2.t_name
AND t1.t_value = t2.t_value
ORDER BY t1.t_id, t2.t_id;
Result (first 4 rows only of 10 for brevity):
t1key t1name t1value t2key t2name t2value
85 Color Red 86 Color Red
85 Color Red 87 Color Red
85 Size Big 86 Size Big
85 Weight 6 86 Weight 6
Then, I run this:
SELECT
DISTINCT LEAST(t1key, t2key) AS "lst",
GREATEST(t1key, t2key) AS "gst",
COUNT(LEAST(t1key, t2key)) AS "mn_c",
COUNT(GREATEST(t1key, t2key)) AS "mx_c"
FROM
(
SELECT t1.t_key AS "t1key", t1.t_name AS "t1name", t1.t_value AS "t1value",
t2.t_key AS "t2key", t2.t_name AS "t2name", t2.t_value AS "t2value"
FROM tab t1
JOIN tab t2
ON t1.t_key != t2.t_key
AND t1.t_name = t2.t_name
AND t1.t_value = t2.t_value
ORDER BY t1.t_id, t2.t_id
)
AS t1
GROUP BY t1.t1key, t1.t2key
ORDER BY 1, 2, 3;
Result:
lst gst mn_c mx_c
85 86 3 3
85 87 1 1
86 87 1 1
We want a unique combination of the JOIN
of t_key
s and a COUNT
of same.
Now, we know thatt_keys
85
and 86
are perfect matches, but that 85
and 86
will match 87
on only 2 out of 3 t_name
s and t_values
, so it's no surprise that they appear in the query, but we need a way to determine if they match on all
the variables. This is determined by the mn_c
, mx_c
COUNT
s.
If the records match on all t_name
s and t_value
s, then the COUNT
will be the same as the no. of records with those keys - in this case 3. We can see this from inspection - 85 and 86 match on 3 criteria, no more and no less than the record count for both.
So, adding a:
HAVING
COUNT(LEAST(t1key, t2key))
= (SELECT COUNT(tab.t_key) FROM tab WHERE t_key = t1.t1key)
AND
COUNT(GREATEST(t1key, t2key))
= (SELECT COUNT(tab.t_key) FROM tab WHERE t_key = t1.t2key)
Which checks the COUNT
of LEAST
and GREATEST
against the count of records with that t_key
s COUNT
in the original tab
.
Result:
lst gst mn_c
85 86 3
Which we know to be the matching count for the records on all values for t_name
and t_value
.
I added some records to test edge cases (see end of post (and fiddle) for all data used in the final tests):
(92, 'Couleur', 'Rouge'),
(92, 'Poids', 'Leger'),
(93, 'Couleur', 'Rouge'),
(93, 'Poids', 'Leger'),
(94, 'Couleur', 'Rouge'),
(95, 'Couleur', 'Rouge');
Gives the result:
lst gst mn_c
75 76 4 -- You can explore these matching records in the fiddle.
75 78 4 -- This was another test of edge cases to see if
76 78 4 -- multiple matching (i.e. > 2) `t_key`s show up correctly. They do
85 86 3
92 93 2
94 95 1
This is correct - 92 and 93 match, but I wanted to test a singleton as an edge case and indeed singletons which don't match are not (data not shown) - you can add a record like:
(1001, 'Poids', 'Lourd')
or any other singleton and there will be no match.
However, the absolutely CRITICAL edge case was when t_key
s with only 1 record matched. These really tested my various solutions and helped me arrive at the correct one.
Finally, the orginal question asks: I'm trying to write a query where if KEY=87 it finds KEY 85 matches all attributes and values like KEY 87
. This can be done by adding a simple WHERE
clause thus
WHERE LEAST(t1key, t2key) = 75 OR GREATEST(t1key, t2key) = 75 -- actual no. not important. It works - see fiddle.
Which gives:
lst gst mn_c
75 76 4
75 78 4
This is the correct answer. Fiddle here.
A note of interest. I ran my first solution (which worked on MySQL) though a PostgreSQL fiddle and it failed. This led me to find a solution which works with PostgreSQL and this also allowed me to improve my original MySQL answer to the point that I now believe it is very solid. PostgreSQL fiddle available here. If working with MySQL, I would recommend that you also run your solutions through a PostgreSQL fiddle.
Two posts really helped with this answer. The first was this one by @KennethFisher - it's far from identical, but Kenneth's elegant solution gave me a start - using COUNT
s and DISTINCT
s for example. At the risk of seeming to blow my own trumpet, the inspiration for using LEAST
and GREATEST
came from here, my answer to an also related question concerning matching sets of numbers.
Full test data:
INSERT INTO tab (t_key, t_name, t_value)
VALUES
(75, 'Couleur', 'Bleu'),
(75, 'Taille', 'Grand'),
(75, 'Poids', '20'),
(75, 'Teint', 'Y'),
(76, 'Couleur', 'Bleu'),
(76, 'Taille', 'Grand'),
(76, 'Poids', '20'),
(76, 'Teint', 'Y'),
(77, 'Couleur', 'Bleu'),
(77, 'Taille', 'Grand'),
(77, 'Poids', '20'),
(77, 'Teint', 'N'),
(78, 'Couleur', 'Bleu'),
(78, 'Taille', 'Grand'),
(78, 'Poids', '20'),
(78, 'Teint', 'Y'),
(85, 'Color', 'Red'),
(85, 'Size', 'Big'),
(85, 'Weight', '6'),
(86, 'Color', 'Red'),
(86, 'Size', 'Big'),
(86, 'Weight', '6'),
(87, 'Color', 'Red'),
(87, 'Size', 'Small'),
(87, 'Weight', '7'),
(92, 'Couleur', 'Rouge'),
(92, 'Poids', 'Leger'),
(93, 'Couleur', 'Rouge'),
(93, 'Poids', 'Leger'),
(94, 'Couleur', 'Rouge'),
(95, 'Couleur', 'Rouge');
-
Well, for my sample data of 31 rows, there's only 66 rows in the
CROSS JOIN
, so ~ O(2N). The OP says the range of matching attributes is 20 - 397 which should severely restrict the number of records resulting from theCROSS JOIN
AFAICS.Vérace– Vérace2019年09月08日 21:44:26 +00:00Commented Sep 8, 2019 at 21:44 -
I stand corrected.Rick James– Rick James2019年09月08日 21:47:11 +00:00Commented Sep 8, 2019 at 21:47
-
When the values are in multiple rows in a single column,
WHERE MIN(col)=MAX(col)
says "they are all the same. I wonder if that would be useful to do (instead of moving the values into multiple columns in a single row).Rick James– Rick James2019年09月08日 21:51:35 +00:00Commented Sep 8, 2019 at 21:51 -
@RickJames I added a further 60 records (2 sets of 30) with different attributes giving a total of 93 and the
CROSS JOIN
produces 160 records (< O(2N) - fiddle here). This will vary according to the nature of the attributes, but with good indexing I think performance should be OK. I've reread your comment above (When the values are in multiple rows in a single column,...
) and am still unclear. Could you possibly clarify a bit more?Vérace– Vérace2019年09月09日 06:01:02 +00:00Commented Sep 9, 2019 at 6:01 -
Think of it as another way to discover that
COUNT(DISTINCT col) = 1
Rick James– Rick James2019年09月09日 22:43:24 +00:00Commented Sep 9, 2019 at 22:43