MySQL check duplicate with group by using wildcard?

Question 1

 +----+--------------+-----+-----------+----------+
 | ID | NAME | AGE | ADDRESS | SALARY |
 +----+--------------+-----+-----------+----------+
 | 1 | Ramesh Olive | 32 | Ahmedabad | 2000.00 |
 | 2 | Tan Kau | 25 | Delhi | 1500.00 |
 | 3 | Jason Tan Kau| 25 | Delhi | 2000.00 |
 | 4 | Chaitali | 25 | Mumbai | 6500.00 |
 | 5 | Hardik | 27 | Bhopal | 8500.00 |
 | 6 | Hardik Jass | 27 | Bhopal | 4500.00 |
 | 7 | Muffy John | 24 | Indore | 10000.00 |
 | 8 | Muffy Lee | 24 | Indore | 10000.00 |
 +----+--------------+-----+-----------+----------+

In example above, let said the table name is "table_a" and 1) "Tan Kau" is duplicate with "Jason Tan Kau" and 2) "Hardik" is duplicate with "Hardik Jass"

How to write SQL that will produce output like below?

I think this will work but it should be very slow. Any ideas to improve this?

Select A.*, IF(B.ID IS NULL, "", "DUP") as DUP
FROM table_a A 
LEFT JOIN table_a B 
ON A.NAME LIKE CONCATE("%", B.NAME, "%") AND A.ID != B.ID
 +----+--------------+-----+-----------+----------+-----+
 | ID | NAME | AGE | ADDRESS | SALARY | DUP |
 +----+--------------+-----+-----------+----------+-----+
 | 1 | Ramesh Olive | 32 | Ahmedabad | 2000.00 | |
 | 2 | Tan Kau | 25 | Delhi | 1500.00 | Dup |
 | 3 | Jason Tan Kau| 25 | Delhi | 2000.00 | Dup |
 | 4 | Chaitali | 25 | Mumbai | 6500.00 | |
 | 5 | Hardik | 27 | Bhopal | 8500.00 | Dup |
 | 6 | Hardik Jass | 27 | Bhopal | 4500.00 | Dup | 
 | 7 | Muffy John | 24 | Indore | 10000.00 | |
 | 8 | Muffy Lee | 24 | Indore | 10000.00 | |
 +----+--------------+-----+-----------+----------+-----+

Question 2

Before touching the SQL writing keyboard, you have to set the rules for determining whether an entry is a duplicate of another one or not.

Question 3

You mean we should done the validation before running INSERT SQL?

Question 4

The reason I plan to do this was just to inform admin that the entry might be duplicate. My system should not prevent user to type what ever name they want. Because some of the user might have similar name. Anyway thank you and I really appreciates for your advice.

Question 5

I mean that in order to validate your data, you have to set up clear rules: say, are the same the following: 'Jass Hardik', 'Hardik Jass', 'Hardik M. Jass', 'Hardik Jass Muffy'? The names are, of course, nonsensical but illustrate my concern well. If you get these rules, you can transform them into SQL or whatever.

Question 6

Thank you. If the full name appear in another row, then I will consider it as duplicate. So Hardik will be duplicate to 'Jass Hardik', 'Hardik Jass', 'Hardik M. Jass', 'Hardik Jass Muffy'

Question 7

Your query can return the expected results by adding the reverse condition:

SELECT A.*, IF(B.ID IS NULL, "", "DUP") as DUP
FROM persons A 
LEFT JOIN persons B 
ON a.ID <> b.ID 
AND (a.Name LIKE CONCAT ("%", b.Name, "%") OR b.Name LIKE CONCAT ("%", a.Name, "%"))
ORDER BY ID;

I don't know if it will be faster, but another way to do it would be to use INSTR:

SELECT A.*, IF(B.ID IS NULL, "", "DUP") as DUP
FROM persons A 
LEFT JOIN persons B 
ON a.ID <> b.ID 
AND (Instr(a.Name, b.Name) > 0 OR Instr(b.Name, a.Name) > 0)
ORDER BY ID;

^{SQL Fiddle}

Question 8

I did something a little different

SELECT DISTINCT AA.* FROM
(
 SELECT A.*,IF(IFNULL(B.ID,'')='','','Dup') DUP
 FROM table_a A LEFT JOIN table_a B ON a.ID <> b.ID 
 AND IF(LENGTH(A.name)>LENGTH(B.name),
 INSTR(A.name,B.name)>0,
 INSTR(B.name,A.name)>0)
) AA;

NOTE : I basically plagiarized Leigh's answer and augmented it slightly, so please do not mark my answer as accepted !!!

The reason I gave this is in case there are more dups present

Here is your sample data plus two extra rows:

mysql> DROP DATABASE IF EXISTS cww;
Query OK, 1 row affected (0.03 sec)
mysql> CREATE DATABASE cww;
Query OK, 1 row affected (0.00 sec)
mysql> USE cww
Database changed
mysql> CREATE TABLE table_a
 -> (
 -> ID INT NOT NULL AUTO_INCREMENT,
 -> NAME VARCHAR(25) NOT NULL,
 -> AGE INT NOT NULL,
 -> ADDRESS VARCHAR(25) NOT NULL,
 -> SALARY DECIMAL(10,2) NOT NULL,
 -> PRIMARY KEY (ID)
 -> );
Query OK, 0 rows affected (0.10 sec)
mysql> INSERT INTO table_a (NAME,AGE,ADDRESS,SALARY) VALUES
 -> ('Ramesh Olive' ,32,'Ahmedabad', 2000.00),
 -> ('Tan Kau' ,25,'Delhi' , 1500.00),
 -> ('Jason Tan Kau' ,25,'Delhi' , 2000.00),
 -> ('Jackson Tan Kau',25,'Delhi' , 2000.00),
 -> ('Chaitali' ,25,'Mumbai' , 6500.00),
 -> ('Hardik' ,27,'Bhopal' , 8500.00),
 -> ('Hardik Jass' ,27,'Bhopal' , 4500.00),
 -> ('Hardik Jess' ,27,'Bhopal' , 4500.00),
 -> ('Muffy John' ,24,'Indore' , 10000.00),
 -> ('Muffy Lee' ,24,'Indore' , 10000.00);
Query OK, 10 rows affected (0.05 sec)
Records: 10 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM table_a;
+----+-----------------+-----+-----------+----------+
| ID | NAME | AGE | ADDRESS | SALARY |
+----+-----------------+-----+-----------+----------+
| 1 | Ramesh Olive | 32 | Ahmedabad | 2000.00 |
| 2 | Tan Kau | 25 | Delhi | 1500.00 |
| 3 | Jason Tan Kau | 25 | Delhi | 2000.00 |
| 4 | Jackson Tan Kau | 25 | Delhi | 2000.00 |
| 5 | Chaitali | 25 | Mumbai | 6500.00 |
| 6 | Hardik | 27 | Bhopal | 8500.00 |
| 7 | Hardik Jass | 27 | Bhopal | 4500.00 |
| 8 | Hardik Jess | 27 | Bhopal | 4500.00 |
| 9 | Muffy John | 24 | Indore | 10000.00 |
| 10 | Muffy Lee | 24 | Indore | 10000.00 |
+----+-----------------+-----+-----------+----------+
10 rows in set (0.00 sec)
mysql>

Notice how my augmented query handles the dups properly

mysql> SELECT DISTINCT AA.* FROM
 -> (
 -> SELECT A.*,IF(IFNULL(B.ID,'')='','','Dup') DUP
 -> FROM table_a A LEFT JOIN table_a B ON a.ID <> b.ID
 -> AND IF(LENGTH(A.name)>LENGTH(B.name),
 -> INSTR(A.name,B.name)>0,
 -> INSTR(B.name,A.name)>0)
 -> ) AA;
+----+-----------------+-----+-----------+----------+-----+
| ID | NAME | AGE | ADDRESS | SALARY | DUP |
+----+-----------------+-----+-----------+----------+-----+
| 1 | Ramesh Olive | 32 | Ahmedabad | 2000.00 | |
| 2 | Tan Kau | 25 | Delhi | 1500.00 | Dup |
| 3 | Jason Tan Kau | 25 | Delhi | 2000.00 | Dup |
| 4 | Jackson Tan Kau | 25 | Delhi | 2000.00 | Dup |
| 5 | Chaitali | 25 | Mumbai | 6500.00 | |
| 6 | Hardik | 27 | Bhopal | 8500.00 | Dup |
| 7 | Hardik Jass | 27 | Bhopal | 4500.00 | Dup |
| 8 | Hardik Jess | 27 | Bhopal | 4500.00 | Dup |
| 9 | Muffy John | 24 | Indore | 10000.00 | |
| 10 | Muffy Lee | 24 | Indore | 10000.00 | |
+----+-----------------+-----+-----------+----------+-----+
10 rows in set (0.00 sec)
mysql>

In the face of more dups, Leigh's query does this:

mysql> SELECT A.*, IF(B.ID IS NULL, "", "DUP") as DUP
 -> FROM table_a A
 -> LEFT JOIN table_a B
 -> ON a.ID <> b.ID
 -> AND (Instr(a.Name, b.Name) > 0 OR Instr(b.Name, a.Name) > 0)
 -> ORDER BY ID;
+----+-----------------+-----+-----------+----------+-----+
| ID | NAME | AGE | ADDRESS | SALARY | DUP |
+----+-----------------+-----+-----------+----------+-----+
| 1 | Ramesh Olive | 32 | Ahmedabad | 2000.00 | |
| 2 | Tan Kau | 25 | Delhi | 1500.00 | DUP |
| 2 | Tan Kau | 25 | Delhi | 1500.00 | DUP |
| 3 | Jason Tan Kau | 25 | Delhi | 2000.00 | DUP |
| 4 | Jackson Tan Kau | 25 | Delhi | 2000.00 | DUP |
| 5 | Chaitali | 25 | Mumbai | 6500.00 | |
| 6 | Hardik | 27 | Bhopal | 8500.00 | DUP |
| 6 | Hardik | 27 | Bhopal | 8500.00 | DUP |
| 7 | Hardik Jass | 27 | Bhopal | 4500.00 | DUP |
| 8 | Hardik Jess | 27 | Bhopal | 4500.00 | DUP |
| 9 | Muffy John | 24 | Indore | 10000.00 | |
| 10 | Muffy Lee | 24 | Indore | 10000.00 | |
+----+-----------------+-----+-----------+----------+-----+
12 rows in set (0.00 sec)
mysql>

@LeighRiffel's answer just needs to be imbedded in a subquery and made DISTINCT:

mysql> SELECT DISTINCT * FROM (
 -> SELECT A.*, IF(B.ID IS NULL, "", "DUP") as DUP
 -> FROM table_a A
 -> LEFT JOIN table_a B
 -> ON a.ID <> b.ID
 -> AND (Instr(a.Name, b.Name) > 0 OR Instr(b.Name, a.Name) > 0)
 -> ORDER BY ID) AA;
+----+-----------------+-----+-----------+----------+-----+
| ID | NAME | AGE | ADDRESS | SALARY | DUP |
+----+-----------------+-----+-----------+----------+-----+
| 1 | Ramesh Olive | 32 | Ahmedabad | 2000.00 | |
| 2 | Tan Kau | 25 | Delhi | 1500.00 | DUP |
| 3 | Jason Tan Kau | 25 | Delhi | 2000.00 | DUP |
| 4 | Jackson Tan Kau | 25 | Delhi | 2000.00 | DUP |
| 5 | Chaitali | 25 | Mumbai | 6500.00 | |
| 6 | Hardik | 27 | Bhopal | 8500.00 | DUP |
| 7 | Hardik Jass | 27 | Bhopal | 4500.00 | DUP |
| 8 | Hardik Jess | 27 | Bhopal | 4500.00 | DUP |
| 9 | Muffy John | 24 | Indore | 10000.00 | |
| 10 | Muffy Lee | 24 | Indore | 10000.00 | |
+----+-----------------+-----+-----------+----------+-----+
10 rows in set (0.00 sec)
mysql>

Nevertheless, Leigh's answer did provide the needed SQL principles upfront.

Therefore, I give him a +1 !!!

Question 9

I think add a group by should solve the duplicate for Leigh's answer.

SELECT A.*, IF(B.ID IS NULL, "", "DUP") as DUP FROM persons A LEFT JOIN persons B ON a.ID <> b.ID AND (Instr(a.Name, b.Name) > 0 OR Instr(b.Name, a.Name) > 0) GROUP BY a.ID ORDER BY ID;

Question 10

I am curious about the IF conditions in your SQL. Will it work faster since it reduce another condition "OR"? IF(LENGTH(A.name)>LENGTH(B.name), INSTR(A.name,B.name)>0, INSTR(B.name,A.name)>0)

Question 11

@Rolando - You are correct that a DISTINCT is necessary if there is more than one match for an entry and this seems likely, so good catch. However, a subquery is not required as the DISTINCT will work just fine on the base query.

Leigh Riffel Leigh Riffel 23.9k17 gold badges80 silver badges155 bronze badges · Accepted Answer · 2012-11-14 19:51:40Z

Your query can return the expected results by adding the reverse condition:

SELECT A.*, IF(B.ID IS NULL, "", "DUP") as DUP
FROM persons A 
LEFT JOIN persons B 
ON a.ID <> b.ID 
AND (a.Name LIKE CONCAT ("%", b.Name, "%") OR b.Name LIKE CONCAT ("%", a.Name, "%"))
ORDER BY ID;

I don't know if it will be faster, but another way to do it would be to use INSTR:

SELECT A.*, IF(B.ID IS NULL, "", "DUP") as DUP
FROM persons A 
LEFT JOIN persons B 
ON a.ID <> b.ID 
AND (Instr(a.Name, b.Name) > 0 OR Instr(b.Name, a.Name) > 0)
ORDER BY ID;

^{SQL Fiddle}

Stack Exchange Network

MySQL check duplicate with group by using wildcard?

2 Answers 2

NOTE : I basically plagiarized Leigh's answer and augmented it slightly, so please do not mark my answer as accepted !!!

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

MySQL check duplicate with group by using wildcard?

2 Answers 2

NOTE : I basically plagiarized Leigh's answer and augmented it slightly, so please do not mark my answer as accepted !!!

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions