5
 +----+--------------+-----+-----------+----------+
 | ID | NAME | AGE | ADDRESS | SALARY |
 +----+--------------+-----+-----------+----------+
 | 1 | Ramesh Olive | 32 | Ahmedabad | 2000.00 |
 | 2 | Tan Kau | 25 | Delhi | 1500.00 |
 | 3 | Jason Tan Kau| 25 | Delhi | 2000.00 |
 | 4 | Chaitali | 25 | Mumbai | 6500.00 |
 | 5 | Hardik | 27 | Bhopal | 8500.00 |
 | 6 | Hardik Jass | 27 | Bhopal | 4500.00 |
 | 7 | Muffy John | 24 | Indore | 10000.00 |
 | 8 | Muffy Lee | 24 | Indore | 10000.00 |
 +----+--------------+-----+-----------+----------+

In example above, let said the table name is "table_a" and 1) "Tan Kau" is duplicate with "Jason Tan Kau" and 2) "Hardik" is duplicate with "Hardik Jass"

How to write SQL that will produce output like below?

I think this will work but it should be very slow. Any ideas to improve this?

Select A.*, IF(B.ID IS NULL, "", "DUP") as DUP
FROM table_a A 
LEFT JOIN table_a B 
ON A.NAME LIKE CONCATE("%", B.NAME, "%") AND A.ID != B.ID
 +----+--------------+-----+-----------+----------+-----+
 | ID | NAME | AGE | ADDRESS | SALARY | DUP |
 +----+--------------+-----+-----------+----------+-----+
 | 1 | Ramesh Olive | 32 | Ahmedabad | 2000.00 | |
 | 2 | Tan Kau | 25 | Delhi | 1500.00 | Dup |
 | 3 | Jason Tan Kau| 25 | Delhi | 2000.00 | Dup |
 | 4 | Chaitali | 25 | Mumbai | 6500.00 | |
 | 5 | Hardik | 27 | Bhopal | 8500.00 | Dup |
 | 6 | Hardik Jass | 27 | Bhopal | 4500.00 | Dup | 
 | 7 | Muffy John | 24 | Indore | 10000.00 | |
 | 8 | Muffy Lee | 24 | Indore | 10000.00 | |
 +----+--------------+-----+-----------+----------+-----+
RolandoMySQLDBA
185k34 gold badges327 silver badges541 bronze badges
asked Nov 14, 2012 at 6:20
7
  • Before touching the SQL writing keyboard, you have to set the rules for determining whether an entry is a duplicate of another one or not. Commented Nov 14, 2012 at 9:09
  • You mean we should done the validation before running INSERT SQL? Commented Nov 14, 2012 at 12:12
  • The reason I plan to do this was just to inform admin that the entry might be duplicate. My system should not prevent user to type what ever name they want. Because some of the user might have similar name. Anyway thank you and I really appreciates for your advice. Commented Nov 14, 2012 at 12:21
  • 1
    I mean that in order to validate your data, you have to set up clear rules: say, are the same the following: 'Jass Hardik', 'Hardik Jass', 'Hardik M. Jass', 'Hardik Jass Muffy'? The names are, of course, nonsensical but illustrate my concern well. If you get these rules, you can transform them into SQL or whatever. Commented Nov 14, 2012 at 12:28
  • Thank you. If the full name appear in another row, then I will consider it as duplicate. So Hardik will be duplicate to 'Jass Hardik', 'Hardik Jass', 'Hardik M. Jass', 'Hardik Jass Muffy' Commented Nov 14, 2012 at 12:53

2 Answers 2

3

Your query can return the expected results by adding the reverse condition:

SELECT A.*, IF(B.ID IS NULL, "", "DUP") as DUP
FROM persons A 
LEFT JOIN persons B 
ON a.ID <> b.ID 
AND (a.Name LIKE CONCAT ("%", b.Name, "%") OR b.Name LIKE CONCAT ("%", a.Name, "%"))
ORDER BY ID;

I don't know if it will be faster, but another way to do it would be to use INSTR:

SELECT A.*, IF(B.ID IS NULL, "", "DUP") as DUP
FROM persons A 
LEFT JOIN persons B 
ON a.ID <> b.ID 
AND (Instr(a.Name, b.Name) > 0 OR Instr(b.Name, a.Name) > 0)
ORDER BY ID;

SQL Fiddle

answered Nov 14, 2012 at 19:51
0
1

I did something a little different

SELECT DISTINCT AA.* FROM
(
 SELECT A.*,IF(IFNULL(B.ID,'')='','','Dup') DUP
 FROM table_a A LEFT JOIN table_a B ON a.ID <> b.ID 
 AND IF(LENGTH(A.name)>LENGTH(B.name),
 INSTR(A.name,B.name)>0,
 INSTR(B.name,A.name)>0)
) AA;

NOTE : I basically plagiarized Leigh's answer and augmented it slightly, so please do not mark my answer as accepted !!!

The reason I gave this is in case there are more dups present

Here is your sample data plus two extra rows:

mysql> DROP DATABASE IF EXISTS cww;
Query OK, 1 row affected (0.03 sec)
mysql> CREATE DATABASE cww;
Query OK, 1 row affected (0.00 sec)
mysql> USE cww
Database changed
mysql> CREATE TABLE table_a
 -> (
 -> ID INT NOT NULL AUTO_INCREMENT,
 -> NAME VARCHAR(25) NOT NULL,
 -> AGE INT NOT NULL,
 -> ADDRESS VARCHAR(25) NOT NULL,
 -> SALARY DECIMAL(10,2) NOT NULL,
 -> PRIMARY KEY (ID)
 -> );
Query OK, 0 rows affected (0.10 sec)
mysql> INSERT INTO table_a (NAME,AGE,ADDRESS,SALARY) VALUES
 -> ('Ramesh Olive' ,32,'Ahmedabad', 2000.00),
 -> ('Tan Kau' ,25,'Delhi' , 1500.00),
 -> ('Jason Tan Kau' ,25,'Delhi' , 2000.00),
 -> ('Jackson Tan Kau',25,'Delhi' , 2000.00),
 -> ('Chaitali' ,25,'Mumbai' , 6500.00),
 -> ('Hardik' ,27,'Bhopal' , 8500.00),
 -> ('Hardik Jass' ,27,'Bhopal' , 4500.00),
 -> ('Hardik Jess' ,27,'Bhopal' , 4500.00),
 -> ('Muffy John' ,24,'Indore' , 10000.00),
 -> ('Muffy Lee' ,24,'Indore' , 10000.00);
Query OK, 10 rows affected (0.05 sec)
Records: 10 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM table_a;
+----+-----------------+-----+-----------+----------+
| ID | NAME | AGE | ADDRESS | SALARY |
+----+-----------------+-----+-----------+----------+
| 1 | Ramesh Olive | 32 | Ahmedabad | 2000.00 |
| 2 | Tan Kau | 25 | Delhi | 1500.00 |
| 3 | Jason Tan Kau | 25 | Delhi | 2000.00 |
| 4 | Jackson Tan Kau | 25 | Delhi | 2000.00 |
| 5 | Chaitali | 25 | Mumbai | 6500.00 |
| 6 | Hardik | 27 | Bhopal | 8500.00 |
| 7 | Hardik Jass | 27 | Bhopal | 4500.00 |
| 8 | Hardik Jess | 27 | Bhopal | 4500.00 |
| 9 | Muffy John | 24 | Indore | 10000.00 |
| 10 | Muffy Lee | 24 | Indore | 10000.00 |
+----+-----------------+-----+-----------+----------+
10 rows in set (0.00 sec)
mysql>

Notice how my augmented query handles the dups properly

mysql> SELECT DISTINCT AA.* FROM
 -> (
 -> SELECT A.*,IF(IFNULL(B.ID,'')='','','Dup') DUP
 -> FROM table_a A LEFT JOIN table_a B ON a.ID <> b.ID
 -> AND IF(LENGTH(A.name)>LENGTH(B.name),
 -> INSTR(A.name,B.name)>0,
 -> INSTR(B.name,A.name)>0)
 -> ) AA;
+----+-----------------+-----+-----------+----------+-----+
| ID | NAME | AGE | ADDRESS | SALARY | DUP |
+----+-----------------+-----+-----------+----------+-----+
| 1 | Ramesh Olive | 32 | Ahmedabad | 2000.00 | |
| 2 | Tan Kau | 25 | Delhi | 1500.00 | Dup |
| 3 | Jason Tan Kau | 25 | Delhi | 2000.00 | Dup |
| 4 | Jackson Tan Kau | 25 | Delhi | 2000.00 | Dup |
| 5 | Chaitali | 25 | Mumbai | 6500.00 | |
| 6 | Hardik | 27 | Bhopal | 8500.00 | Dup |
| 7 | Hardik Jass | 27 | Bhopal | 4500.00 | Dup |
| 8 | Hardik Jess | 27 | Bhopal | 4500.00 | Dup |
| 9 | Muffy John | 24 | Indore | 10000.00 | |
| 10 | Muffy Lee | 24 | Indore | 10000.00 | |
+----+-----------------+-----+-----------+----------+-----+
10 rows in set (0.00 sec)
mysql>

In the face of more dups, Leigh's query does this:

mysql> SELECT A.*, IF(B.ID IS NULL, "", "DUP") as DUP
 -> FROM table_a A
 -> LEFT JOIN table_a B
 -> ON a.ID <> b.ID
 -> AND (Instr(a.Name, b.Name) > 0 OR Instr(b.Name, a.Name) > 0)
 -> ORDER BY ID;
+----+-----------------+-----+-----------+----------+-----+
| ID | NAME | AGE | ADDRESS | SALARY | DUP |
+----+-----------------+-----+-----------+----------+-----+
| 1 | Ramesh Olive | 32 | Ahmedabad | 2000.00 | |
| 2 | Tan Kau | 25 | Delhi | 1500.00 | DUP |
| 2 | Tan Kau | 25 | Delhi | 1500.00 | DUP |
| 3 | Jason Tan Kau | 25 | Delhi | 2000.00 | DUP |
| 4 | Jackson Tan Kau | 25 | Delhi | 2000.00 | DUP |
| 5 | Chaitali | 25 | Mumbai | 6500.00 | |
| 6 | Hardik | 27 | Bhopal | 8500.00 | DUP |
| 6 | Hardik | 27 | Bhopal | 8500.00 | DUP |
| 7 | Hardik Jass | 27 | Bhopal | 4500.00 | DUP |
| 8 | Hardik Jess | 27 | Bhopal | 4500.00 | DUP |
| 9 | Muffy John | 24 | Indore | 10000.00 | |
| 10 | Muffy Lee | 24 | Indore | 10000.00 | |
+----+-----------------+-----+-----------+----------+-----+
12 rows in set (0.00 sec)
mysql>

@LeighRiffel's answer just needs to be imbedded in a subquery and made DISTINCT:

mysql> SELECT DISTINCT * FROM (
 -> SELECT A.*, IF(B.ID IS NULL, "", "DUP") as DUP
 -> FROM table_a A
 -> LEFT JOIN table_a B
 -> ON a.ID <> b.ID
 -> AND (Instr(a.Name, b.Name) > 0 OR Instr(b.Name, a.Name) > 0)
 -> ORDER BY ID) AA;
+----+-----------------+-----+-----------+----------+-----+
| ID | NAME | AGE | ADDRESS | SALARY | DUP |
+----+-----------------+-----+-----------+----------+-----+
| 1 | Ramesh Olive | 32 | Ahmedabad | 2000.00 | |
| 2 | Tan Kau | 25 | Delhi | 1500.00 | DUP |
| 3 | Jason Tan Kau | 25 | Delhi | 2000.00 | DUP |
| 4 | Jackson Tan Kau | 25 | Delhi | 2000.00 | DUP |
| 5 | Chaitali | 25 | Mumbai | 6500.00 | |
| 6 | Hardik | 27 | Bhopal | 8500.00 | DUP |
| 7 | Hardik Jass | 27 | Bhopal | 4500.00 | DUP |
| 8 | Hardik Jess | 27 | Bhopal | 4500.00 | DUP |
| 9 | Muffy John | 24 | Indore | 10000.00 | |
| 10 | Muffy Lee | 24 | Indore | 10000.00 | |
+----+-----------------+-----+-----------+----------+-----+
10 rows in set (0.00 sec)
mysql>

Nevertheless, Leigh's answer did provide the needed SQL principles upfront.

Therefore, I give him a +1 !!!

answered Nov 14, 2012 at 21:47
3
  • I think add a group by should solve the duplicate for Leigh's answer. SELECT A.*, IF(B.ID IS NULL, "", "DUP") as DUP FROM persons A LEFT JOIN persons B ON a.ID <> b.ID AND (Instr(a.Name, b.Name) > 0 OR Instr(b.Name, a.Name) > 0) GROUP BY a.ID ORDER BY ID; Commented Nov 15, 2012 at 5:33
  • I am curious about the IF conditions in your SQL. Will it work faster since it reduce another condition "OR"? IF(LENGTH(A.name)>LENGTH(B.name), INSTR(A.name,B.name)>0, INSTR(B.name,A.name)>0) Commented Nov 15, 2012 at 5:37
  • @Rolando - You are correct that a DISTINCT is necessary if there is more than one match for an entry and this seems likely, so good catch. However, a subquery is not required as the DISTINCT will work just fine on the base query. Commented Nov 15, 2012 at 13:20

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.