I have a MySQL table with a very large amount of data. I need to find the duplicates and separate them from the unique rows.
Let's say this is my table:
indx, data_lname, data_fname, data_mname, data_dob, data_mobile
What I'm doing right now is selecting all the rows in one query and then comparing each row against the same table with a second query. This works, but it is painfully slow.
Can this be done with a single query?
Private Sub poplist()
    DBstrSQL = "SELECT * FROM tbl_data ORDER BY indx ASC"
    Dim myCmd As New MySqlCommand
    myCmd.CommandTimeout = 300
    myCmd.CommandText = DBstrSQL
    myCmd.Connection = MySqlConn
    Dim myReader As MySqlDataReader = myCmd.ExecuteReader()
    If myReader.HasRows Then
        While myReader.Read()
            If checkifdup(myReader.GetString("indx"),
                          myReader.GetString("data_fname"),
                          myReader.GetString("data_mname"),
                          myReader.GetString("data_lname"),
                          myReader.GetString("data_dob"),
                          myReader.GetString("data_mobile")) = False Then
                With lstUnique.Items.Add(myReader.GetString("data_lname"))
                    .SubItems.Add(myReader.GetString("data_fname"))
                    .SubItems.Add(myReader.GetString("data_mname"))
                    .SubItems.Add(myReader.GetString("data_dob"))
                    .SubItems.Add(myReader.GetString("data_mobile"))
                End With
            Else
                With lstDup.Items.Add(myReader.GetString("data_lname"))
                    .SubItems.Add(myReader.GetString("data_fname"))
                    .SubItems.Add(myReader.GetString("data_mname"))
                    .SubItems.Add(myReader.GetString("data_dob"))
                    .SubItems.Add(myReader.GetString("data_mobile"))
                End With
            End If
        End While
    End If
    myReader.Close()
End Sub
Private Function checkifdup(dataindx As String, data1 As String, data2 As String,
                            data3 As String, data4 As String, data5 As String) As Boolean
    myCmd.CommandText = "SELECT * FROM tbl_data WHERE indx <> @indx AND data_lname = @lname AND data_mname = @mname AND data_fname = @fname AND (DATE_FORMAT(data_dob, '%m-%d-%Y') = @dob OR data_mobile = @mobile) LIMIT 1"
    myCmd.Parameters.Clear() ' remove parameters left over from the previous call
    myCmd.Parameters.AddWithValue("@lname", data3)
    myCmd.Parameters.AddWithValue("@mname", data2)
    myCmd.Parameters.AddWithValue("@fname", data1)
    myCmd.Parameters.AddWithValue("@dob", data4)
    myCmd.Parameters.AddWithValue("@mobile", data5)
    myCmd.Parameters.AddWithValue("@indx", dataindx)
    myCmd.Prepare()
    Dim myReader As MySqlDataReader = myCmd.ExecuteReader()
    If myReader.HasRows Then
        myReader.Close()
        Return True
    Else
        myReader.Close()
        Return False
    End If
End Function
Comment (user40501, Apr 11, 2014 at 6:23): It is working, but it takes a lot of time to complete: about 1 hour for 20k records. What I'm wondering is whether there is another way of doing this that yields faster results. The code is quite simple: the first query selects a range of records, say the first 20k rows, then loops through each row and cross-checks it against the table with another query.
Comment (user40501, Apr 11, 2014 at 7:07): Code edited. Basically, if checkifdup() = False the row is added to the unique list; if True, it is added to the duplicate list.
1 Answer
You should never run a query in a loop, especially a loop where the number of queries issued scales with the size of the data. There is almost always a way to formulate the SQL such that you get the results you want with a small, fixed number of queries.
In your case, you want two queries: one to find the rows that are unique (ignoring the indx column), and another to find the rows that appear more than once (again ignoring the indx column). You could formulate those queries as:
SELECT data_lname, data_fname, data_mname, data_dob, data_mobile
FROM tbl_data
GROUP BY data_lname, data_fname, data_mname, data_dob, data_mobile
HAVING COUNT(indx) = 1
ORDER BY MIN(indx);

SELECT data_lname, data_fname, data_mname, data_dob, data_mobile
FROM tbl_data
GROUP BY data_lname, data_fname, data_mname, data_dob, data_mobile
HAVING COUNT(indx) > 1
ORDER BY MIN(indx);

(Note that ORDER BY must use an aggregate such as MIN(indx) here, since indx itself is not in the GROUP BY list.)
Use those queries to populate lstUnique and lstDup, respectively.
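To see how the GROUP BY ... HAVING pattern partitions the data, here is a small sketch. It uses Python's standard-library sqlite3 in place of MySQL (the sample rows are invented for illustration); these grouping constructs behave the same way in both engines:

```python
import sqlite3

# Stand-in for the MySQL table; sample rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tbl_data (
        indx        INTEGER PRIMARY KEY,
        data_lname  TEXT, data_fname TEXT, data_mname TEXT,
        data_dob    TEXT, data_mobile TEXT
    )
""")
conn.executemany("INSERT INTO tbl_data VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "Smith", "John", "A", "1990-01-01", "555-0001"),
    (2, "Smith", "John", "A", "1990-01-01", "555-0001"),  # exact duplicate of row 1
    (3, "Doe",   "Jane", "B", "1985-05-05", "555-0002"),  # appears only once
])

cols = "data_lname, data_fname, data_mname, data_dob, data_mobile"
# Groups containing exactly one row are the unique records...
unique_rows = conn.execute(
    f"SELECT {cols} FROM tbl_data GROUP BY {cols} HAVING COUNT(indx) = 1"
).fetchall()
# ...and groups containing more than one row are the duplicated records.
dup_rows = conn.execute(
    f"SELECT {cols} FROM tbl_data GROUP BY {cols} HAVING COUNT(indx) > 1"
).fetchall()
```

One thing to notice: each duplicated group comes back as a single row, not once per duplicate, so the two Smith rows above produce one entry in dup_rows.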
Edit: I read the code carelessly, and misinterpreted the criteria for considering two records to be "duplicates". I would reformulate the query close to the way your VB code worked.
To find the unique records:
SELECT *
FROM tbl_data AS a
WHERE NOT EXISTS (
    SELECT indx
    FROM tbl_data AS b
    WHERE a.indx <> b.indx
      AND a.data_lname = b.data_lname
      AND a.data_fname = b.data_fname
      AND a.data_mname = b.data_mname
      AND (a.data_dob = b.data_dob OR a.data_mobile = b.data_mobile)
);
To find the records with duplicates, change WHERE NOT EXISTS to WHERE EXISTS.
This assumes that none of the fields can have a NULL value.
For performance, be sure that indexes exist on the table. I assume that indx, being the primary key, already has a unique index. In addition, there should be an index on (data_lname, data_fname, data_mname).
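The EXISTS formulation can be sketched the same way. Again using sqlite3 as a stand-in for MySQL with invented sample rows (and an illustrative index name), this shows both duplicate criteria firing: same name and DOB, or same name and mobile:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tbl_data (
        indx        INTEGER PRIMARY KEY,
        data_lname  TEXT, data_fname TEXT, data_mname TEXT,
        data_dob    TEXT, data_mobile TEXT
    )
""")
# The composite index suggested above; the index name is chosen here for illustration.
conn.execute("CREATE INDEX idx_name ON tbl_data (data_lname, data_fname, data_mname)")
conn.executemany("INSERT INTO tbl_data VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "Smith", "John", "A", "1990-01-01", "555-0001"),
    (2, "Smith", "John", "A", "1990-01-01", "555-9999"),  # same name + dob as row 1
    (3, "Smith", "John", "A", "1970-07-07", "555-0001"),  # same name + mobile as row 1
    (4, "Doe",   "Jane", "B", "1985-05-05", "555-0002"),  # matches nothing
])

# Correlated subquery: does some OTHER row match on name plus (dob OR mobile)?
match = """
    SELECT indx FROM tbl_data AS b
    WHERE a.indx <> b.indx
      AND a.data_lname = b.data_lname
      AND a.data_fname = b.data_fname
      AND a.data_mname = b.data_mname
      AND (a.data_dob = b.data_dob OR a.data_mobile = b.data_mobile)
"""
unique_rows = conn.execute(
    f"SELECT * FROM tbl_data AS a WHERE NOT EXISTS ({match})").fetchall()
dup_rows = conn.execute(
    f"SELECT * FROM tbl_data AS a WHERE EXISTS ({match})").fetchall()
```

Here rows 1, 2, and 3 all land in dup_rows (1 and 2 share a DOB, 1 and 3 share a mobile number), and only row 4 is unique; unlike the GROUP BY version, every duplicated row is returned individually, with its indx intact.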
Comment (user40501, Apr 11, 2014 at 7:48): I think that should do the trick. But the next problem will be the GROUP BY. You see, in my code there are two criteria for duplicates: (1) same name and DOB regardless of mobile number, or (2) same name and mobile number regardless of DOB. How do I do that with GROUP BY? Thanks in advance, you've been really helpful so far.