5
\$\begingroup\$

I have a MySQL table with a very large data. What I need is to find and separate duplicates from the unique rows

Let's say this is my table:

indx, data_lname, data_fname, data_mname, data_dob, data_mobile

What I'm doing right now is selecting all the rows in one query and then comparing each row with the same table using a different query. This works ok but slow as hell.

Can this be done with a single query?

private sub poplist()
 DBstrSQL = "SELECT * from tbl_data order by indx asc"
 Dim myCmd As New MySqlCommand
 myCmd.CommandTimeout = 300
 myCmd.CommandText = DBstrSQL
 myCmd.Connection = MySqlConn
 Dim myReader As MySqlDataReader = myCmd.ExecuteReader()
 If myReader.HasRows = True Then
 While myReader.Read()
 if checkifdup(myReader.GetString("indx"),myReader.GetString("data_fname"),myReader.GetString("data_mname"),myReader.GetString("data_lname"),myReader.GetString("data_dob"),myReader.GetString("data_mobile"))=false then
 With lstUnique.Items.Add(myreader.getstring("data_lname"))
 .SubItems.Add(myreader.getstring("data_fname"))
 .SubItems.Add(myreader.getstring("data_mname"))
 .SubItems.Add(myreader.getstring("data_dob"))
 .SubItems.Add(myreader.getstring("data_mobile"))
 End With
 else
 With lstDup.Items.Add(myreader.getstring("data_lname"))
 .SubItems.Add(myreader.getstring("data_fname"))
 .SubItems.Add(myreader.getstring("data_mname"))
 .SubItems.Add(myreader.getstring("data_dob"))
 .SubItems.Add(myreader.getstring("data_mobile"))
 End With
 end if
 End While
 end if
 myReader.Close()
end sub
private function checkifdup(dataindx sa string, data1 as string,data2 as string, data3 as string, data4 as string, data5 as string) as boolean
 myCmd.CommandText = "SELECT * from tbl_data where indx<>@indx and data_lname=@lname and data_mname=@mname and data_fname=@fname and (date_format(data_dob,'%m-%d-%Y')=@dob or data_mobile=@mobile) limit 1"
 myCmd.Prepare()
 myCmd.Parameters.AddWithValue("@lname", data3)
 myCmd.Parameters.AddWithValue("@mname", data2)
 myCmd.Parameters.AddWithValue("@fname", data1)
 myCmd.Parameters.AddWithValue("@dob", data4)
 myCmd.Parameters.AddWithValue("@mobile", data5)
 myCmd.Parameters.AddWithValue("@indx", dataindx)
 Dim myReader As MySqlDataReader = myCmd.ExecuteReader()
 If myReader.HasRows = True Then
 myReader.Close()
 return True
 Else
 myReader.Close()
 return false
 End If
end function
Mast
13.8k12 gold badges56 silver badges127 bronze badges
asked Apr 11, 2014 at 6:12
\$\endgroup\$
2
  • \$\begingroup\$ It is working but it takes a lot of time to complete, say 1 hour for about 20k records. What im thinking is if there is another way of doing this that yields faster results. the code is quite simple, the first query is selecting a range of records, say, the first 20k rows, and then looping through each of the rows and cross checking them in another query. \$\endgroup\$ Commented Apr 11, 2014 at 6:23
  • \$\begingroup\$ *code edited. basically if checkifdup()=false, it will add to unique table, if true, it will add to duplicate table \$\endgroup\$ Commented Apr 11, 2014 at 7:07

1 Answer 1

2
\$\begingroup\$

You should never run a query in a loop, especially a loop where the number of queries issued scales with the size of the data. There is almost always a way to formulate the SQL such that you get the results you want with a small, fixed number of queries.

In your case, you want two queries: one to find rows that are unique (ignoring the indx column), and another to find the rows that appear more than once (ignoring the indx column). You could formulate those queries as:

SELECT data_lname, data_fname, data_mname, data_dob, data_mobile
 FROM tbl_data
 GROUP BY data_lname, data_fname, data_mname, data_dob, data_mobile
 HAVING COUNT(indx) = 1
 ORDER BY indx;
SELECT data_lname, data_fname, data_mname, data_dob, data_mobile
 FROM tbl_data
 GROUP BY data_lname, data_fname, data_mname, data_dob, data_mobile
 HAVING COUNT(indx) > 1
 ORDER BY indx;

Use those queries to populate lstUnique and lstDup, respectively.


Edit: I read the code carelessly, and misinterpreted the criteria for considering two records to be "duplicates". I would reformulate the query close to the way your VB code worked.

To find the unique records:

SELECT *
 FROM tbl_data AS a
 WHERE NOT EXISTS (
 SELECT indx
 FROM tbl_data AS b
 WHERE
 a.indx <> b.indx
 AND a.data_lname = b.data_lname
 AND a.data_fname = b.data_fname
 AND a.data_mname = b.data_mname
 AND (a.data_dob = b.data_dob OR a.data_mobile = b.data_mobile)
 );

To find the records with duplicates, change WHERE NOT EXISTS to WHERE EXISTS.

This assumes that none of the fields can have a NULL value.


For performance, be sure that indexes exist on the table. I assume that indx, being the primary key, already has a UNIQUE INDEX. In addition, there should be an index on (data_lname, data_fname, data_mname).

answered Apr 11, 2014 at 7:27
\$\endgroup\$
1
  • 1
    \$\begingroup\$ I think that should do the trick. But the next problem will be the GROUP BY. You see, on my code, there are 2 criteria for duplicates, 1. same name & dob regardless of mobile#, OR 2. same name & mobile# regardless of dob. How do I do that with group by? TIA you've been really helpful so far \$\endgroup\$ Commented Apr 11, 2014 at 7:48

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.