I have a MySQL table with a very large amount of data. I need to find the duplicates and separate them from the unique rows.
Let's say this is my table:
indx, data_lname, data_fname, data_mname, data_dob, data_mobile
What I'm doing right now is selecting all the rows in one query and then comparing each row against the same table with a second query. This works, but it is painfully slow.
Can this be done with a single query?
Private Sub poplist()
    DBstrSQL = "SELECT * FROM tbl_data ORDER BY indx ASC"
    Dim myCmd As New MySqlCommand
    myCmd.CommandTimeout = 300
    myCmd.CommandText = DBstrSQL
    myCmd.Connection = MySqlConn
    Dim myReader As MySqlDataReader = myCmd.ExecuteReader()
    If myReader.HasRows Then
        While myReader.Read()
            If checkifdup(myReader.GetString("indx"),
                          myReader.GetString("data_fname"),
                          myReader.GetString("data_mname"),
                          myReader.GetString("data_lname"),
                          myReader.GetString("data_dob"),
                          myReader.GetString("data_mobile")) = False Then
                With lstUnique.Items.Add(myReader.GetString("data_lname"))
                    .SubItems.Add(myReader.GetString("data_fname"))
                    .SubItems.Add(myReader.GetString("data_mname"))
                    .SubItems.Add(myReader.GetString("data_dob"))
                    .SubItems.Add(myReader.GetString("data_mobile"))
                End With
            Else
                With lstDup.Items.Add(myReader.GetString("data_lname"))
                    .SubItems.Add(myReader.GetString("data_fname"))
                    .SubItems.Add(myReader.GetString("data_mname"))
                    .SubItems.Add(myReader.GetString("data_dob"))
                    .SubItems.Add(myReader.GetString("data_mobile"))
                End With
            End If
        End While
    End If
    myReader.Close()
End Sub
Private Function checkifdup(dataindx As String, data1 As String, data2 As String,
                            data3 As String, data4 As String, data5 As String) As Boolean
    myCmd.CommandText = "SELECT * FROM tbl_data WHERE indx <> @indx AND data_lname = @lname AND data_mname = @mname AND data_fname = @fname AND (DATE_FORMAT(data_dob, '%m-%d-%Y') = @dob OR data_mobile = @mobile) LIMIT 1"
    myCmd.Parameters.Clear() ' remove parameters left over from the previous call
    myCmd.Parameters.AddWithValue("@lname", data3)
    myCmd.Parameters.AddWithValue("@mname", data2)
    myCmd.Parameters.AddWithValue("@fname", data1)
    myCmd.Parameters.AddWithValue("@dob", data4)
    myCmd.Parameters.AddWithValue("@mobile", data5)
    myCmd.Parameters.AddWithValue("@indx", dataindx)
    myCmd.Prepare()
    Dim myReader As MySqlDataReader = myCmd.ExecuteReader()
    If myReader.HasRows Then
        myReader.Close()
        Return True
    Else
        myReader.Close()
        Return False
    End If
End Function
Comment (user40501, Apr 11, 2014 at 6:23): It is working, but it takes a lot of time to complete: about 1 hour for 20k records. What I'm wondering is whether there is another way of doing this that yields faster results. The code is quite simple: the first query selects a range of records, say the first 20k rows, then loops through each row and cross-checks it against the table with another query.
Comment (user40501, Apr 11, 2014 at 7:07): Code edited. Basically, if checkifdup() = False the row is added to the unique list; if True, it is added to the duplicate list.
1 Answer
You should never run a query in a loop, especially a loop where the number of queries issued scales with the size of the data. There is almost always a way to formulate the SQL such that you get the results you want with a small, fixed number of queries.
In your case, you want two queries: one to find the rows that are unique (ignoring the indx column), and another to find the rows that appear more than once (again ignoring the indx column). You could formulate those queries as:
SELECT data_lname, data_fname, data_mname, data_dob, data_mobile
FROM tbl_data
GROUP BY data_lname, data_fname, data_mname, data_dob, data_mobile
HAVING COUNT(indx) = 1
ORDER BY MIN(indx);

SELECT data_lname, data_fname, data_mname, data_dob, data_mobile
FROM tbl_data
GROUP BY data_lname, data_fname, data_mname, data_dob, data_mobile
HAVING COUNT(indx) > 1
ORDER BY MIN(indx);

(Note that ORDER BY must use an aggregate such as MIN(indx) here, since indx itself is not in the GROUP BY list.)
Use those queries to populate lstUnique and lstDup, respectively.
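To see how the GROUP BY ... HAVING pattern partitions the data, here is a small sketch. It uses Python's standard-library sqlite3 in place of MySQL (the sample rows are invented for illustration); these grouping constructs behave the same way in both engines:

```python
import sqlite3

# Stand-in for the MySQL table; sample rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tbl_data (
        indx        INTEGER PRIMARY KEY,
        data_lname  TEXT, data_fname TEXT, data_mname TEXT,
        data_dob    TEXT, data_mobile TEXT
    )
""")
conn.executemany("INSERT INTO tbl_data VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "Smith", "John", "A", "1990-01-01", "555-0001"),
    (2, "Smith", "John", "A", "1990-01-01", "555-0001"),  # exact duplicate of row 1
    (3, "Doe",   "Jane", "B", "1985-05-05", "555-0002"),  # appears only once
])

cols = "data_lname, data_fname, data_mname, data_dob, data_mobile"
# Groups containing exactly one row are the unique records...
unique_rows = conn.execute(
    f"SELECT {cols} FROM tbl_data GROUP BY {cols} HAVING COUNT(indx) = 1"
).fetchall()
# ...and groups containing more than one row are the duplicated records.
dup_rows = conn.execute(
    f"SELECT {cols} FROM tbl_data GROUP BY {cols} HAVING COUNT(indx) > 1"
).fetchall()
```

One thing to notice: each duplicated group comes back as a single row, not once per duplicate, so the two Smith rows above produce one entry in dup_rows.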
Edit: I read the code carelessly, and misinterpreted the criteria for considering two records to be "duplicates". I would reformulate the query close to the way your VB code worked.
To find the unique records:
SELECT *
FROM tbl_data AS a
WHERE NOT EXISTS (
    SELECT indx
    FROM tbl_data AS b
    WHERE a.indx <> b.indx
      AND a.data_lname = b.data_lname
      AND a.data_fname = b.data_fname
      AND a.data_mname = b.data_mname
      AND (a.data_dob = b.data_dob OR a.data_mobile = b.data_mobile)
);
To find the records with duplicates, change WHERE NOT EXISTS to WHERE EXISTS.
This assumes that none of the fields can have a NULL value.
For performance, be sure that indexes exist on the table. I assume that indx, being the primary key, already has a unique index. In addition, there should be an index on (data_lname, data_fname, data_mname).
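The EXISTS formulation can be sketched the same way. Again using sqlite3 as a stand-in for MySQL with invented sample rows (and an illustrative index name), this shows both duplicate criteria firing: same name and DOB, or same name and mobile:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tbl_data (
        indx        INTEGER PRIMARY KEY,
        data_lname  TEXT, data_fname TEXT, data_mname TEXT,
        data_dob    TEXT, data_mobile TEXT
    )
""")
# The composite index suggested above; the index name is chosen here for illustration.
conn.execute("CREATE INDEX idx_name ON tbl_data (data_lname, data_fname, data_mname)")
conn.executemany("INSERT INTO tbl_data VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "Smith", "John", "A", "1990-01-01", "555-0001"),
    (2, "Smith", "John", "A", "1990-01-01", "555-9999"),  # same name + dob as row 1
    (3, "Smith", "John", "A", "1970-07-07", "555-0001"),  # same name + mobile as row 1
    (4, "Doe",   "Jane", "B", "1985-05-05", "555-0002"),  # matches nothing
])

# Correlated subquery: does some OTHER row match on name plus (dob OR mobile)?
match = """
    SELECT indx FROM tbl_data AS b
    WHERE a.indx <> b.indx
      AND a.data_lname = b.data_lname
      AND a.data_fname = b.data_fname
      AND a.data_mname = b.data_mname
      AND (a.data_dob = b.data_dob OR a.data_mobile = b.data_mobile)
"""
unique_rows = conn.execute(
    f"SELECT * FROM tbl_data AS a WHERE NOT EXISTS ({match})").fetchall()
dup_rows = conn.execute(
    f"SELECT * FROM tbl_data AS a WHERE EXISTS ({match})").fetchall()
```

Here rows 1, 2, and 3 all land in dup_rows (1 and 2 share a DOB, 1 and 3 share a mobile number), and only row 4 is unique; unlike the GROUP BY version, every duplicated row is returned individually, with its indx intact.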
Comment (user40501, Apr 11, 2014 at 7:48): I think that should do the trick. But the next problem will be the GROUP BY. You see, in my code there are two criteria for duplicates: (1) same name and DOB regardless of mobile number, or (2) same name and mobile number regardless of DOB. How do I do that with GROUP BY? Thanks in advance, you've been really helpful so far.