I recently heard the statistic that "87% of the US population can be uniquely identified by the tuple of their zip code, birth date and gender". This is apparently not true, and I was wondering how I would verify it if I had the census data. So, imagining I had a 300-million-line unsorted text file containing the gender, zip code and birth date of each person living in the US, what would be the quickest way of knowing what percentage of the population is uniquely identifiable by that tuple?
This should be a matter of identifying what percentage of the entries are duplicated in the dataset, but what would be a good way to go about it? I'm interested in useful algorithms and efficient data structures, and speed is more important than memory consumption as long as the latter is kept to a reasonable level.
2 Answers
SQL solution
You could load all the demographic data into an SQL database:
CREATE TABLE PERSON(Id integer PRIMARY KEY, zip text, birth date, gender char /*... */);
...
Unfortunately, the file-import statement is not standard SQL; it's BULK INSERT for SQL Server, LOAD DATA INFILE for MySQL, and SQL*Loader for Oracle.
The easiest and most efficient way would then be to use aggregate functions with a GROUP BY clause to count the number of persons sharing the same values for the grouping columns, keeping only the groups with duplicates by means of a HAVING clause:
SELECT zip, birth, gender, count(*) FROM PERSON
GROUP BY zip, birth, gender
HAVING count(*)>1;
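As a quick way to try the query above without a full database install, here is a minimal sketch using Python's built-in sqlite3 module; the table layout follows the answer, but the sample rows are made up:

```python
import sqlite3

# In-memory database with a handful of made-up records
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PERSON(Id INTEGER PRIMARY KEY, zip TEXT, birth TEXT, gender TEXT)"
)
rows = [
    ("02139", "1980-01-01", "F"),
    ("02139", "1980-01-01", "F"),  # duplicate tuple
    ("02139", "1980-01-01", "M"),
    ("90210", "1975-06-15", "M"),
]
conn.executemany("INSERT INTO PERSON(zip, birth, gender) VALUES (?, ?, ?)", rows)

# Tuples shared by more than one person
dups = conn.execute(
    "SELECT zip, birth, gender, COUNT(*) FROM PERSON "
    "GROUP BY zip, birth, gender HAVING COUNT(*) > 1"
).fetchall()
print(dups)  # [('02139', '1980-01-01', 'F', 2)]
```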
Sorted file solution
You could also get your census file sorted by zip, birth and gender. Then read the records one by one, compare each to the previous one, and count how many consecutive records share the same values, emitting the count whenever the values change.
Pseudocode:
lastrecord = { };
counter = 1;
while there's a record to read {
    read record
    if (record.zip == lastrecord.zip
            and record.birth == lastrecord.birth
            and record.gender == lastrecord.gender) {
        counter = counter + 1;
    }
    else {
        if (counter > 1) { // output the count of duplicates
            write lastrecord.zip, lastrecord.birth, lastrecord.gender, counter
        }
        counter = 1;
    }
    lastrecord = record;
}
if (counter > 1) { // output the count of duplicates for the last group
    write lastrecord.zip, lastrecord.birth, lastrecord.gender, counter
}
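A direct transcription of the pseudocode in Python, as a generator over any sequence of records already sorted by the three fields (the record shape is an assumption):

```python
def duplicate_counts(sorted_records):
    """Yield (record, count) for every tuple appearing more than once.

    Assumes the records are already sorted by (zip, birth, gender)."""
    last = None
    counter = 1
    for record in sorted_records:
        if record == last:
            counter += 1
        else:
            if last is not None and counter > 1:
                yield last, counter  # output the count of duplicates
            counter = 1
        last = record
    if last is not None and counter > 1:
        yield last, counter  # count of duplicates for the last group

rows = [("02139", "19800101", "F"),
        ("02139", "19800101", "F"),
        ("02139", "19800101", "M")]
print(list(duplicate_counts(rows)))  # [(('02139', '19800101', 'F'), 2)]
```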
Associative map
A last way would be to read each record as it comes and store the tuple of the 3 values in a map: store 1 if the tuple is not yet present, or increment its value if it already is. At the end, iterate through the map and process the entries whose count is greater than 1. Granted, this one will cost you some memory ;-)
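In Python this map is essentially a collections.Counter; a sketch over made-up records standing in for the census file:

```python
from collections import Counter

# Made-up (zip, birth, gender) records standing in for the census file
records = [
    ("02139", "19800101", "F"),
    ("02139", "19800101", "F"),
    ("02139", "19800101", "M"),
    ("90210", "19750615", "M"),
]

# Counter stores 1 on first sight and increments on every repeat
counts = Counter(records)

# Tuples shared by more than one person
dupes = {t: c for t, c in counts.items() if c > 1}
print(dupes)  # {('02139', '19800101', 'F'): 2}
```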
I'm not sure this would fit in memory, but here is some pseudocode. You could pack the whole tuple into one integer, as a perfect hash of the form 12345YYYYMMDD0 / 12345YYYYMMDD1 (zip code, then birth date, then a final gender digit).
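A sketch of that packing in Python; the digit layout follows the answer's 12345YYYYMMDD0 example (5-digit zip, 8-digit date, 1 gender digit), which at 14 decimal digits fits comfortably in a 64-bit integer:

```python
def pack(zip5, yyyymmdd, gender_bit):
    """Pack a 5-digit zip, a YYYYMMDD birth date and a gender bit into
    one integer of the form ZZZZZYYYYMMDDG (fits in 64 bits)."""
    return (int(zip5) * 100_000_000 + int(yyyymmdd)) * 10 + gender_bit

print(pack("12345", "19801231", 0))  # 12345198012310
print(pack("12345", "19801231", 1))  # 12345198012311
```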
Dictionary
Dictionary<string, int> dic = new Dictionary<string, int>();
string zipBirthSex;
while ((zipBirthSex = file.ReadLine()) != null)
{
    if (dic.ContainsKey(zipBirthSex))
        dic[zipBirthSex]++;
    else
        dic.Add(zipBirthSex, 1);
}
Dictionary<int, int> dic2 = new Dictionary<int, int>();
foreach (var kvp in dic)
{
    if (dic2.ContainsKey(kvp.Value))
        dic2[kvp.Value]++;
    else
        dic2.Add(kvp.Value, 1);
}
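The second map is a histogram of group sizes, and from it you can read off the uniquely identifiable share directly. A Python sketch with made-up tuple counts:

```python
from collections import Counter

# Made-up tuple counts: (zip, birth, gender) -> number of people sharing it
counts = {("02139", "19800101", "F"): 2,
          ("02139", "19800101", "M"): 1,
          ("90210", "19750615", "M"): 1}

# Histogram: group size -> number of tuples with that size
hist = Counter(counts.values())

total_people = sum(counts.values())   # 4
unique_people = hist.get(1, 0)        # people in groups of size 1
pct = 100.0 * unique_people / total_people
print(pct)  # 50.0
```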
In SQL I would use ints to save space, converting dates to yyyymmdd:
CREATE TABLE PERSON(zip int, date int, Sex bit);
insert ...
SELECT zip, date, Sex, count(*)
FROM PERSON
GROUP BY zip, date, Sex;
select zipCode, birthDate, gender, count(1)
from censusData
group by zipCode, birthDate, gender;