I work for an organization that has many databases containing personal information, and the data quality is poor. The worst case I have found is a surname recorded as:
Mark "Dunno his surname, but it sounded like Lion 'RAR', Ha ha"
There is also a date of birth recorded as 01/09/1499.
That is the extreme end. Most data quality issues come from pressing the wrong key on the keyboard, e.g. Snith instead of Smith (n is next to m on the keyboard).
I am looking for algorithms that can help me with some kind of "fuzzy matching" under these circumstances. Our requirements involve several million records per day. I have looked into "data matching" and found the following algorithms:
SQL SOUNDEX
SQL METAPHONE
Levenshtein Distance
Also, is there such a thing as a possible match for dates of birth, similar to a possible match for a surname where the Levenshtein similarity is 80%?
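As an illustration of what such a score implies, here is a minimal Python sketch (assuming plain ASCII surnames; the function names are just for illustration) that normalizes the Levenshtein distance into a similarity percentage and compares it against an 80% threshold:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                # deletion
                current[j - 1] + 1,             # insertion
                previous[j - 1] + (ca != cb),   # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize the edit distance into a 0..1 similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


# "Snith" vs "Smith": one substitution in five characters -> 0.8,
# i.e. exactly at an 80% threshold.
print(similarity("Snith", "Smith"))
```

At several million records per day a pure-Python loop like this would be too slow, so in practice an optimized library implementation plus some blocking/indexing step would be needed, but the normalization idea is the same.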
Therefore I have two questions:
What algorithms are available besides the three listed above?
What approaches are used to match addresses?
Hmm, that I know of, there is no equivalent for birth dates, though a person is far more likely to be off by a digit (and not a day), so a good starting point is something that checks for the existence of birth dates whose digits are off by one. – Neil, Dec 11, 2014 at 13:21
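A minimal sketch of the check Neil describes, assuming both dates are stored as strings in the same DD/MM/YYYY format (the function name is just for illustration):

```python
def dob_possible_match(a: str, b: str) -> bool:
    """Possible DOB match: same length and at most one differing character,
    i.e. a single mistyped digit (e.g. 01/09/1979 vs 01/09/1989)."""
    if len(a) != len(b):
        return False
    differing = sum(1 for x, y in zip(a, b) if x != y)
    return differing <= 1


print(dob_possible_match("01/09/1979", "01/09/1989"))  # True  - one digit off
print(dob_possible_match("01/09/1979", "23/04/1980"))  # False - too many differences
```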
Asking for a "source of information" means asking for an off-site resource, which is off-topic according to the rules of this site. If you delete that part from your question and make it more specific, I see a chance of it not being closed (for example: are you going to improve the data quality by some manual effort? Or are you just trying to live with that level of quality by using some kind of "fuzzy matching"?). – Doc Brown, Dec 11, 2014 at 13:52
An obvious additional date check would be age: any age over 90 or under 10 might merit additional review. A birth year of 1499 seems unlikely if it's for a living person. – Dan Pichelman, Dec 11, 2014 at 16:09
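A sketch of that kind of plausibility check, assuming the date of birth has already been parsed into a real date and that the 10 to 90 age window is only a starting point:

```python
from datetime import date


def dob_is_plausible(dob, today=None, min_age=10, max_age=90):
    """Flag dates of birth whose implied age falls outside an expected range."""
    today = today or date.today()
    # Subtract one if the birthday has not yet happened this year.
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    return min_age <= age <= max_age


print(dob_is_plausible(date(1499, 9, 1)))   # False - over 500 years old
print(dob_is_plausible(date(1980, 2, 1)))   # True (as of this writing)
```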
@w0051977: String dates could be difficult. You'll have to decide how you want to handle strings like 01/02/1980 vs. 02/01/1980: is it Feb. 01 1980 or Jan. 02 1980? – FrustratedWithFormsDesigner, Dec 11, 2014 at 16:58
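One conservative way to handle that ambiguity is to keep every interpretation that parses as a valid date and flag the record when there is more than one; a sketch, assuming only the two day/month orderings are in play:

```python
from datetime import datetime


def possible_dates(text):
    """Return every valid reading of an ambiguous day/month date string."""
    readings = set()
    for fmt in ("%d/%m/%Y", "%m/%d/%Y"):
        try:
            readings.add(datetime.strptime(text, fmt).date())
        except ValueError:
            pass
    return sorted(readings)


print(possible_dates("01/02/1980"))  # two readings (Jan 2 and Feb 1) -> ambiguous
print(possible_dates("13/02/1980"))  # one reading (Feb 13) - month 13 is impossible
```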
I can still see no real effort to clarify the question, so voting to close as "unclear what you are asking". The thing which is missing is your use case. It is not even clear if you want to validate newly added data (and maybe reject it when it does not look plausible), change/improve the existing data, or just do some searching within the data without changing it. – Doc Brown, Dec 12, 2014 at 9:08
1 Answer
For dates of birth, and people in general, you will need to validate whether that person exists and whether their address is valid.
You can do both of these using third-party services that will:
- Validate if the person is a real person
- Validate and standardize address information
You could do this yourself using the algorithms you have mentioned, but why reinvent the wheel? A home-grown implementation will most likely be flawed.
Most of these services are configurable and allow you to send information such as name, DOB, and phone number, and will then attempt a lookup to see if that person exists.
The same holds true for address information. The service will attempt to standardize the address, locate it (see if it really exists), and then return the standardized address back to you.
In both of these scenarios, no hits may be returned, which means that the person or the address was not found. In those cases you probably have a fake individual or address, like the one in your sample data.
Also, a service might sometimes return multiple matches or less-than-100% matches. For example, if the only piece of valid information is a phone number, it might return multiple people at that phone number.
For less-than-exact matches you might set a confidence threshold that you require, such as 90%. Anything below that number goes into a report that requires manual intervention. Records that don't match at all should be eliminated from your data store, as they are junk data.
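A sketch of that triage step, assuming the matching service returns candidate matches as (record, score) pairs with scores between 0 and 1 (the names here are made up for illustration):

```python
ACCEPT_THRESHOLD = 0.90  # the "confidence factor" mentioned above


def triage(record, candidates):
    """Route a record based on the best match score returned by the service.

    `candidates` is assumed to be a list of (matched_record, score) pairs
    with scores between 0 and 1; an empty list means no hit at all.
    """
    if not candidates:
        return ("reject", record)         # no hit: probably junk data
    best_match, score = max(candidates, key=lambda c: c[1])
    if score >= ACCEPT_THRESHOLD:
        return ("accept", best_match)     # confident match
    return ("manual_review", record)      # below threshold: goes on the report
```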
There are 7 million people to check daily. Therefore I believe I have to reinvent the wheel. – w0051977, Dec 11, 2014 at 23:05