I work for an organization that has many databases containing personal information, and the data quality is poor. The worst case I have found is a surname recorded as:
Mark "Dunno his surname, but it sounded like Lion 'RAR', Ha ha"
There is also a date of birth recorded as 01/09/1499.
That is the extreme end. Most data quality issues come from pressing the wrong key on the keyboard, e.g. Snith instead of Smith (n is next to m on the keyboard).
I am looking for algorithms that can help me with some kind of "fuzzy matching" under these circumstances. Our requirements involve several million records per day. I have looked into "data matching" and found the following algorithms:
SQL SOUNDEX
SQL METAPHONE
Levenshtein Distance
Also, is there such a thing as a possible match for dates of birth, similar to a possible match for a surname where the Levenshtein similarity is 80%?
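As an illustration of what such a score implies, here is a minimal Python sketch (assuming plain ASCII surnames; the function names are just for illustration) that normalizes the Levenshtein distance into a similarity percentage and compares it against an 80% threshold:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                # deletion
                current[j - 1] + 1,             # insertion
                previous[j - 1] + (ca != cb),   # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize the edit distance into a 0..1 similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


# "Snith" vs "Smith": one substitution in five characters -> 0.8,
# i.e. exactly at an 80% threshold.
print(similarity("Snith", "Smith"))
```

At several million records per day a pure-Python loop like this would be too slow, so in practice an optimized library implementation plus some blocking/indexing step would be needed, but the normalization idea is the same.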
Therefore I have two questions:
What algorithms are available besides the three listed above?
What approaches are used to match addresses?
Hmm, that I know of, there is no equivalent for birth dates, though a person is far more likely to be off by a digit (and not a day), so a good starting point is something that checks for the existence of birth dates whose digits are off by one. – Neil, Dec 11, 2014 at 13:21
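A minimal sketch of the check Neil describes, assuming both dates are stored as strings in the same DD/MM/YYYY format (the function name is just for illustration):

```python
def dob_possible_match(a: str, b: str) -> bool:
    """Possible DOB match: same length and at most one differing character,
    i.e. a single mistyped digit (e.g. 01/09/1979 vs 01/09/1989)."""
    if len(a) != len(b):
        return False
    differing = sum(1 for x, y in zip(a, b) if x != y)
    return differing <= 1


print(dob_possible_match("01/09/1979", "01/09/1989"))  # True  - one digit off
print(dob_possible_match("01/09/1979", "23/04/1980"))  # False - too many differences
```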
Asking for a "source of information" means asking for an off-site resource, which is off-topic according to the rules of this site. If you delete that part from your question and make it more specific, I see a chance of it not being closed (for example: are you going to improve the data quality by some manual effort? Or are you just trying to live with that level of quality by using some kind of "fuzzy matching"?). – Doc Brown, Dec 11, 2014 at 13:52
An obvious additional date check would be age: any age over 90 or under 10 might merit additional review. A birth year of 1499 seems unlikely if it's for a living person. – Dan Pichelman, Dec 11, 2014 at 16:09
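A sketch of that kind of plausibility check, assuming the date of birth has already been parsed into a real date and that the 10 to 90 age window is only a starting point:

```python
from datetime import date


def dob_is_plausible(dob, today=None, min_age=10, max_age=90):
    """Flag dates of birth whose implied age falls outside an expected range."""
    today = today or date.today()
    # Subtract one if the birthday has not yet happened this year.
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    return min_age <= age <= max_age


print(dob_is_plausible(date(1499, 9, 1)))   # False - over 500 years old
print(dob_is_plausible(date(1980, 2, 1)))   # True (as of this writing)
```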
@w0051977: String dates could be difficult. You'll have to decide how you want to handle strings like 01/02/1980 vs. 02/01/1980: is it Feb. 01 1980 or Jan. 02 1980? – FrustratedWithFormsDesigner, Dec 11, 2014 at 16:58
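One conservative way to handle that ambiguity is to keep every interpretation that parses as a valid date and flag the record when there is more than one; a sketch, assuming only the two day/month orderings are in play:

```python
from datetime import datetime


def possible_dates(text):
    """Return every valid reading of an ambiguous day/month date string."""
    readings = set()
    for fmt in ("%d/%m/%Y", "%m/%d/%Y"):
        try:
            readings.add(datetime.strptime(text, fmt).date())
        except ValueError:
            pass
    return sorted(readings)


print(possible_dates("01/02/1980"))  # two readings (Jan 2 and Feb 1) -> ambiguous
print(possible_dates("13/02/1980"))  # one reading (Feb 13) - month 13 is impossible
```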
I can still see no real effort to clarify the question, so voting to close as "unclear what you are asking". The thing which is missing is your use case. It is not even clear if you want to validate newly added data (and maybe reject it when it does not look plausible), change/improve the existing data, or just do some searching within the data without changing it. – Doc Brown, Dec 12, 2014 at 9:08
1 Answer
For dates of birth, and people in general, you will need to validate whether that person exists and whether their address is valid.
You can do both of these using third-party services that will:
- Validate if the person is a real person
- Validate and standardize address information
You could do this yourself using the algorithms you have mentioned, but why reinvent the wheel? A home-grown implementation will most likely be flawed.
Most of these services are configurable and allow you to send information such as name, DOB, and phone number, and will then attempt a lookup to see if that person exists.
The same holds true for address information. The service will attempt to standardize the address, locate it (see if it really exists), and then return the standardized address back to you.
In both of these scenarios, no hits may be returned, which means that the person or the address was not found. In those cases you probably have a fake individual or address, like the one in your sample data.
Also, a service might sometimes return multiple matches or less-than-100% matches. For example, if the only piece of valid information is a phone number, it might return multiple people at that phone number.
For less-than-exact matches you might set a confidence threshold that you require, such as 90%. Anything below that number goes into a report that requires manual intervention. Records that don't match at all should be eliminated from your data store, as they are junk data.
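A sketch of that triage step, assuming the matching service returns candidate matches as (record, score) pairs with scores between 0 and 1 (the names here are made up for illustration):

```python
ACCEPT_THRESHOLD = 0.90  # the "confidence factor" mentioned above


def triage(record, candidates):
    """Route a record based on the best match score returned by the service.

    `candidates` is assumed to be a list of (matched_record, score) pairs
    with scores between 0 and 1; an empty list means no hit at all.
    """
    if not candidates:
        return ("reject", record)         # no hit: probably junk data
    best_match, score = max(candidates, key=lambda c: c[1])
    if score >= ACCEPT_THRESHOLD:
        return ("accept", best_match)     # confident match
    return ("manual_review", record)      # below threshold: goes on the report
```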
There are 7 million people to check daily. Therefore I believe I have to reinvent the wheel. – w0051977, Dec 11, 2014 at 23:05