0
$\begingroup$

The focus of this question is on natural language processing, specifically matching names between 2 lists. I am looking at employees that work in the same organization, however I obtained data from two different databases. Unfortunately, there is no unique key or ID that matches the users between lists, so I have to rely upon a string matching between names.

The challenge is that the names on the two lists are approximately similar and there are different levels of noise depending on the name. So I know that the bias should be for matches between names, however it is hard to impose a threshold on similarity because the names different in length and the amount of noise in each name.

For example, here is a little sample of some simulated data.

List 1:

  • Jim Smith
  • J. Smith
  • Carol Sanchez
  • Shantanu Vishwanathan

List 2:

  • J. Smith.
  • C. Sanchez
  • Shantannu Vishvanathan
  • S. Vishvanathan.

I use the Julia language, but the language itself is not important. I really just want to figure out a good algorithm to do this type of matching. There are a lot of names to go through, so I am trying to limit the amount of manual checking and such that I need to do.

When I reference a string distance, I mean something like the Levenshtein distance or Jaccard distance between two strings. There are various metrics that might work.

So the naive approach is something like:

  1. remove punctuation and spaces from first and last names in both list 1 and list 2.
  2. compute string distance between the last name of the first entry (source) in list 1 and all of the last names of the entries on list 2 (called targets). a. collect the 3 target names with the lowest distance scores into a new vector called the subsample vector. b. compute whether the first letter of the first names match between the source name and target first names in the subsample vector. Remove any names from the subsample vector that start with a different letter than the source first name. c. assign a match if the string distance between the first name of the target and the first name of the source is less than some threshold.

So this is a very naive algorithm. I can add some more nuance by using a sum of different string distances instead of only a single string distance, etc. So I have some options there.

But I was not sure if there was a better way or algorithm to do something like this? Like would it make sense to think of this problem like a constraint programming problem or such. That was just one idea that I had. Also, would it make sense to find the longest common substring between last names or such. I can try a few different approaches and benchmark the accuracy, that goes without saying. But just wanted to see if there was some obvious algorithm that I was just not aware of.

If anyone has a suggestion, please let me know. Hopefully this question conforms with the guidelines of the CS stack exchange. I tried to focus on the specific algorithm and also referenced the NLP nature of the problem.

asked Mar 9 at 2:23
$\endgroup$
1

0

Know someone who can answer? Share a link to this question via email, Twitter, or Facebook.

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.