The focus of this question is on natural language processing, specifically matching names between 2 lists. I am looking at employees that work in the same organization, however I obtained data from two different databases. Unfortunately, there is no unique key or ID that matches the users between lists, so I have to rely upon a string matching between names.
The challenge is that the names on the two lists are approximately similar and there are different levels of noise depending on the name. So I know that the bias should be for matches between names, however it is hard to impose a threshold on similarity because the names different in length and the amount of noise in each name.
For example, here is a little sample of some simulated data.
List 1:
- Jim Smith
- J. Smith
- Carol Sanchez
- Shantanu Vishwanathan
List 2:
- J. Smith.
- C. Sanchez
- Shantannu Vishvanathan
- S. Vishvanathan.
I use the Julia language, but the language itself is not important. I really just want to figure out a good algorithm to do this type of matching. There are a lot of names to go through, so I am trying to limit the amount of manual checking and such that I need to do.
When I reference a string distance, I mean something like the Levenshtein distance or Jaccard distance between two strings. There are various metrics that might work.
So the naive approach is something like:
- remove punctuation and spaces from first and last names in both list 1 and list 2.
- compute string distance between the last name of the first entry (source) in list 1 and all of the last names of the entries on list 2 (called targets). a. collect the 3 target names with the lowest distance scores into a new vector called the subsample vector. b. compute whether the first letter of the first names match between the source name and target first names in the subsample vector. Remove any names from the subsample vector that start with a different letter than the source first name. c. assign a match if the string distance between the first name of the target and the first name of the source is less than some threshold.
So this is a very naive algorithm. I can add some more nuance by using a sum of different string distances instead of only a single string distance, etc. So I have some options there.
But I was not sure if there was a better way or algorithm to do something like this? Like would it make sense to think of this problem like a constraint programming problem or such. That was just one idea that I had. Also, would it make sense to find the longest common substring between last names or such. I can try a few different approaches and benchmark the accuracy, that goes without saying. But just wanted to see if there was some obvious algorithm that I was just not aware of.
If anyone has a suggestion, please let me know. Hopefully this question conforms with the guidelines of the CS stack exchange. I tried to focus on the specific algorithm and also referenced the NLP nature of the problem.
-
$\begingroup$ you can try LLM: chat.deepseek.com/a/chat/s/69060b16-8889-45f8-ade4-85536039d1f8 $\endgroup$Bulat– Bulat2025年03月09日 22:29:45 +00:00Commented Mar 9 at 22:29