\$\begingroup\$

I have a large data file of names and I want to build a similarity (distance) matrix from it. With this matrix I want to find similar names that might belong to the same person, so that I can compare those rows and check whether other variables match as well.

However, the code I have is quite slow. The data frame has 58797 rows, and some of them contain repeated names. I am looking for other options or a better way to get this information.

This is the code I have so far:

similar <- list()
for (i in 1:dim(data)[1]) {
 ids <- which(levenshteinSim(data$nomeAlt[i], data$nomeAlt) != 1 & 
 levenshteinSim(data$nomeAlt[i], data$nomeAlt) > 0.85)
 # ifelse only returns first element of list, instead use separate if else
 similar[[i]] <- if (length(ids) == 0) NA else ids
 print(i) # to get an update of the progress
} 

Basically, the output is a list of row numbers, which I can use to look up the names. In a working example I got names such as "ABEL MACEDO ALVES" and "ABEL MACHADO ALVES".
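
To illustrate the shape of that output, here is a minimal sketch of how the stored row numbers can be mapped back to names. The three-name vector and the `similar` list are hypothetical, hand-written stand-ins (not real output from the data), and `similar_names` is a name introduced only for this example. Note that the loop stores a logical `NA` for rows without matches, and indexing a vector with a logical `NA` returns one `NA` per element, so the sketch handles that case explicitly.

```r
# Hypothetical toy data standing in for data$nomeAlt
nomeAlt <- c("ABEL MACEDO ALVES", "ABEL MACHADO ALVES", "JOAO SILVA")

# Hand-written example of the list the loop builds: row numbers, or NA
similar <- list(2L, 1L, NA)

# Look up the names behind each set of row numbers.
# Entries without matches are returned as a single NA_character_.
similar_names <- lapply(similar, function(ix) {
  if (all(is.na(ix))) NA_character_ else nomeAlt[ix]
})
```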

Any suggestion would be appreciated. Thank you!

asked Nov 17, 2017 at 14:04
\$\endgroup\$

1 Answer

\$\begingroup\$

Here is an implementation of the ideas I had suggested in the comments: to store the output of levenshteinSim so it is only called once, and to limit the expensive name comparisons to individuals that share the same initials. I hope it helps.

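library(RecordLinkage) # assumed source of levenshteinSim
names_vec <- data$nomeAlt
# keep only the first character of each word (the initials)
initials <- gsub("\\b(.).*?\\b", "\\1", names_vec, perl = TRUE)
similar <- vector("list", length(names_vec))
for (i in seq_along(names_vec)) {
 ini <- which(initials == initials[i]) # rows sharing the same initials
 sim <- levenshteinSim(names_vec[i], names_vec[ini]) # computed once per row
 idx <- which(sim > 0.85 & sim != 1)
 # map positions within the block back to row numbers in data
 similar[[i]] <- if (length(idx) == 0) NA else ini[idx]
 print(i) # to get an update of the progress
}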
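
As a further sketch of the same blocking idea, the comparisons can also be done one block at a time instead of one row at a time. This version uses base R's `utils::adist` rather than `levenshteinSim`, and it assumes the similarity is defined as 1 minus the edit distance divided by the length of the longer string, which may not match `levenshteinSim` exactly. The function name `find_similar_names`, its `threshold` argument, and the toy names are hypothetical. It works on unique names only, since the question notes that some names are repeated and identical names are excluded anyway.

```r
# Hypothetical helper: within each block of names sharing the same initials,
# return the pairs of distinct names whose similarity exceeds the threshold.
find_similar_names <- function(names_vec, threshold = 0.85) {
  uniq <- unique(names_vec)  # repeated names add no new pairs
  # First letter of every word, pasted together
  initials <- vapply(strsplit(uniq, " +"),
                     function(w) paste(substr(w, 1, 1), collapse = ""),
                     character(1))
  blocks <- split(uniq, initials)
  pairs <- lapply(blocks, function(block) {
    if (length(block) < 2) return(NULL)
    d <- utils::adist(block)               # pairwise edit distances
    len <- nchar(block)
    sim <- 1 - d / outer(len, len, pmax)   # assumed similarity definition
    # upper triangle only: each pair once, diagonal excluded
    hit <- which(sim > threshold & upper.tri(sim), arr.ind = TRUE)
    if (nrow(hit) == 0) return(NULL)
    data.frame(name1 = block[hit[, 1]], name2 = block[hit[, 2]],
               sim = sim[hit], stringsAsFactors = FALSE)
  })
  pairs <- Filter(Negate(is.null), unname(pairs))
  if (length(pairs) == 0) NULL else do.call(rbind, pairs)
}

# Toy example with made-up names
toy <- c("ABEL MACEDO ALVES", "ABEL MACHADO ALVES", "ANA MARIA ALVES",
         "ABEL MACEDO ALVES", "JOAO SILVA")
res <- find_similar_names(toy)
```

The result is a data frame of name pairs rather than a list of row numbers, so matching rows in `data` would still need to be looked up, for example with `which(data$nomeAlt %in% c(res$name1, res$name2))`.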
answered Nov 18, 2017 at 14:10
\$\endgroup\$
  • Thanks! I had no idea storing the output of levenshteinSim would make such a difference. Commented Nov 18, 2017 at 18:43
