I have a large data file of names and I want to build a similarity (distance) matrix from it. With this matrix I want to find similar names that may or may not belong to the same person, so that I can then compare those rows and check whether other variables match as well.
However, the code I have is quite slow. The data frame has 58797 rows, and some of the names are repeated. Are there other options, or a better way to get the information I'm looking for?
This is the code I have so far:
similar <- list()
for (i in 1:dim(data)[1]) {
  ids <- which(levenshteinSim(data$nomeAlt[i], data$nomeAlt) != 1 &
               levenshteinSim(data$nomeAlt[i], data$nomeAlt) > 0.85)
  # ifelse only returns first element of list, instead use separate if else
  similar[[i]] <- if (length(ids) == 0) NA else ids
  print(i) # to get an update of the progress
}
Basically, the output is a list of row indices, which I can use to look up the names. In a working example I got names such as "ABEL MACEDO ALVES" and "ABEL MACHADO ALVES".
Any suggestion would be appreciated. Thank you!
1 Answer
Here is an implementation of the ideas I had suggested in the comments: store the output of levenshteinSim so it is only called once, and limit the expensive name comparisons to individuals that share the same initials. I hope it helps.
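library(RecordLinkage)  # provides levenshteinSim()

names_vec <- data$nomeAlt
# First letter of every word, e.g. "ABEL MACEDO ALVES" -> "A M A"
initials <- gsub("\\b(.).*?\\b", "\\1", names_vec, perl = TRUE)
similar <- vector("list", length(names_vec))
for (i in seq_along(names_vec)) {
  # Row indices of the names that share the initials of name i
  ini <- which(initials == initials[i])
  # Call levenshteinSim once and reuse the result for both conditions
  sim <- levenshteinSim(names_vec[i], names_vec[ini])
  idx <- which(sim > 0.85 & sim != 1)
  # Map positions within the subset back to row indices of the full vector
  similar[[i]] <- if (length(idx) == 0) NA else ini[idx]
  print(i) # to get an update of the progress
}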
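As a possible next step in the same direction, the names could be grouped by initials up front, so that each group's pairwise similarities are computed once rather than once per row. The sketch below is only an illustration under a few assumptions: the five names are made-up toy data standing in for data$nomeAlt, base R's adist() is swapped in for levenshteinSim so the example runs without extra packages, and the similarity is assumed to be 1 minus the Levenshtein distance divided by the length of the longer string.

```r
# Toy data standing in for data$nomeAlt (hypothetical values).
names_vec <- c("ABEL MACEDO ALVES", "ABEL MACHADO ALVES",
               "ANA MARIA ALVES", "BRUNO COSTA", "BRUNO COSTA")

# First letter of every word, e.g. "A M A".
initials <- gsub("\\b(.).*?\\b", "\\1", names_vec, perl = TRUE)

# Row indices grouped by initials; names are only compared within a group.
groups <- split(seq_along(names_vec), initials)

similar <- vector("list", length(names_vec))
for (g in groups) {
  nm <- names_vec[g]
  # All pairwise Levenshtein distances within the group, computed once.
  d <- adist(nm)
  # Assumed similarity: 1 - distance / length of the longer string of each pair.
  len <- nchar(nm)
  sim <- 1 - d / outer(len, len, pmax)
  for (k in seq_along(g)) {
    hit <- which(sim[k, ] > 0.85 & sim[k, ] != 1)
    # Map positions within the group back to row indices of names_vec.
    similar[[g[k]]] <- if (length(hit) == 0) NA else g[hit]
  }
}

# Look up the candidate matches for the first name.
names_vec[similar[[1]]]
```

Each group needs a square matrix of its own size, so this only pays off if no single set of initials covers a very large share of the rows; for such groups the row-by-row loop above may be the safer choice.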
Thanks! I had no idea storing the output of levenshteinSim would make such a difference. – psoares, commented Nov 18, 2017 at 18:43