Code to calculate similarity between strings - R

\$\begingroup\$

I have a large data file of names and I want to build a similarity (distance) matrix from it. With this matrix I want to find similar names that may or may not belong to the same person, so that I can compare those rows and check whether other variables match as well.

However, the code I have is quite slow. The data frame has 58,797 rows, some of which are repeated names. I am looking for other options or a better way to get this information.

This is the code I have so far:

library(RecordLinkage)  # provides levenshteinSim()

similar <- list()
for (i in 1:dim(data)[1]) {
  ids <- which(levenshteinSim(data$nomeAlt[i], data$nomeAlt) != 1 &
               levenshteinSim(data$nomeAlt[i], data$nomeAlt) > 0.85)
  # ifelse only returns first element of list, instead use separate if else
  similar[[i]] <- if (length(ids) == 0) NA else ids
  print(i) # to get an update of the progress
}

Basically, the output is a list of row numbers, from which I can look up the names. In a working example I got names such as "ABEL MACEDO ALVES" and "ABEL MACHADO ALVES".
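
To make the 0.85 threshold concrete, here is a minimal sketch of how the score for that pair can be reproduced with base R's adist(). It assumes that levenshteinSim() (from the RecordLinkage package) is one minus the edit distance divided by the length of the longer string; lev_sim is a hypothetical helper written for this illustration, not part of any package.

```r
# Hypothetical helper: Levenshtein similarity built on base R's adist().
# Assumption: this mirrors levenshteinSim(), i.e. 1 - distance / longer length.
lev_sim <- function(a, b) {
  d <- as.vector(adist(a, b))  # edit distances from `a` to each element of `b`
  1 - d / pmax(nchar(a), nchar(b))
}

score <- lev_sim("ABEL MACEDO ALVES", "ABEL MACHADO ALVES")
print(round(score, 3))  # 2 edits over 18 characters, so about 0.889
```

Two edits (E replaced by H, plus an inserted A) separate the names, so the pair clears the 0.85 cutoff without being an exact duplicate.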

Any suggestions would be appreciated. Thank you!

asked Nov 17, 2017 at 14:04

psoares

\$\endgroup\$

1 Answer

\$\begingroup\$

Here is an implementation of the ideas I suggested in the comments: store the output of levenshteinSim so that it is called only once per name, and limit the expensive name comparisons to individuals who share the same initials. I hope it helps.

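library(RecordLinkage)  # provides levenshteinSim()

names_vec <- data$nomeAlt
# keep the first character of each word, e.g. "ABEL MACEDO ALVES" -> "A M A"
initials <- gsub("\\b(.).*?\\b", "\\1", names_vec, perl = TRUE)
similar <- vector("list", length(names_vec))
for (i in seq_along(names_vec)) {
  ini <- which(initials == initials[i])  # rows that share the same initials
  sim <- levenshteinSim(names_vec[i], names_vec[ini])  # computed once, then reused
  idx <- which(sim > 0.85 & sim != 1)
  similar[[i]] <- if (length(idx) == 0) NA else ini[idx]  # map back to row numbers
  print(i) # to get an update of the progress
}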
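
The same two ideas can also be sketched by grouping the row numbers by initials once and comparing names only inside each group. This is a minimal sketch on made-up names, not a drop-in replacement: it uses base R's adist() in place of levenshteinSim() (assuming the similarity is one minus the edit distance over the longer length), and toy_names and find_similar are hypothetical names introduced for illustration.

```r
# Toy data, made up for illustration.
toy_names <- c("ABEL MACEDO ALVES", "ABEL MACHADO ALVES",
               "ANA MARIA ALVES", "ABEL MACEDO ALVES", "BRUNO COSTA")

find_similar <- function(names_vec, threshold = 0.85) {
  # First character of each word, e.g. "ABEL MACEDO ALVES" -> "A M A".
  initials <- gsub("\\b(.).*?\\b", "\\1", names_vec, perl = TRUE)
  similar <- rep(list(NA), length(names_vec))
  # Row numbers grouped by shared initials.
  for (rows in split(seq_along(names_vec), initials)) {
    if (length(rows) < 2) next  # nothing to compare against
    block <- names_vec[rows]
    # Pairwise edit distances within the group, computed once.
    d <- adist(block)
    len <- nchar(block)
    sim <- 1 - d / outer(len, len, pmax)
    for (k in seq_along(rows)) {
      hit <- which(sim[k, ] > threshold & sim[k, ] != 1)
      # Translate positions inside the group back to original row numbers.
      if (length(hit) > 0) similar[[rows[k]]] <- rows[hit]
    }
  }
  similar
}

result <- find_similar(toy_names)
print(result)
```

On the toy data, rows 1 and 4 are each matched with row 2, row 2 is matched with rows 1 and 4, and the remaining rows get NA. Exact duplicates (similarity of 1) are excluded, as in the loop above. Very large groups would still produce a large pairwise matrix, so whether this is faster on the real data is something to measure rather than assume.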

answered Nov 18, 2017 at 14:10

flodel

\$\endgroup\$

Thanks! I had no idea storing the output of levenshteinSim would make such a difference.

– psoares, commented Nov 18, 2017 at 18:43
