12

I would like to know if there is a library that will tell me approximately how similar two strings are

I am not looking for anything specific, but in this case:

a = 'alex is a buff dude'
b = 'a;exx is a buff dud'

we could say that b and a are approximately 90% similar.

Is there a library which can do this?

viraptor
34.3k13 gold badges114 silver badges204 bronze badges
asked Aug 23, 2010 at 20:29
1

4 Answers 4

23
import difflib
>>> a = 'alex is a buff dude'
>>> b = 'a;exx is a buff dud'
>>> difflib.SequenceMatcher(None, a, b).ratio()
0.89473684210526316
answered Aug 23, 2010 at 21:06

Comments

8

http://en.wikipedia.org/wiki/Levenshtein_distance

There are a few libraries on pypi, but be aware that this is expensive, especially for longer strings.

You may also want to check out python's difflib: http://docs.python.org/library/difflib.html

answered Aug 23, 2010 at 20:35

2 Comments

expensive? difflib is a monster compared to semi-decent Levenshtein implementations.
It wasn't my intention to suggest that difflib is less expensive -- it just does a similar, albeit a little different, thing.
6

Look for Levenshtein algorithm for comparing strings. Here's a random implementation found via google: http://hetland.org/coding/python/levenshtein.py

answered Aug 23, 2010 at 20:34

Comments

1

Other way is to use longest common substring. Here a implementation in Daniweb with my lcs implementation (this is also defined in difflib)

Here is simple length only version with list as data structure:

def longest_common_sequence(a,b):
 n1=len(a)
 n2=len(b)
 previous=[]
 for i in range(n2):
 previous.append(0)
 over = 0
 for ch1 in a:
 left = corner = 0
 for ch2 in b:
 over = previous.pop(0)
 if ch1 == ch2:
 this = corner + 1
 else:
 this = over if over >= left else left
 previous.append(this)
 left, corner = this, over
 return 200.0*previous.pop()/(n1+n2)

Here is my second version which actualy gives the common string with deque data structure (also with the example data use case):

from collections import deque
a = 'alex is a buff dude'
b = 'a;exx is a buff dud'
def lcs_tuple(a,b):
 n1=len(a)
 n2=len(b)
 previous=deque()
 for i in range(n2):
 previous.append((0,''))
 over = (0,'')
 for i in range(n1):
 left = corner = (0,'')
 for j in range(n2):
 over = previous.popleft()
 if a[i] == b[j]:
 this = corner[0] + 1, corner[1]+a[i]
 else:
 this = max(over,left)
 previous.append(this)
 left, corner = this, over
 return 200.0*this[0]/(n1+n2),this[1]
print lcs_tuple(a,b)
""" Output:
(89.47368421052632, 'aex is a buff dud')
"""
answered Aug 23, 2010 at 21:12

Comments

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.