Fuzzy search for a sub-string without tokens

Question 1

Let's say I have the following lines:

Lorem ipsum dolor sit amet, (tag) consectetur adipiscing elit.
Phasellus congue nisi vel lorem dignissim tristique. (tag)
Etiam vulputate lacus nec velit lobortis ut adipiscing mauris condimentum. (tag )
Vestibulum quis nulla id nisi ultricies placerat. (teg)
In tempus vulputate ante, quis suscipit nunc euismod in. (teg )
Phasellus iaculis diam in felis pellentesque tristique. (ta g)
(tag Praesent vestibulum sollicitudin velit, non mollis lacus scelerisque in.
Nunc ac erat enim, at lobortis libero. (interesting)

There's supposed to be a tag in each line enclosed in parenthesis. But as you can see the data is very dirty and there's extra spaces inside tags, missing parents, misspelled tags, etc.

How would I identify the lines containing "(tag)" or something similar? The output I would expect in the example is to retrieve all lines, except for the last.

I need to compare a long string to a shorter sub-string, so something like Levenshtein distance wont work (or will it?). Tokenized techniques seem to rigid, because although I might tokenize everything in parenthesis, what happens when a parenthesis is missing?

Question 2

I've done something similar to this a while back, only I was just matching words to a sought word. This might not help you, but it might give you an idea.

Essentially, what I did was something like this:

For each word in my range and did something similar to a diff. I considered the letter in the string and its position. For correct letters I gave a point, for correct position in the string I gave a point. For bad and placement I would remove a point, and for missing letters I would remove a point. (It's hard to remember, because I don't have access to the source anymore, but I think I also based the placement point off the distance from the expected position).

After I tallied up the 'score' of this word, I would normalize. I had an arbitrary threshold (found via trial and error, but it could be feedback driven) that I used to determine if it was close enough.

Then all of the words that were above the threshold were returned to the user in descending order.

IIRC, it wasn't terribly effective for short words, but overall, it was fairly effective for its implementation (search over a specific domain).

Hope that helps you off into a productive direction.

Question 3

Sounds like an ad-hoc version of the Levenshtein distance algorithm.

Question 4

I will answer my own question with these links:

http://codehost.wordpress.com/2011/09/13/fuzzy-substring-matching/

http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/

The second link contains the following function:

def fuzzy_substring(needle, haystack):
 """Calculates the fuzzy match of needle in haystack,
 using a modified version of the Levenshtein distance
 algorithm.
 The function is modified from the levenshtein function
 in the bktree module by Adam Hupp"""
 m, n = len(needle), len(haystack)
 # base cases
 if m == 1:
 return not needle in haystack
 if not n:
 return m
 row1 = [0] * (n+1)
 for i in range(0,m):
 row2 = [i+1]
 for j in range(0,n):
 cost = ( needle[i] != haystack[j] )
 row2.append( min(row1[j+1]+1, # deletion
 row2[j]+1, #insertion
 row1[j]+cost) #substitution
 )
 row1 = row2
 return min(row1)
lines = """Lorem ipsum dolor sit amet, (tag) consectetur adipiscing elit.
Phasellus congue nisi vel lorem dignissim tristique. (tag)
Etiam vulputate lacus nec velit lobortis ut adipiscing mauris condimentum. (tag )
Vestibulum quis nulla id nisi ultricies placerat. (teg)
In tempus vulputate ante, quis suscipit nunc euismod in. (teg )
Phasellus iaculis diam in felis pellentesque tristique. (ta g)
(tag Praesent vestibulum sollicitudin velit, non mollis lacus scelerisque in.
Nunc ac erat enim, at lobortis libero. (interesting)""".split('\n')
for line in lines:
 print fuzzy_substring('(tag)', line), line

The output is:

0 Lorem ipsum dolor sit amet, (tag) consectetur adipiscing elit.
0 Phasellus congue nisi vel lorem dignissim tristique. (tag)
1 Etiam vulputate lacus nec velit lobortis ut adipiscing mauris condimentum. (tag )
1 Vestibulum quis nulla id nisi ultricies placerat. (teg)
2 In tempus vulputate ante, quis suscipit nunc euismod in. (teg )
1 Phasellus iaculis diam in felis pellentesque tristique. (ta g)
1 (tag Praesent vestibulum sollicitudin velit, non mollis lacus scelerisque in.
3 Nunc ac erat enim, at lobortis libero. (interesting)

Steven Evers Steven Evers 28.2k10 gold badges77 silver badges161 bronze badges · Answer 1 · 2012-05-16 17:08:38Z

I've done something similar to this a while back, only I was just matching words to a sought word. This might not help you, but it might give you an idea.

Essentially, what I did was something like this:

For each word in my range and did something similar to a diff. I considered the letter in the string and its position. For correct letters I gave a point, for correct position in the string I gave a point. For bad and placement I would remove a point, and for missing letters I would remove a point. (It's hard to remember, because I don't have access to the source anymore, but I think I also based the placement point off the distance from the expected position).

After I tallied up the 'score' of this word, I would normalize. I had an arbitrary threshold (found via trial and error, but it could be feedback driven) that I used to determine if it was close enough.

Then all of the words that were above the threshold were returned to the user in descending order.

IIRC, it wasn't terribly effective for short words, but overall, it was fairly effective for its implementation (search over a specific domain).

Hope that helps you off into a productive direction.

Sounds like an ad-hoc version of the Levenshtein distance algorithm.

Buttons840 Buttons840 1,8861 gold badge18 silver badges28 bronze badges · Answer 2 · 2012-05-17 15:30:35Z

I will answer my own question with these links:

http://codehost.wordpress.com/2011/09/13/fuzzy-substring-matching/

http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/

The second link contains the following function:

def fuzzy_substring(needle, haystack):
 """Calculates the fuzzy match of needle in haystack,
 using a modified version of the Levenshtein distance
 algorithm.
 The function is modified from the levenshtein function
 in the bktree module by Adam Hupp"""
 m, n = len(needle), len(haystack)
 # base cases
 if m == 1:
 return not needle in haystack
 if not n:
 return m
 row1 = [0] * (n+1)
 for i in range(0,m):
 row2 = [i+1]
 for j in range(0,n):
 cost = ( needle[i] != haystack[j] )
 row2.append( min(row1[j+1]+1, # deletion
 row2[j]+1, #insertion
 row1[j]+cost) #substitution
 )
 row1 = row2
 return min(row1)
lines = """Lorem ipsum dolor sit amet, (tag) consectetur adipiscing elit.
Phasellus congue nisi vel lorem dignissim tristique. (tag)
Etiam vulputate lacus nec velit lobortis ut adipiscing mauris condimentum. (tag )
Vestibulum quis nulla id nisi ultricies placerat. (teg)
In tempus vulputate ante, quis suscipit nunc euismod in. (teg )
Phasellus iaculis diam in felis pellentesque tristique. (ta g)
(tag Praesent vestibulum sollicitudin velit, non mollis lacus scelerisque in.
Nunc ac erat enim, at lobortis libero. (interesting)""".split('\n')
for line in lines:
 print fuzzy_substring('(tag)', line), line

The output is:

0 Lorem ipsum dolor sit amet, (tag) consectetur adipiscing elit.
0 Phasellus congue nisi vel lorem dignissim tristique. (tag)
1 Etiam vulputate lacus nec velit lobortis ut adipiscing mauris condimentum. (tag )
1 Vestibulum quis nulla id nisi ultricies placerat. (teg)
2 In tempus vulputate ante, quis suscipit nunc euismod in. (teg )
1 Phasellus iaculis diam in felis pellentesque tristique. (ta g)
1 (tag Praesent vestibulum sollicitudin velit, non mollis lacus scelerisque in.
3 Nunc ac erat enim, at lobortis libero. (interesting)

Stack Exchange Network

Fuzzy search for a sub-string without tokens

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Fuzzy search for a sub-string without tokens

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions