1

Let's say I have the following lines:

Lorem ipsum dolor sit amet, (tag) consectetur adipiscing elit.
Phasellus congue nisi vel lorem dignissim tristique. (tag)
Etiam vulputate lacus nec velit lobortis ut adipiscing mauris condimentum. (tag )
Vestibulum quis nulla id nisi ultricies placerat. (teg)
In tempus vulputate ante, quis suscipit nunc euismod in. (teg )
Phasellus iaculis diam in felis pellentesque tristique. (ta g)
(tag Praesent vestibulum sollicitudin velit, non mollis lacus scelerisque in.
Nunc ac erat enim, at lobortis libero. (interesting)

There's supposed to be a tag in each line enclosed in parenthesis. But as you can see the data is very dirty and there's extra spaces inside tags, missing parents, misspelled tags, etc.

How would I identify the lines containing "(tag)" or something similar? The output I would expect in the example is to retrieve all lines, except for the last.

I need to compare a long string to a shorter sub-string, so something like Levenshtein distance wont work (or will it?). Tokenized techniques seem to rigid, because although I might tokenize everything in parenthesis, what happens when a parenthesis is missing?

gnat
20.5k29 gold badges117 silver badges308 bronze badges
asked May 16, 2012 at 16:52

2 Answers 2

2

I've done something similar to this a while back, only I was just matching words to a sought word. This might not help you, but it might give you an idea.

Essentially, what I did was something like this:

For each word in my range and did something similar to a diff. I considered the letter in the string and its position. For correct letters I gave a point, for correct position in the string I gave a point. For bad and placement I would remove a point, and for missing letters I would remove a point. (It's hard to remember, because I don't have access to the source anymore, but I think I also based the placement point off the distance from the expected position).

After I tallied up the 'score' of this word, I would normalize. I had an arbitrary threshold (found via trial and error, but it could be feedback driven) that I used to determine if it was close enough.

Then all of the words that were above the threshold were returned to the user in descending order.

IIRC, it wasn't terribly effective for short words, but overall, it was fairly effective for its implementation (search over a specific domain).

Hope that helps you off into a productive direction.

answered May 16, 2012 at 17:08
1
  • 1
    Sounds like an ad-hoc version of the Levenshtein distance algorithm. Commented May 16, 2012 at 17:21
0

I will answer my own question with these links:

http://codehost.wordpress.com/2011/09/13/fuzzy-substring-matching/

http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/

The second link contains the following function:

def fuzzy_substring(needle, haystack):
 """Calculates the fuzzy match of needle in haystack,
 using a modified version of the Levenshtein distance
 algorithm.
 The function is modified from the levenshtein function
 in the bktree module by Adam Hupp"""
 m, n = len(needle), len(haystack)
 # base cases
 if m == 1:
 return not needle in haystack
 if not n:
 return m
 row1 = [0] * (n+1)
 for i in range(0,m):
 row2 = [i+1]
 for j in range(0,n):
 cost = ( needle[i] != haystack[j] )
 row2.append( min(row1[j+1]+1, # deletion
 row2[j]+1, #insertion
 row1[j]+cost) #substitution
 )
 row1 = row2
 return min(row1)
lines = """Lorem ipsum dolor sit amet, (tag) consectetur adipiscing elit.
Phasellus congue nisi vel lorem dignissim tristique. (tag)
Etiam vulputate lacus nec velit lobortis ut adipiscing mauris condimentum. (tag )
Vestibulum quis nulla id nisi ultricies placerat. (teg)
In tempus vulputate ante, quis suscipit nunc euismod in. (teg )
Phasellus iaculis diam in felis pellentesque tristique. (ta g)
(tag Praesent vestibulum sollicitudin velit, non mollis lacus scelerisque in.
Nunc ac erat enim, at lobortis libero. (interesting)""".split('\n')
for line in lines:
 print fuzzy_substring('(tag)', line), line

The output is:

0 Lorem ipsum dolor sit amet, (tag) consectetur adipiscing elit.
0 Phasellus congue nisi vel lorem dignissim tristique. (tag)
1 Etiam vulputate lacus nec velit lobortis ut adipiscing mauris condimentum. (tag )
1 Vestibulum quis nulla id nisi ultricies placerat. (teg)
2 In tempus vulputate ante, quis suscipit nunc euismod in. (teg )
1 Phasellus iaculis diam in felis pellentesque tristique. (ta g)
1 (tag Praesent vestibulum sollicitudin velit, non mollis lacus scelerisque in.
3 Nunc ac erat enim, at lobortis libero. (interesting)
answered May 17, 2012 at 15:30

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.