Edit Distance Implementation

Question 1

I was doing some reading on the Wagner Fischer algorithm algorithm for computing the edit distance on strings. None of the examples in the literature I have read showed how to recover the edit sequences from the final populated matrix. The examples only showed total number of edits.

The logic I am using is simply calling min() on the 4 x 4 submatrix of matrix L that ends L[m][n] until I reach L[0][0]. It seems like the logic should be to compare the value returned by min() and compare it to the current L[i][j] to recover the edit at that position. Then we simply append this to an array that represents the total edit operations and finally reverse the list.

I have tested this code and compared it against the Python Levenshtein module on pypi.org and it looks to be OK. However, I wanted to get some feedback on this methodology of how to output the actual edit sequence. Is there some boundary case I might be missing?

def edit_distance(x, y):
 m = len(x)
 n = len(y)
 L = [ [0] * (n+1) for i in xrange(m+1) ]
 for i in range(0, m+1):
 L[i][0] = i
 for j in range(0, n+1):
 L[0][j] = j
 for i in range (1, m+1):
 for j in range(1, n+1):
 if x[i-1] == y[j-1]:
 L[i][j] = L[i-1][j-1]
 else:
 L[i][j] = min(L[i][j-1], L[i-1][j], L[i-1][j-1]) + 1
 edits = []
 i = m
 j = n
 while i > 0 and j > 0:
 back = min(L[i][j-1], L[i-1][j], L[i-1][j-1])
 if back == L[i-1][j-1]:
 if back == L[i][j]:
 edits.append('NOOP')
 elif back == L[i][j] - 1:
 edits.append('SUBST')
 i -= 1
 j -= 1
 elif back == L[i][j-1]:
 edits.append('INSERT')
 j -= 1
 elif back == L[i-1][j]:
 edits.append('DELETE')
 i -= 1
 print x
 print edits[::-1]
 print y

Question 2

I've written this in python2 because the question is posed in python2. It is advised that users switch over to python3, as it is reaching end-of-life at the end of 2019.

The algorithm

I don't see anything really glaring as far as the algorithm is concerned. The major edge case I can see is with lists/strings of length 1, but you handle that case fine

Pre-allocating your matrix

Without trying to use something like numpy, your pre-allocation step can be simplified a bit. You iterate twice over the whole list, once to fill it with zeros, and once over your respective ranges n and m. The n loop can be done with a simple slice assignment:

# use xrange for initialization here
# that way you aren't allocating a list into memory
L = [[0] * (n + 1) for i in xrange(m + 1)]
# Use a slice to assign the first row
# with a range, since that assigns a list to slice
L[0][:] = range(n + 1)
# then you only have one other for loop to do
# again, just use xrange here
for i in xrange(m + 1):
 L[i][0] = i

`i` and `j`

I'm not sure the following step really does anything and is necessary:

i = m
j = n

Furthermore, the while loop is more pythonic if you rely on the truthiness of nonzero integers:

while m and n:
 # do things

`while` loop

There are continuous index lookups. While lookups by index are fast, doing them over and over impairs readability, even if the penalty incurred is minor

while m and n:
 current, back_y, back_x, back_both = L[m][n], L[m][n-1], L[m-1][n], L[m-1][n-1]
 back = min(back_y, back_x, back_both)
 # rest of loop

Now you can use these as your placeholders for your if logic

while m and n:
 current, back_y, back_x, back_both = L[m][n], L[m][n-1], L[m-1][n], L[m-1][n-1]
 back = min(back_y, back_x, back_both)
 if back == back_both:
 if back == current:
 edits.append('NOOP')
 elif back == current - 1:
 edits.append('SUBST')
 m -= 1
 n -= 1
 elif back == back_y:
 edits.append('INSERT')
 n -= 1
 elif back == back_x:
 edits.append('DELETE')
 m -= 1

Timing

Looking at the timing differences between the two:

original

python -m timeit -s 'from otherfile import edit_distance, revised_edit; x = range(4); y = range(3,7)' 'edit_distance(x, y)'
10000 loops, best of 3: 29.3 usec per loop

revised

python -m timeit -s 'from otherfile import edit_distance, revised_edit; x = range(4); y = range(3,7)' 'revised_edit(x, y)'
100000 loops, best of 3: 10.1 usec per loop

A 65% speedup for small lists. Let's see about (relatively) larger ones

original

python -m timeit -s 'from otherfile import edit_distance, revised_edit; x = range(400); y = range(300,700)' 'edit_distance(x, y)'
10 loops, best of 3: 152 msec per loop

revised

python -m timeit -s 'from otherfile import edit_distance, revised_edit; x = range(400); y = range(300,700)' 'revised_edit(x, y)'
1000 loops, best of 3: 1.31 msec per loop

The difference is starting to diverge, two entire orders of magnitude here. Whether or not you would use this on large lists is I suppose an implementation detail, but worth a look. Not to imply that a 400 element list is big, but it's an improvement for sure.

The timing does not include the time required to reverse edits using a edits[::-1] because that is going to be a constant, and looking at slicing a 400 element list in this way, the time is insignificant compared to the rest of the algorithm (1000000 loops, best of 3: 0.967 usec per loop for range(400), if you were curious).

`return edit`

You never return edits! print does not return. It's also advisable to change print val to print(val), as the latter is portable across python 2 and 3, and it's recommended you be on python3 already anyways. So at the end of your function, be sure to include:

 return edits[::-1]

C.Nivs C.Nivs 3,11714 silver badges32 bronze badges · Answer 1 · 2019-10-01 21:25:19Z

I've written this in python2 because the question is posed in python2. It is advised that users switch over to python3, as it is reaching end-of-life at the end of 2019.

The algorithm

I don't see anything really glaring as far as the algorithm is concerned. The major edge case I can see is with lists/strings of length 1, but you handle that case fine

Pre-allocating your matrix

Without trying to use something like numpy, your pre-allocation step can be simplified a bit. You iterate twice over the whole list, once to fill it with zeros, and once over your respective ranges n and m. The n loop can be done with a simple slice assignment:

# use xrange for initialization here
# that way you aren't allocating a list into memory
L = [[0] * (n + 1) for i in xrange(m + 1)]
# Use a slice to assign the first row
# with a range, since that assigns a list to slice
L[0][:] = range(n + 1)
# then you only have one other for loop to do
# again, just use xrange here
for i in xrange(m + 1):
 L[i][0] = i

`i` and `j`

I'm not sure the following step really does anything and is necessary:

i = m
j = n

Furthermore, the while loop is more pythonic if you rely on the truthiness of nonzero integers:

while m and n:
 # do things

`while` loop

There are continuous index lookups. While lookups by index are fast, doing them over and over impairs readability, even if the penalty incurred is minor

while m and n:
 current, back_y, back_x, back_both = L[m][n], L[m][n-1], L[m-1][n], L[m-1][n-1]
 back = min(back_y, back_x, back_both)
 # rest of loop

Now you can use these as your placeholders for your if logic

while m and n:
 current, back_y, back_x, back_both = L[m][n], L[m][n-1], L[m-1][n], L[m-1][n-1]
 back = min(back_y, back_x, back_both)
 if back == back_both:
 if back == current:
 edits.append('NOOP')
 elif back == current - 1:
 edits.append('SUBST')
 m -= 1
 n -= 1
 elif back == back_y:
 edits.append('INSERT')
 n -= 1
 elif back == back_x:
 edits.append('DELETE')
 m -= 1

Timing

Looking at the timing differences between the two:

original

python -m timeit -s 'from otherfile import edit_distance, revised_edit; x = range(4); y = range(3,7)' 'edit_distance(x, y)'
10000 loops, best of 3: 29.3 usec per loop

revised

python -m timeit -s 'from otherfile import edit_distance, revised_edit; x = range(4); y = range(3,7)' 'revised_edit(x, y)'
100000 loops, best of 3: 10.1 usec per loop

A 65% speedup for small lists. Let's see about (relatively) larger ones

original

python -m timeit -s 'from otherfile import edit_distance, revised_edit; x = range(400); y = range(300,700)' 'edit_distance(x, y)'
10 loops, best of 3: 152 msec per loop

revised

python -m timeit -s 'from otherfile import edit_distance, revised_edit; x = range(400); y = range(300,700)' 'revised_edit(x, y)'
1000 loops, best of 3: 1.31 msec per loop

The difference is starting to diverge, two entire orders of magnitude here. Whether or not you would use this on large lists is I suppose an implementation detail, but worth a look. Not to imply that a 400 element list is big, but it's an improvement for sure.

The timing does not include the time required to reverse edits using a edits[::-1] because that is going to be a constant, and looking at slicing a 400 element list in this way, the time is insignificant compared to the rest of the algorithm (1000000 loops, best of 3: 0.967 usec per loop for range(400), if you were curious).

`return edit`

You never return edits! print does not return. It's also advisable to change print val to print(val), as the latter is portable across python 2 and 3, and it's recommended you be on python3 already anyways. So at the end of your function, be sure to include:

 return edits[::-1]

Stack Exchange Network

Edit Distance Implementation

1 Answer 1

The algorithm

Pre-allocating your matrix

`i` and `j`

`while` loop

Timing

original

revised

original

revised

`return edit`

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Edit Distance Implementation

1 Answer 1

The algorithm

Pre-allocating your matrix

i and j

while loop

Timing

original

revised

original

revised

return edit

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

`i` and `j`

`while` loop

`return edit`