I was doing some reading on the Wagner Fischer algorithm algorithm for computing the edit distance on strings. None of the examples in the literature I have read showed how to recover the edit sequences from the final populated matrix. The examples only showed total number of edits.
The logic I am using is simply calling min()
on the 4 x 4 submatrix of matrix L
that ends L[m][n]
until I reach L[0][0]
. It seems like the logic should be to compare the value returned by min()
and compare it to the current L[i][j]
to recover the edit at that position. Then we simply append this to an array that represents the total edit operations and finally reverse the list.
I have tested this code and compared it against the Python Levenshtein module on pypi.org and it looks to be OK. However, I wanted to get some feedback on this methodology of how to output the actual edit sequence. Is there some boundary case I might be missing?
def edit_distance(x, y):
m = len(x)
n = len(y)
L = [ [0] * (n+1) for i in xrange(m+1) ]
for i in range(0, m+1):
L[i][0] = i
for j in range(0, n+1):
L[0][j] = j
for i in range (1, m+1):
for j in range(1, n+1):
if x[i-1] == y[j-1]:
L[i][j] = L[i-1][j-1]
else:
L[i][j] = min(L[i][j-1], L[i-1][j], L[i-1][j-1]) + 1
edits = []
i = m
j = n
while i > 0 and j > 0:
back = min(L[i][j-1], L[i-1][j], L[i-1][j-1])
if back == L[i-1][j-1]:
if back == L[i][j]:
edits.append('NOOP')
elif back == L[i][j] - 1:
edits.append('SUBST')
i -= 1
j -= 1
elif back == L[i][j-1]:
edits.append('INSERT')
j -= 1
elif back == L[i-1][j]:
edits.append('DELETE')
i -= 1
print x
print edits[::-1]
print y
1 Answer 1
I've written this in python2 because the question is posed in python2. It is advised that users switch over to python3, as it is reaching end-of-life at the end of 2019.
The algorithm
I don't see anything really glaring as far as the algorithm is concerned. The major edge case I can see is with lists/strings of length 1, but you handle that case fine
Pre-allocating your matrix
Without trying to use something like numpy
, your pre-allocation step can be simplified a bit. You iterate twice over the whole list, once to fill it with zeros, and once over your respective ranges n
and m
. The n
loop can be done with a simple slice assignment:
# use xrange for initialization here
# that way you aren't allocating a list into memory
L = [[0] * (n + 1) for i in xrange(m + 1)]
# Use a slice to assign the first row
# with a range, since that assigns a list to slice
L[0][:] = range(n + 1)
# then you only have one other for loop to do
# again, just use xrange here
for i in xrange(m + 1):
L[i][0] = i
i
and j
I'm not sure the following step really does anything and is necessary:
i = m
j = n
Furthermore, the while
loop is more pythonic if you rely on the truthiness of nonzero integers:
while m and n:
# do things
while
loop
There are continuous index lookups. While lookups by index are fast, doing them over and over impairs readability, even if the penalty incurred is minor
while m and n:
current, back_y, back_x, back_both = L[m][n], L[m][n-1], L[m-1][n], L[m-1][n-1]
back = min(back_y, back_x, back_both)
# rest of loop
Now you can use these as your placeholders for your if logic
while m and n:
current, back_y, back_x, back_both = L[m][n], L[m][n-1], L[m-1][n], L[m-1][n-1]
back = min(back_y, back_x, back_both)
if back == back_both:
if back == current:
edits.append('NOOP')
elif back == current - 1:
edits.append('SUBST')
m -= 1
n -= 1
elif back == back_y:
edits.append('INSERT')
n -= 1
elif back == back_x:
edits.append('DELETE')
m -= 1
Timing
Looking at the timing differences between the two:
original
python -m timeit -s 'from otherfile import edit_distance, revised_edit; x = range(4); y = range(3,7)' 'edit_distance(x, y)'
10000 loops, best of 3: 29.3 usec per loop
revised
python -m timeit -s 'from otherfile import edit_distance, revised_edit; x = range(4); y = range(3,7)' 'revised_edit(x, y)'
100000 loops, best of 3: 10.1 usec per loop
A 65% speedup for small lists. Let's see about (relatively) larger ones
original
python -m timeit -s 'from otherfile import edit_distance, revised_edit; x = range(400); y = range(300,700)' 'edit_distance(x, y)'
10 loops, best of 3: 152 msec per loop
revised
python -m timeit -s 'from otherfile import edit_distance, revised_edit; x = range(400); y = range(300,700)' 'revised_edit(x, y)'
1000 loops, best of 3: 1.31 msec per loop
The difference is starting to diverge, two entire orders of magnitude here. Whether or not you would use this on large lists is I suppose an implementation detail, but worth a look. Not to imply that a 400 element list is big, but it's an improvement for sure.
The timing does not include the time required to reverse edits
using a edits[::-1]
because that is going to be a constant, and looking at slicing a 400 element list in this way, the time is insignificant compared to the rest of the algorithm (1000000 loops, best of 3: 0.967 usec per loop
for range(400)
, if you were curious).
return edit
You never return edits
! print
does not return. It's also advisable to change print val
to print(val)
, as the latter is portable across python 2 and 3, and it's recommended you be on python3 already anyways. So at the end of your function, be sure to include:
return edits[::-1]