Efficiently calculating minimum edit distance of a smaller string at each position in a larger one

Question 1

Given two strings, $r$ and $s,ドル where $n = |r|,ドル $m = |s|$ and $m \ll n,ドル find the minimum edit distance between $s$ for each beginning position in $r$ efficiently.

That is, for each suffix of $r$ beginning at position $k,ドル $r_k,ドル find the Levenshtein distance of $r_k$ and $s$ for each $k \in [0, |r|-1]$. In other words, I would like an array of scores, $A,ドル such that each position, $A[k],ドル corresponds to the score of $r_k$ and $s$.

The obvious solution is to use the standard dynamic programming solution for each $r_k$ against $s$ considered separately, but this has the abysmal running time of $O(n m^2)$ (or $O(n d^2),ドル where $d$ is the maximum edit distance). It seems like you should be able to re-use the information that you've computed for $r_0$ against $s$ for the comparison with $s$ and $r_1$.

I've thought of constructing a prefix tree and then trying to do dynamic programming algorithm on $s$ against the trie, but this still has worst case $O(n d^2)$ (where $d$ is the maximum edit distance) as the trie is only optimized for efficient lookup.

Ideally I would like something that has worst case running time of $O(n d)$ though I would settle for good average case running time. Does anyone have any suggestions? Is $O(n d^2)$ the best you can do, in general?

Here are some links that might be relevant though I can't see how they would apply to the above problem as most of them are optimized for lookup only:

I've also heard some talk about using some type of distance metric to optimize search (such as a BK-tree?) but I know little about this area and how it applies to this problem.

Question 2

What you are interested in are semi-global and/or local alignments. The usual way to compute those is to adapt the dynamic programing algorithm for the Levenshtein distance:

Initialise the first row/column with 0ドル$ (instead of $i$/$j$) if free deletions/insertions are allowed at the beginning.
Select the minimum value from the last row/column as result if free deletions/insertions are allowed at the end.

Different combinations are possible. In your case, assume that $r$ is the horizontal word, you initialise the first row with 0ドル$ (the first column with $i,ドル no change here) and the result is the minimum of the last row. The result is then the smallest Levenshtein distance between $s$ and any substring of $r,ドル computed in time and space $O(nm)$.

Now, in order to get scores for all starting positions, note that the computation of semi-global alignments with allowed deletions/insertions at the end conveniently provides the wanted array in the last row/column -- well, almost: that gives you the best matches against prefixes up to position $k$. So in order to get best matches against suffixes from position $k,ドル reverse both $r$ and $s$.

Question 3

please read the question more carefully. I do not want a local alignment method. I do not want the best match of $s$ in $r$. I want all scores of of $s$ against $r_k,ドル where $r_k$ is the suffix string of $r$ starting at $k$.

Question 4

@user834: Ah, ok. Well, that's an easy extension; see my updated answer.

Question 5

If you're only interested in the score, and not the actual alignments, you can get away with $\mathcal O(n)$ space, i.e. update the alignment row-wise but keep only the latest row, which is really all you need.

Question 6

@pedro it turns out you can also retrieve the sequence in $O(n)$ space using Hirschberg's algorithm. If you recursively split the problem up and determine where the path must pass through in the subdivision, you can emit path points as you go.

Question 7

Run the usual DP on rev(r) and rev(s). The answer for the length-k suffix is in the entry (k, m) of the last column, assuming that the table is indexed as (index in r, index in s).

Question 8

I am not looking for the suffix of $r$ that is closest to $s$. I am not looking for the closest substring in $r$ that matches $s$. I am looking for the the score of each suffix or $r$ against $s$ (i.e. I'm looking for an array of scores).

Question 9

@user834: That is not what you wrote in the question; you should edit it accordingly.

Question 10

@Raphael, yes it is, please read the question more carefully. I have added the statement about wanting an array of scores explicitly.

Raphael Raphael 73.3k31 gold badges183 silver badges403 bronze badges · Accepted Answer · 2012-06-28 12:29:41Z

What you are interested in are semi-global and/or local alignments. The usual way to compute those is to adapt the dynamic programing algorithm for the Levenshtein distance:

Initialise the first row/column with 0ドル$ (instead of $i$/$j$) if free deletions/insertions are allowed at the beginning.
Select the minimum value from the last row/column as result if free deletions/insertions are allowed at the end.

Different combinations are possible. In your case, assume that $r$ is the horizontal word, you initialise the first row with 0ドル$ (the first column with $i,ドル no change here) and the result is the minimum of the last row. The result is then the smallest Levenshtein distance between $s$ and any substring of $r,ドル computed in time and space $O(nm)$.

Now, in order to get scores for all starting positions, note that the computation of semi-global alignments with allowed deletions/insertions at the end conveniently provides the wanted array in the last row/column -- well, almost: that gives you the best matches against prefixes up to position $k$. So in order to get best matches against suffixes from position $k,ドル reverse both $r$ and $s$.

please read the question more carefully. I do not want a local alignment method. I do not want the best match of $s$ in $r$. I want all scores of of $s$ against $r_k,ドル where $r_k$ is the suffix string of $r$ starting at $k$.
@user834: Ah, ok. Well, that's an easy extension; see my updated answer.
If you're only interested in the score, and not the actual alignments, you can get away with $\mathcal O(n)$ space, i.e. update the alignment row-wise but keep only the latest row, which is really all you need.
@pedro it turns out you can also retrieve the sequence in $O(n)$ space using Hirschberg's algorithm. If you recursively split the problem up and determine where the path must pass through in the subdivision, you can emit path points as you go.

Stack Exchange Network

Efficiently calculating minimum edit distance of a smaller string at each position in a larger one

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Efficiently calculating minimum edit distance of a smaller string at each position in a larger one

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions