Given two strings, $r$ and $s,ドル where $n = |r|,ドル $m = |s|$ and $m \ll n,ドル find the minimum edit distance between $s$ for each beginning position in $r$ efficiently.
That is, for each suffix of $r$ beginning at position $k,ドル $r_k,ドル find the Levenshtein distance of $r_k$ and $s$ for each $k \in [0, |r|-1]$. In other words, I would like an array of scores, $A,ドル such that each position, $A[k],ドル corresponds to the score of $r_k$ and $s$.
The obvious solution is to use the standard dynamic programming solution for each $r_k$ against $s$ considered separately, but this has the abysmal running time of $O(n m^2)$ (or $O(n d^2),ドル where $d$ is the maximum edit distance). It seems like you should be able to re-use the information that you've computed for $r_0$ against $s$ for the comparison with $s$ and $r_1$.
I've thought of constructing a prefix tree and then trying to do dynamic programming algorithm on $s$ against the trie, but this still has worst case $O(n d^2)$ (where $d$ is the maximum edit distance) as the trie is only optimized for efficient lookup.
Ideally I would like something that has worst case running time of $O(n d)$ though I would settle for good average case running time. Does anyone have any suggestions? Is $O(n d^2)$ the best you can do, in general?
Here are some links that might be relevant though I can't see how they would apply to the above problem as most of them are optimized for lookup only:
- Fast and Easy Levensthein distance using a Trie
- SO: Most efficient way to calculate Levenshtein distance
- SO: Levenshtein Distance Algoirthm better than $O(n m)$
- An extension of Ukkonen's enhanced dynamic programming ASM algorithm
- Damn Cool Algorithms: Levenshtein Automata
I've also heard some talk about using some type of distance metric to optimize search (such as a BK-tree?) but I know little about this area and how it applies to this problem.
2 Answers 2
What you are interested in are semi-global and/or local alignments. The usual way to compute those is to adapt the dynamic programing algorithm for the Levenshtein distance:
- Initialise the first row/column with 0ドル$ (instead of $i$/$j$) if free deletions/insertions are allowed at the beginning.
- Select the minimum value from the last row/column as result if free deletions/insertions are allowed at the end.
Different combinations are possible. In your case, assume that $r$ is the horizontal word, you initialise the first row with 0ドル$ (the first column with $i,ドル no change here) and the result is the minimum of the last row. The result is then the smallest Levenshtein distance between $s$ and any substring of $r,ドル computed in time and space $O(nm)$.
Now, in order to get scores for all starting positions, note that the computation of semi-global alignments with allowed deletions/insertions at the end conveniently provides the wanted array in the last row/column -- well, almost: that gives you the best matches against prefixes up to position $k$. So in order to get best matches against suffixes from position $k,ドル reverse both $r$ and $s$.
-
$\begingroup$ please read the question more carefully. I do not want a local alignment method. I do not want the best match of $s$ in $r$. I want all scores of of $s$ against $r_k,ドル where $r_k$ is the suffix string of $r$ starting at $k$. $\endgroup$user834– user8342012年06月28日 14:42:22 +00:00Commented Jun 28, 2012 at 14:42
-
$\begingroup$ @user834: Ah, ok. Well, that's an easy extension; see my updated answer. $\endgroup$Raphael– Raphael2012年06月28日 14:58:27 +00:00Commented Jun 28, 2012 at 14:58
-
1$\begingroup$ If you're only interested in the score, and not the actual alignments, you can get away with $\mathcal O(n)$ space, i.e. update the alignment row-wise but keep only the latest row, which is really all you need. $\endgroup$Pedro– Pedro2012年06月28日 16:02:54 +00:00Commented Jun 28, 2012 at 16:02
-
$\begingroup$ @pedro it turns out you can also retrieve the sequence in $O(n)$ space using Hirschberg's algorithm. If you recursively split the problem up and determine where the path must pass through in the subdivision, you can emit path points as you go. $\endgroup$user834– user8342015年12月02日 03:01:14 +00:00Commented Dec 2, 2015 at 3:01
Run the usual DP on rev(r) and rev(s). The answer for the length-k suffix is in the entry (k, m) of the last column, assuming that the table is indexed as (index in r, index in s).
-
$\begingroup$ I am not looking for the suffix of $r$ that is closest to $s$. I am not looking for the closest substring in $r$ that matches $s$. I am looking for the the score of each suffix or $r$ against $s$ (i.e. I'm looking for an array of scores). $\endgroup$user834– user8342012年06月28日 06:34:07 +00:00Commented Jun 28, 2012 at 6:34
-
$\begingroup$ @user834: That is not what you wrote in the question; you should edit it accordingly. $\endgroup$Raphael– Raphael2012年06月28日 12:22:14 +00:00Commented Jun 28, 2012 at 12:22
-
$\begingroup$ @Raphael, yes it is, please read the question more carefully. I have added the statement about wanting an array of scores explicitly. $\endgroup$user834– user8342012年06月28日 14:39:22 +00:00Commented Jun 28, 2012 at 14:39
Explore related questions
See similar questions with these tags.