I have a piece of code that calculates the edit distance between words and works, but it's apparently not fast enough.
ClosestWords.java:
import java.util.LinkedList;
import java.util.List;
public class ClosestWords {
LinkedList<String> closestWords = null;
int closestDistance = -1;
public int dynamicEditDistance(char[] str1, char[] str2){
int temp[][] = new int[str1.length+1][str2.length+1];
for(int i=0; i < temp[0].length; i++){
temp[0][i] = i;
}
for(int i=0; i < temp.length; i++){
temp[i][0] = i;
}
for(int i=1;i <=str1.length; i++){
for(int j=1; j <= str2.length; j++){
if(str1[i-1] == str2[j-1]){
temp[i][j] = temp[i-1][j-1];
}else{
temp[i][j] = 1 + Math.min(temp[i-1][j-1], Math.min(temp[i-1][j], temp[i][j-1]));
}
}
}
return temp[str1.length][str2.length];
}
public ClosestWords(String w, List<String> wordList) {
for (String s : wordList) {
int dist = dynamicEditDistance(w.toCharArray(), s.toCharArray());
if (dist < closestDistance || closestDistance == -1) {
closestDistance = dist;
closestWords = new LinkedList<String>();
closestWords.add(s);
}
else if (dist == closestDistance)
closestWords.add(s);
}
}
int getMinDistance() {
return closestDistance;
}
List<String> getClosestWords() {
return closestWords;
}
}
Main.java:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.LinkedList;
import java.util.List;
public class Main {
public static List<String> readWordList(BufferedReader input) throws IOException {
LinkedList<String> list = new LinkedList<String>();
while (true) {
String s = input.readLine();
if (s.equals("#"))
break;
list.add(s);
}
return list;
}
public static void main(String args[]) throws IOException {
BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
List<String> wordList = readWordList(stdin);
String word;
while ((word = stdin.readLine()) != null) {
ClosestWords closestWords = new ClosestWords(word, wordList);
System.out.print(word + " (" + closestWords.getMinDistance() + ")");
for (String w : closestWords.getClosestWords())
System.out.print(" " + w);
System.out.println();
}
}
}
Example input:
skål
skålar
skålen
skålens
skålform
#
kål
k
b
sklfrm
skala
Example Output:
kål (1) skål
k (3) skål
b (4) skål
sklfrm (2) skålform
skala (2) skål skålar
The upper part of the input are all the "correct" words and the lower part of the input (below the '#') are the words that need to be corrected. The output lines are of the form:
misspeltWord (minEditDistance)
listOfPossibleCorrectWordsThatShareTheSameEditDistance
Is there a way to make this code more efficient? I am running the code through a sort of "speed test" and it keeps failing on the last part of the test.
1 Answer 1
Preliminaries: there's strategy, and there's tactics.
A somewhat common procedure to tackle performance problems is to look at "inner" loops first - not entirely wrong, but the golden rule is
measure.
(And, when turning to others for support, provide measurement results and a test data generator or test data.)
Some statements regarding edit distance:
• difference in length gives a lower bound on insertions+deletions
• accumulated differences in frequency gives a lower bound on 2*replacements+insertions+deletions
Review proper using m = len(str1) and n = len(str2):
- Document your code. In the code.
- design
• I don't like heavyweight constructors
From the preparation possible, I'd prefer a constructor taking a set of words and
• a "query function" with a word as a parameter (String
rather thanchar[]
) -
too bad returning a multipart result gets verbose, ugly or both in Java - program against interfaces, not implementations
List<String> closestWords;
dynamicEditDistance()
- I can guess what's dynamic about it, but that's an implementation detail; such does not belong in a method name:
editDistance()
- does not use any instance member: make it
static
- with "the usual" cost model you don't need a full m×ばつn array
- If your edit cost is symmetric (cost(insertion) == cost(deletion) && cost(replace(a, b)) == cost(replace(b, a))), you don't need previous row(s) and previous column(s).
- you iterate the first index in the outer loop and the second one in the inner:
That's the sequence I'd arrange initialisation
(I'd even usej
in the 2nd loop)
- I can guess what's dynamic about it, but that's an implementation detail; such does not belong in a method name:
- work currently done in
ClosestWords()
:- code the way you think about the procedure/solution
- I'd thinkw
doesn't change, let's get the chars exactly once
(a nifty language system might be doing this for you) - prefer
List.clear()
over instantiation - (today,) I'd prefer redundantly checking for minimum distance over repeating the
add()
:
- code the way you think about the procedure/solution
closestWords = new LinkedList<>();
closestDistance = Integer.MAX_VALUE;
char[] chars = w.toCharArray();
for (String s : wordList) {
int dist = editDistance(chars, s.toCharArray());
if (dist < closestDistance) {
closestDistance = dist;
closestWords.clear();
}
if (dist == closestDistance)
closestWords.add(s);
}
A "slightly" weirder approach is to handle words from the word list in order of length, first descending from same length, then above and increasing; terminating both when "more extremal length words" can't possibly have a smaller edit distance.
Trying to avoid duplicating the code now extracted as handleDistance()
"in line" got out of hand - not pleased, still.
Don't do like I do (not documenting tally
(, words
) & init()
),
do like I say (better than handleDistance()
& query()
, still)
final Comparator<String> tally = new Comparator<String>() { @Override
public int compare(String l, String r) {
if (l.equals(r))
return 0;
final int ll = l.length(), rl = r.length();
return ll < rl ? -1
: rl < ll ? 1
: l.compareTo(r);
}
};
String[] words;
void init(Collection<String> allWords) {
words = allWords.toArray(NOSTRINGS);
Arrays.sort(words, tally);
}
/** handles the distance between one pair of <code>String</code>s
* updating <code>closestDistance</code> and <code>closestWords</code>
* @param chars chars of the query <code>String</code>
* @param s <code>String</code> from <code>words</code>
*/
void handleDistance(final char[] chars, String s) {
// System.out.println(">>>" + s + '<');
final int dist = editDistance(chars, s.toCharArray());
if (dist < closestDistance) {
closestDistance = dist;
closestWords.clear();
}
if (dist == closestDistance)
closestWords.add(s);
}
/** queries <code>words</code> for lowest edit distance to <code>w</code>
* updating <code>closestDistance</code> and <code>closestWords</code>
* @param w <code>String</code> to find closest words to
* @return closest words
*/
public Collection<String> query(String w) {
final char[] chars = w.toCharArray();
int sameLength = Arrays.binarySearch(words, w, tally);
if (0 <= sameLength) {
closestDistance = 0;
return closestWords = Collections.singletonList(words[sameLength]);
}
closestDistance = Integer.MAX_VALUE;
sameLength = -sameLength; // insert index + 1
for (int i = sameLength ; 0 <= --i ; ) {
final String s = words[i];
if (closestDistance <= chars.length - s.length())
break;
handleDistance(chars, s);
}
for (int i = sameLength ; ++i < words.length ; ) {
final String s = words[i];
if (closestDistance <= s.length() - chars.length)
break;
handleDistance(chars, s);
}
return closestWords;
}
Explore related questions
See similar questions with these tags.
How can I optimize my "Edit Distance" algorithm/code?
which applies to too many questions on code review. \$\endgroup\$apparently not fast enough
the motivation to improve this code was stronger/more focused if the question spelled out what made lacking speed apparent. If this was a rating in some programming challenge: there is a tag programming-challenge. \$\endgroup\$