I have been using a slightly modified Hamming Distance algorithm for approximate String Matching for patterns and wondering if there is something better out there. The t being the length of the text and p being the length of the pattern the worst case is roughly O(t*p). Which from looking at other Fuzzy String matching seems to be in norm.
final int mismatches = 1;
final String text = "bubbles";
final String pattern = "bu";
for(int iter = 0; iter < text.length() - pattern.length() + 1; iter++)
{
int missed = 0;
int ator = 0;
do
{
if(text.charAt(iter + ator) != pattern.charAt(ator))
{
missed++;
}
}while(++ator < pattern.length() && missed <= mismatches);
if(missed <= mismatches)
{
System.out.println("Index: " + iter + " Pattern: " + text.substring(iter, iter + pattern.length()));
}
}
The output being indexes 0 bu, 2 bb, and 3 bl. The last two being mismatches with the tolerance of 1.
-
\$\begingroup\$ Please describe your slight modification: Hamming distance is only defined for equal-length strings. \$\endgroup\$Konrad Rudolph– Konrad Rudolph2012年06月16日 09:12:06 +00:00Commented Jun 16, 2012 at 9:12
-
\$\begingroup\$ Furthermore, what’s the question here? If you just want your code reviewed – it looks good, there’s nothing obvious that I’d change, except obviously to put the functionality in its own method. \$\endgroup\$Konrad Rudolph– Konrad Rudolph2012年06月16日 09:54:53 +00:00Commented Jun 16, 2012 at 9:54
-
\$\begingroup\$ The modification is that it can work for Strings of different lengths. Sorry I did not put that much emphasis on the question. For this type of pattern matching with substations but not also insertions (like a Levenshtein Distance), is this code reasonable, is there something that could be changed, or is there a more appropriate algorithm? \$\endgroup\$ntin– ntin2012年06月16日 17:25:31 +00:00Commented Jun 16, 2012 at 17:25
-
\$\begingroup\$ Looks like 'Save Humanity' Challenge and already answered here. \$\endgroup\$cat_baxter– cat_baxter2012年07月13日 14:29:03 +00:00Commented Jul 13, 2012 at 14:29
1 Answer 1
It looks fine. Some readability improvements:
I'd rename some variables first:
mismatches
should betolerance
,mismatchTolerance
orallowedMismatches
. For memismatches
sounds like a counter but it never changes which is confusing somehow.iter
andator
do not make the code easier to read.textIndex
(maybesearchIndex
) andpatternIndex
would be better since they are more meaningful and explains the purpose.
Then extract out some named variables like
final int textIndexMax = text.length() - pattern.length() + 1;
final char textChar = text.charAt(textIndex + patternIndex);
final char patternChar = pattern.charAt(patternIndex);
final String match = text.substring(textIndex, textIndex + pattern.length());
Reference: Chapter 6. Composing Methods, Introduce Explaining Variable in Refactoring: Improving the Design of Existing Code by Martin Fowler
Put the result of the expression, or parts of the expression, in a temporary variable with a name that explains the purpose.
The condition of the
do-while
loop could be rewritten as a usual (and a not so complex, therefore an easier to understand and less error-prone)for (i = 0; i < x; i++)
loop and abreak
statement.
Here is my version:
final int mismatchTolerance = 1;
final String text = "bubbles";
final String pattern = "bu";
final int textIndexMax = text.length() - pattern.length() + 1;
for (int textIndex = 0; textIndex < textIndexMax; textIndex++) {
int missed = 0;
for (int patternIndex = 0; patternIndex < pattern.length(); patternIndex++) {
final char textChar = text.charAt(textIndex + patternIndex);
final char patternChar = pattern.charAt(patternIndex);
if (textChar != patternChar) {
missed++;
}
if (missed > mismatchTolerance) {
break;
}
}
if (missed <= mismatchTolerance) {
final String match = text.substring(textIndex, textIndex + pattern.length());
System.out.println("Index: " + textIndex + " Match: " + match);
}
}