I have modified a Z-Algorithm for string similarity in order to match strings which differ in 0 or 1 char.
However, it seems much slower than original Z-Algorithm although from my point of view there is only a minor different and it also has a O(n) time complexity.
Input is a string
in which a pattern
is going to be found. So the goal is to find parts in string
where pattern
is present. The goal is also to find parts in string
where the pattern
is 'almost present'. So peep
, leek
and peek
are all 'present' in string = "blahpeekblahpeepblahleek"
for pattern = "peek"
. The output are indices in which matches are found. So for my example the output is "4 12 20"
.
modified Z-Algorithm:
static String stringSimilarity(String string, String pattern ) {
String s = pattern + "$" + string;
int len = s.length();
int patternLen = pattern .length();
int[] z = new int[len];
int l, r, i;
StringBuilder result = new StringBuilder();
l = r = patternLen;
i = patternLen + 1;
while (i < len) {
//O(1)
if (i <= r)
z[i] = Math.min(z[i - l], r - i + 1);
while (i + z[i] < len && z[i] < patternLen && s.charAt(z[i]) == s.charAt(i + z[i]))
z[i] += 1;
// if the previous while encountered mismatch we accept the mismatch as match and seek for matches once more
if (i + z[i] < len && z[i] < patternLen && s.charAt(z[i]) != s.charAt(i + z[i])) {
z[i] += 1;
while (i + z[i] < len && z[i] < patternLen && s.charAt(z[i]) == s.charAt(i + z[i]))
z[i] += 1;
}
if (z[i] >= patternLen)
result.append(i - patternLen - 1).append(" ");
if (i + z[i] - 1 > r) {
l = i;
r = i + z[i] - 1;
}
i += 1;
}
return result.toString();
}
Z-Algorithm body (code is not entirely mine but I adopted it while mine was not that readable):
while (i < len) { //O(n + n) = O(n)
//O(1)
if (i <= r)
z[i] = Math.min(z[i - l], r - i + 1);
//loops only n times at maximum - O(n)
while (i + z[i] < len && s.charAt(z[i]) == s.charAt(i + z[i]))
z[i] += 1;
if (z[i] >= patternLen)
result.append(i - patternLen- 1).append(" ");
//O(1)
if (i + z[i] - 1 > r) {
l = i;
r = i + z[i] - 1;
}
i += 1; //O(1)
}
Can you please check the complexity of my modified Z-Algorithm?
From my point of view it is the same, however, tests on Hackerank show timeout.
1 Answer 1
Why it is slow
In your modified Z-algorithm, you start computing the Z array from i = patternLen + 1
. This means that for the whole pattern, your Z array will have value 0. This defeats the purpose of the Z-algorithm. Even if you removed the part where you ignore one mismatched character, this will have changed your search from a \$O(n)\$ search to a \$O(n*m)\$ search, where \$m\$ is the length of the pattern.
For example, I did a test where I searched a string containing 1 million 'a'
in a row for a pattern containing 1000 'a'
in a row. This took 4 seconds. When I changed the code to start at the beginning of the pattern (i.e. the real Z-algorithm), it took 0.08 seconds.
But...
You can't just start at i=1
and expect to fix the problem. Your "ignore one mismatch" code requires that you start over from the beginning every time because you can't use partial substring matches like you can with the Z-algorithm. The reason is you may have skipped over a mismatch, so the value in the Z array doesn't necessary correspond to the substring that you matched.
So in other words, perhaps Z-algortithm is not well suited for this problem, or at least you will need to modify it in a different way.
-
\$\begingroup\$ Lol! Firstly I thought that "when starting at
i = patternLen + 1
I will save some iterations" :P but I can see it now. \$\endgroup\$bobo– bobo2016年10月02日 22:09:36 +00:00Commented Oct 2, 2016 at 22:09 -
\$\begingroup\$ I understand now why I the algo has to start over again - if my algo puts e.g.
1
as a starting z-value for somez[i]
, it will skip thei-th
char cos it assumes it is matched. However, in reality thei-th
char can be a mismatch but the algo does not really know about it. So when some mismatch is found later on, the algo will consider this as a first mismatch a continues with the second while. Which is not correct. \$\endgroup\$bobo– bobo2016年10月02日 22:44:04 +00:00Commented Oct 2, 2016 at 22:44
res.toString()
which should beresult.toString()
\$\endgroup\$peek
matchespeek
but also matchespeep
orleek
orpeck
... \$\endgroup\$String
rather than (as might be expected) aboolean
. \$\endgroup\$