Modified Z-Algorithm for string similarity with one char difference

Question 1

I have modified a Z-Algorithm for string similarity in order to match strings which differ in 0 or 1 char.

However, it seems much slower than original Z-Algorithm although from my point of view there is only a minor different and it also has a O(n) time complexity.

Input is a string in which a pattern is going to be found. So the goal is to find parts in string where pattern is present. The goal is also to find parts in string where the pattern is 'almost present'. So peep, leek and peek are all 'present' in string = "blahpeekblahpeepblahleek" for pattern = "peek". The output are indices in which matches are found. So for my example the output is "4 12 20".

modified Z-Algorithm:

static String stringSimilarity(String string, String pattern ) {
String s = pattern + "$" + string;
int len = s.length();
int patternLen = pattern .length();
int[] z = new int[len];
int l, r, i;
StringBuilder result = new StringBuilder();
l = r = patternLen; 
i = patternLen + 1; 
while (i < len) {
 //O(1)
 if (i <= r)
 z[i] = Math.min(z[i - l], r - i + 1);
 while (i + z[i] < len && z[i] < patternLen && s.charAt(z[i]) == s.charAt(i + z[i]))
 z[i] += 1;
 // if the previous while encountered mismatch we accept the mismatch as match and seek for matches once more
 if (i + z[i] < len && z[i] < patternLen && s.charAt(z[i]) != s.charAt(i + z[i])) {
 z[i] += 1;
 while (i + z[i] < len && z[i] < patternLen && s.charAt(z[i]) == s.charAt(i + z[i]))
 z[i] += 1;
 }
 if (z[i] >= patternLen)
 result.append(i - patternLen - 1).append(" ");
 if (i + z[i] - 1 > r) { 
 l = i; 
 r = i + z[i] - 1;
 }
 i += 1;
}
return result.toString();
}

Z-Algorithm body (code is not entirely mine but I adopted it while mine was not that readable):

while (i < len) { //O(n + n) = O(n)
 //O(1)
 if (i <= r) 
 z[i] = Math.min(z[i - l], r - i + 1);
 //loops only n times at maximum - O(n)
 while (i + z[i] < len && s.charAt(z[i]) == s.charAt(i + z[i]))
 z[i] += 1;
 if (z[i] >= patternLen)
 result.append(i - patternLen- 1).append(" ");
 //O(1)
 if (i + z[i] - 1 > r) { 
 l = i; 
 r = i + z[i] - 1;
 } 
 i += 1; //O(1)
}

Can you please check the complexity of my modified Z-Algorithm?

From my point of view it is the same, however, tests on Hackerank show timeout.

Question 2

The method does not build due to res.toString() which should be result.toString()

Question 3

that wasn't the only place.

Question 4

What does "in order to match strings which differ in 0 or 1 char" mean? Without a clear explanation of what the code is supposed to be doing, it's hard to review it properly.

Question 5

@PeterTaylor Yeah can be hard to follow that one. But I guess that the string may may have 1 character different from the pattern. For instance peek matches peek but also matches peep or leekor peck...

Question 6

@BrunoCosta, that's one possible explanation. Another is that it may have a Levenshtein distance of 1, counting a replacement as 1 rather than 2. But regardless of what differences are permitted, the issue of inputs and outputs is also less than clear: the return value is a String rather than (as might be expected) a boolean.

Question 7

Why it is slow

In your modified Z-algorithm, you start computing the Z array from i = patternLen + 1. This means that for the whole pattern, your Z array will have value 0. This defeats the purpose of the Z-algorithm. Even if you removed the part where you ignore one mismatched character, this will have changed your search from a \$O(n)\$ search to a \$O(n*m)\$ search, where \$m\$ is the length of the pattern.

For example, I did a test where I searched a string containing 1 million 'a' in a row for a pattern containing 1000 'a' in a row. This took 4 seconds. When I changed the code to start at the beginning of the pattern (i.e. the real Z-algorithm), it took 0.08 seconds.

But...

You can't just start at i=1 and expect to fix the problem. Your "ignore one mismatch" code requires that you start over from the beginning every time because you can't use partial substring matches like you can with the Z-algorithm. The reason is you may have skipped over a mismatch, so the value in the Z array doesn't necessary correspond to the substring that you matched.

So in other words, perhaps Z-algortithm is not well suited for this problem, or at least you will need to modify it in a different way.

Question 8

Lol! Firstly I thought that "when starting at i = patternLen + 1 I will save some iterations" :P but I can see it now.

Question 9

I understand now why I the algo has to start over again - if my algo puts e.g. 1 as a starting z-value for some z[i], it will skip the i-th char cos it assumes it is matched. However, in reality the i-th char can be a mismatch but the algo does not really know about it. So when some mismatch is found later on, the algo will consider this as a first mismatch a continues with the second while. Which is not correct.

JS1 JS1 28.8k3 gold badges41 silver badges83 bronze badges · Accepted Answer · 2016-09-30 21:37:15Z

Why it is slow

In your modified Z-algorithm, you start computing the Z array from i = patternLen + 1. This means that for the whole pattern, your Z array will have value 0. This defeats the purpose of the Z-algorithm. Even if you removed the part where you ignore one mismatched character, this will have changed your search from a \$O(n)\$ search to a \$O(n*m)\$ search, where \$m\$ is the length of the pattern.

For example, I did a test where I searched a string containing 1 million 'a' in a row for a pattern containing 1000 'a' in a row. This took 4 seconds. When I changed the code to start at the beginning of the pattern (i.e. the real Z-algorithm), it took 0.08 seconds.

But...

You can't just start at i=1 and expect to fix the problem. Your "ignore one mismatch" code requires that you start over from the beginning every time because you can't use partial substring matches like you can with the Z-algorithm. The reason is you may have skipped over a mismatch, so the value in the Z array doesn't necessary correspond to the substring that you matched.

So in other words, perhaps Z-algortithm is not well suited for this problem, or at least you will need to modify it in a different way.

Lol! Firstly I thought that "when starting at i = patternLen + 1 I will save some iterations" :P but I can see it now.
I understand now why I the algo has to start over again - if my algo puts e.g. 1 as a starting z-value for some z[i], it will skip the i-th char cos it assumes it is matched. However, in reality the i-th char can be a mismatch but the algo does not really know about it. So when some mismatch is found later on, the algo will consider this as a first mismatch a continues with the second while. Which is not correct.

Stack Exchange Network

Modified Z-Algorithm for string similarity with one char difference

1 Answer 1

Why it is slow

But...

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Modified Z-Algorithm for string similarity with one char difference

1 Answer 1

Why it is slow

But...

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions