After optimising my code it still seems pretty brute force and I am looking for ideas to fundamentally improve it.
I have tried improving performance by:
- Avoiding creating substrings as much as possible - using while loop to compare string instead.
- Intelligently anticipating ends ie. start + shift + max < str.length() - 1
- use int instead of Integers
Is there anything obvious I have missed? I have found a few alternatives solutions on the internet and would love to hear an opinion how they are different/better/worse to my implementation:
- Princeton http://introcs.cs.princeton.edu/java/42sort/LRS.java.html
- Boyer–Moore string search algorithm
- Or any of the other string search implementations (the most notable can be found on wikipedia)
And before I forget this is my code:
public static String findLongestRepeatedSubString(String str) {
int max = 0;
String result = "";
for (int start = 0; start + max < str.length() - 1; start++) {
for (int shift = 1; start + shift + max < str.length() - 1; shift++) {
int length = 0;
// While characters match count the length
while(str.charAt(start + length) == str.charAt(start + shift + length) // matching characters
&& start + shift + length < str.length() - 1){ // hasn't reached the end yet
length++;
}
// If the length is larger - update the new max size
if (length > max ) {
max = length;
result = str.substring(start, start + length + 1);
}
}
}
return result;
}
1 Answer 1
Off-by-one error
You have an off-by-one error in the innermost while loop: for input string "aa", this will not reach length 1, and give empty string as the result, which is incorrect.
I looked at this loop with suspicion, because I noticed that it checks the string bound in the second condition. The more natural code organization is to check the bound first, and then the second condition &&-ed to the first will be safe.
I suggest to rearrange that way, and for that you will need to rethink and verify all the indexes and ranges there. Which can be pretty hard to get right. So at this point, before touching anything, I suggest to add a bunch of Junit tests first, try to cover all the corner cases.
More bugs
I tried a bunch of random examples, and the program gives incorrect results:
- "abac" -> "ab"
- "abcac" -> "ab"
- "abcba" -> "bc"
As I suggested in the previous point, it will be good to add Junit tests to verify the logic.
Longest repeated substring problem
Quoting from Wikipedia:
This problem can be solved in linear time and space by building a suffix tree for the string, and finding the deepest internal node in the tree. Depth is measured by the number of characters traversed from the root.
I think you should look into that and abandon your current implementation.
-
\$\begingroup\$ Thanks the suffix tree looks like a great way to solve this. \$\endgroup\$Paul– Paul2015年06月27日 04:07:44 +00:00Commented Jun 27, 2015 at 4:07
-
1\$\begingroup\$ There is a great explanation of using suffix trees for this problem at geeksforgeeks.org/… \$\endgroup\$Paul– Paul2015年06月27日 04:38:29 +00:00Commented Jun 27, 2015 at 4:38