Finding the longest repeated substring in a string

Question 1

After optimising my code it still seems pretty brute force and I am looking for ideas to fundamentally improve it.

I have tried improving performance by:

Avoiding creating substrings as much as possible - using while loop to compare string instead.
Intelligently anticipating ends ie. start + shift + max < str.length() - 1
use int instead of Integers

Is there anything obvious I have missed? I have found a few alternatives solutions on the internet and would love to hear an opinion how they are different/better/worse to my implementation:

Princeton http://introcs.cs.princeton.edu/java/42sort/LRS.java.html
Boyer–Moore string search algorithm
Or any of the other string search implementations (the most notable can be found on wikipedia)

And before I forget this is my code:

 public static String findLongestRepeatedSubString(String str) {
 int max = 0;
 String result = "";
 for (int start = 0; start + max < str.length() - 1; start++) {
 for (int shift = 1; start + shift + max < str.length() - 1; shift++) {
 int length = 0;
 // While characters match count the length
 while(str.charAt(start + length) == str.charAt(start + shift + length) // matching characters
 && start + shift + length < str.length() - 1){ // hasn't reached the end yet
 length++;
 }
 // If the length is larger - update the new max size
 if (length > max ) {
 max = length;
 result = str.substring(start, start + length + 1);
 }
 }
 }
 return result;
}

Question 2

Off-by-one error

You have an off-by-one error in the innermost while loop: for input string "aa", this will not reach length 1, and give empty string as the result, which is incorrect.

I looked at this loop with suspicion, because I noticed that it checks the string bound in the second condition. The more natural code organization is to check the bound first, and then the second condition &&-ed to the first will be safe.

I suggest to rearrange that way, and for that you will need to rethink and verify all the indexes and ranges there. Which can be pretty hard to get right. So at this point, before touching anything, I suggest to add a bunch of Junit tests first, try to cover all the corner cases.

More bugs

I tried a bunch of random examples, and the program gives incorrect results:

"abac" -> "ab"
"abcac" -> "ab"
"abcba" -> "bc"

As I suggested in the previous point, it will be good to add Junit tests to verify the logic.

Longest repeated substring problem

Quoting from Wikipedia:

This problem can be solved in linear time and space by building a suffix tree for the string, and finding the deepest internal node in the tree. Depth is measured by the number of characters traversed from the root.

I think you should look into that and abandon your current implementation.

Question 3

Thanks the suffix tree looks like a great way to solve this.

Question 4

There is a great explanation of using suffix trees for this problem at geeksforgeeks.org/…

janos janos 113k15 gold badges154 silver badges396 bronze badges · Accepted Answer · 2015-06-26 07:13:42Z

Off-by-one error

You have an off-by-one error in the innermost while loop: for input string "aa", this will not reach length 1, and give empty string as the result, which is incorrect.

I looked at this loop with suspicion, because I noticed that it checks the string bound in the second condition. The more natural code organization is to check the bound first, and then the second condition &&-ed to the first will be safe.

I suggest to rearrange that way, and for that you will need to rethink and verify all the indexes and ranges there. Which can be pretty hard to get right. So at this point, before touching anything, I suggest to add a bunch of Junit tests first, try to cover all the corner cases.

More bugs

I tried a bunch of random examples, and the program gives incorrect results:

"abac" -> "ab"
"abcac" -> "ab"
"abcba" -> "bc"

As I suggested in the previous point, it will be good to add Junit tests to verify the logic.

Longest repeated substring problem

Quoting from Wikipedia:

This problem can be solved in linear time and space by building a suffix tree for the string, and finding the deepest internal node in the tree. Depth is measured by the number of characters traversed from the root.

I think you should look into that and abandon your current implementation.

Thanks the suffix tree looks like a great way to solve this.
There is a great explanation of using suffix trees for this problem at geeksforgeeks.org/…

Stack Exchange Network

Finding the longest repeated substring in a string

1 Answer 1

Off-by-one error

More bugs

Longest repeated substring problem

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Finding the longest repeated substring in a string

1 Answer 1

Off-by-one error

More bugs

Longest repeated substring problem

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions