Determining if two Strings have common subtrings of a given length in Java

Question 1

Problem:

Given 2 strings, consider all the substrings within them of length len. Len will be 1 or more. Returns true if there are any such substrings which appear in both strings. Compute this in linear time using a HashSet.

I know as a matter of fact that there are many areas upon which my code could be improved. However, rather than asking other people to correct my code, I would much prefer to be given hints or general principles as to how my code could be improved. I think this way, me, and everyone that read this post could get more out of this exercise this way.

So could someone please provide me with some comments or suggestions? Critics of any level and kind are welcomed! Feel free to pick apart my code!

Class:

import java.util.HashSet;
import java.util.Set;
// CS108 HW1 -- String static methods
public class StringCode {
 public static boolean stringIntersect(String firstString, String secondString, int lengthOfSubstring) {
 boolean sameSubstringInBothStrings = false;
 Set<String> setOne = StringCode.getSubstringsSet(firstString, lengthOfSubstring);
 Set<String> setTwo = StringCode.getSubstringsSet(secondString, lengthOfSubstring);
 //compare setOne and setTwo to find out if there is any matching elements
 outerForLoop:
 for(String aSubstringInSetOne : setOne){
 for(String aSubstringInSetTwo : setTwo){
 if(aSubstringInSetOne.equals(aSubstringInSetTwo)){
 sameSubstringInBothStrings = true;
 break outerForLoop;
 }
 }
 }
 return sameSubstringInBothStrings;
 }
 static Set<String> getSubstringsSet(String aString, int lengthOfSubstring){
 Set<String> setOfSubstrings = new HashSet<>();
 if(aString.length() < lengthOfSubstring){
 return setOfSubstrings;
 }
 if(aString.length() == lengthOfSubstring){
 setOfSubstrings.add(aString);
 return setOfSubstrings;
 }
 char[] charArray = aString.toCharArray();
 //starting from the first index, going through aString 
 for(int i = 0; i <= aString.length() - lengthOfSubstring; i++){
 StringBuilder sb = new StringBuilder();
 //add each substring of length (lengthOfSubstring) to the setOfSubstrings. 
 for(int j = 0; j < lengthOfSubstring; j++){
 sb.append(charArray[i + j]);
 }
 setOfSubstrings.add(sb.toString());
 }
 return setOfSubstrings;
 }
}

Test:

import assign1.StringCode;
import static org.junit.Assert.*;
import org.junit.Test;
public class StringCodeTest {
 @Test
 public void testStringIntersectNormalStrings(){
 assertEquals(true, StringCode.stringIntersect("abcdickefg" , "zyxqdick", 4));
 }
 @Test
 public void testStringIntersectNormalStrings2(){
 assertEquals(false, StringCode.stringIntersect("abcdef", "zzzzzz", 1));
 }
 @Test
 public void testEmptyString(){
 assertEquals(false, StringCode.stringIntersect("abc", "", 2));
 }
 @Test
 public void testLengthOfSubstringGreaterThanTheStringItself(){
 assertEquals(false, StringCode.stringIntersect("abc", "xyz", 5));
 }
 @Test 
 public void testTwoEmptyStrings(){
 assertEquals(false, StringCode.stringIntersect("", "", 2));
 }
}

Question 2

Don't add unnecessary pre-checks

Your method getSubstringsSet starts with two early returns:

if(aString.length() < lengthOfSubstring){
 return setOfSubstrings;
}
if(aString.length() == lengthOfSubstring){
 setOfSubstrings.add(aString);
 return setOfSubstrings;
}

You actually don't need them. The rest of the code handles both cases just fine and it doesn't add to clarity or improve performance. Early-returns should be used when the rest of the code can then assume a condition that was checked, making its code simpler; but it is not the case here.

Use built-in methods

To add the substrings, you are currently doing:

StringBuilder sb = new StringBuilder();
//add each substring of length (lengthOfSubstring) to the setOfSubstrings. 
for(int j = 0; j < lengthOfSubstring; j++){
 sb.append(charArray[i + j]);
}
setOfSubstrings.add(sb.toString());

So you're creating a new StringBuilder and then appending each character. This is a complicated way of retrieving the substring between two indices. It also allocates a new StringBuilder each time which isn't really needed.

You can just have

setOfSubstrings.add(aString.substring(i, i + lengthOfSubstring));

In the same way, stringIntersect is a lot more complicated than it should be.

First of all, you are using a label outerForLoop and then using it to break out of the inner loop with break outerForLoop;. This is generally not a good practice. What this shows is a missing method: what you really want is to make a method isIntersection(set1, set2) that would determine if the two sets have elements in common.

However, you don't even need to write one, since it is already built-in: Collections#disjoint. This method returns true if the two given collections are disjoint, i.e. have no elements in common. Therefore, you can remove your double for loop and just have:

boolean sameSubstringInBothStrings = !Collections.disjoint(setOne, setTwo);

Question 3

it would not be linear time anyway. Comparing two string is o(n). Comparing m strings (what is the main issue of this strategy) is o(nm).

Question 4

@rdllopes Hmm are you sure? Building the two sets is linear and the disjoint call is also linear since contains is constant-time for sets. There could be an issue with substring though.

Question 5

Yes, the problem is with substring. Substring operation is O(n) (there is no hope here... substring is an arrayCopy , what is the fastest O(n) copy operation but yet O(n)) So building a set with all substrings is O(mn).

Question 6

But also, hashCode calculation of a String is O(n) and same for equals. So Collections.disjoint, retains, etc rely on the hashCode and equals methods. Executing those operation for all elements is O(n^2).

Question 7

Yes. Len is arbitrary value. It could be considered constant value or part of complexity. I think the question is considering the first interpretation and you are right.

Question 8

1

What comes to stringIntersect, you don't need to use sameSubstringInBothStrings. Instead, just return true or false as soon as you know the result of the method:

public static boolean stringIntersect(String firstString, String secondString, int lengthOfSubstring) {
 boolean sameSubstringInBothStrings = false;
 Set<String> setOne = StringCode.getSubstringsSet(firstString, lengthOfSubstring);
 Set<String> setTwo = StringCode.getSubstringsSet(secondString, lengthOfSubstring);
 //compare setOne and setTwo to find out if there is any matching elements
 outerForLoop:
 for(String aSubstringInSetOne : setOne){
 for(String aSubstringInSetTwo : setTwo){
 if(aSubstringInSetOne.equals(aSubstringInSetTwo)){
 return true;
 }
 }
 }
 return false;
}

2 Profile when in doubt

I don't see why your Set kicking preprocessing would be any better than a dumb brute force solution to the problem. I had this in mind...

import java.util.Random;
public class StringCodeV2 {
 public static boolean stringIntersect(String firstString, 
 String secondString, 
 int k) {
 checkSubstringSize(k);
 final char[] firstStringChars = firstString.toCharArray();
 final char[] secondStringChars = secondString.toCharArray();
 for (int firstStringStartIndex = 0; 
 firstStringStartIndex < firstStringChars.length - k + 1; 
 firstStringStartIndex++) {
 for (int secondStringStartIndex = 0;
 secondStringStartIndex < secondStringChars.length - k + 1; 
 secondStringStartIndex++) {
 int charsRead = 0;
 while (firstStringStartIndex + charsRead < 
 firstStringChars.length
 && 
 secondStringStartIndex + charsRead < 
 secondStringChars.length) {
 if (firstStringChars[firstStringStartIndex + charsRead] ==
 secondStringChars[secondStringStartIndex + charsRead]) {
 charsRead++;
 if (charsRead == k) {
 return true;
 }
 } else {
 break;
 }
 }
 }
 }
 return false;
 }
 private static void checkSubstringSize(int k) {
 if (k < 1) {
 throw new IllegalArgumentException(
 "The length of the substring is less than 1.");
 }
 }
 public static void main(String[] args) {
 System.out.println("[STATUS] Warming up...");
 warmup();
 System.out.println("[STATUS] Warming up done!");
 long seed = System.currentTimeMillis();
 Random random = new Random(seed);
 String a = getRandomString(5000, random);
 String b = getRandomString(5000, random);
 System.out.println("Seed = " + seed);
 long startTime = System.nanoTime();
 boolean intersect1 = StringCode.stringIntersect(a, b, 5);
 long endTime = System.nanoTime();
 System.out.println("StringCode.stringIntersect in " + 
 (endTime - startTime) / 1e6 + " milliseconds.");
 startTime = System.nanoTime();
 boolean intersect2 = StringCodeV2.stringIntersect(a, b, 5);
 endTime = System.nanoTime();
 System.out.println("StringCodeV2.stringIntersect in " + 
 (endTime - startTime) / 1e6 + " milliseconds.");
 System.out.println("Algorithms agree: " + (intersect1 == intersect2));
 }
 private static void warmup() {
 Random random = new Random();
 for (int i = 0; i < 100; ++i) {
 String a = getRandomString(1000, random);
 String b = getRandomString(1000, random);
 StringCode.stringIntersect(a, b, 300);
 StringCodeV2.stringIntersect(a, b, 300);
 }
 }
 private static String getRandomString(int size, Random random) {
 StringBuilder sb = new StringBuilder(size);
 for (int i = 0; i < size; ++i) {
 sb.append((char)(random.nextInt(26) + 'A'));
 }
 return sb.toString();
 }
}

... and it yields sometimes as optimistic peformance figures as the following:


[STATUS] Warming up...
[STATUS] Warming up done!
Seed = 1465063085270
StringCode.stringIntersect in 344.266806 milliseconds.
StringCodeV2.stringIntersect in 17.754162 milliseconds.
Algorithms agree: true

Hope that helps.

Question 9

The complexity of the OP's algorithm is O((n1+n2)*k) where n1 and n2 are the string lengths, the complexity of yours is something like O(n1*n2*k) or O(n1*n2) (assuming most comparisons abort early). I guess with much much longer strings you'd be slower, despite the huge overhead of the OP's solution. I guess, rolling hash could beat both.

Question 10

No. The worst-case complexity of the OP's algorithm is \$\mathcal{O}((n_1 - k)(n_2 - k)k)\$; so is the complexity of my algorithm.

Question 11

You're right, the lower complexity holds for the improved version using sets correctly.

Tunaki Tunaki 9,3011 gold badge31 silver badges46 bronze badges · Accepted Answer · 2016-06-03 10:12:23Z

Don't add unnecessary pre-checks

Your method getSubstringsSet starts with two early returns:

if(aString.length() < lengthOfSubstring){
 return setOfSubstrings;
}
if(aString.length() == lengthOfSubstring){
 setOfSubstrings.add(aString);
 return setOfSubstrings;
}

You actually don't need them. The rest of the code handles both cases just fine and it doesn't add to clarity or improve performance. Early-returns should be used when the rest of the code can then assume a condition that was checked, making its code simpler; but it is not the case here.

Use built-in methods

To add the substrings, you are currently doing:

StringBuilder sb = new StringBuilder();
//add each substring of length (lengthOfSubstring) to the setOfSubstrings. 
for(int j = 0; j < lengthOfSubstring; j++){
 sb.append(charArray[i + j]);
}
setOfSubstrings.add(sb.toString());

So you're creating a new StringBuilder and then appending each character. This is a complicated way of retrieving the substring between two indices. It also allocates a new StringBuilder each time which isn't really needed.

You can just have

setOfSubstrings.add(aString.substring(i, i + lengthOfSubstring));

In the same way, stringIntersect is a lot more complicated than it should be.

First of all, you are using a label outerForLoop and then using it to break out of the inner loop with break outerForLoop;. This is generally not a good practice. What this shows is a missing method: what you really want is to make a method isIntersection(set1, set2) that would determine if the two sets have elements in common.

However, you don't even need to write one, since it is already built-in: Collections#disjoint. This method returns true if the two given collections are disjoint, i.e. have no elements in common. Therefore, you can remove your double for loop and just have:

boolean sameSubstringInBothStrings = !Collections.disjoint(setOne, setTwo);

it would not be linear time anyway. Comparing two string is o(n). Comparing m strings (what is the main issue of this strategy) is o(nm).
@rdllopes Hmm are you sure? Building the two sets is linear and the disjoint call is also linear since contains is constant-time for sets. There could be an issue with substring though.
Yes, the problem is with substring. Substring operation is O(n) (there is no hope here... substring is an arrayCopy , what is the fastest O(n) copy operation but yet O(n)) So building a set with all substrings is O(mn).
But also, hashCode calculation of a String is O(n) and same for equals. So Collections.disjoint, retains, etc rely on the hashCode and equals methods. Executing those operation for all elements is O(n^2).
Yes. Len is arbitrary value. It could be considered constant value or part of complexity. I think the question is considering the first interpretation and you are right.

Stack Exchange Network

Determining if two Strings have common subtrings of a given length in Java

2 Answers 2

Don't add unnecessary pre-checks

Use built-in methods

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Determining if two Strings have common subtrings of a given length in Java

2 Answers 2

Don't add unnecessary pre-checks

Use built-in methods

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions