Problem:
Given 2 strings, consider all the substrings within them of length len. Len will be 1 or more. Returns true if there are any such substrings which appear in both strings. Compute this in linear time using a HashSet.
I know as a matter of fact that there are many areas upon which my code could be improved. However, rather than asking other people to correct my code, I would much prefer to be given hints or general principles as to how my code could be improved. I think this way, me, and everyone that read this post could get more out of this exercise this way.
So could someone please provide me with some comments or suggestions? Critics of any level and kind are welcomed! Feel free to pick apart my code!
Class:
import java.util.HashSet;
import java.util.Set;
// CS108 HW1 -- String static methods
public class StringCode {
public static boolean stringIntersect(String firstString, String secondString, int lengthOfSubstring) {
boolean sameSubstringInBothStrings = false;
Set<String> setOne = StringCode.getSubstringsSet(firstString, lengthOfSubstring);
Set<String> setTwo = StringCode.getSubstringsSet(secondString, lengthOfSubstring);
//compare setOne and setTwo to find out if there is any matching elements
outerForLoop:
for(String aSubstringInSetOne : setOne){
for(String aSubstringInSetTwo : setTwo){
if(aSubstringInSetOne.equals(aSubstringInSetTwo)){
sameSubstringInBothStrings = true;
break outerForLoop;
}
}
}
return sameSubstringInBothStrings;
}
static Set<String> getSubstringsSet(String aString, int lengthOfSubstring){
Set<String> setOfSubstrings = new HashSet<>();
if(aString.length() < lengthOfSubstring){
return setOfSubstrings;
}
if(aString.length() == lengthOfSubstring){
setOfSubstrings.add(aString);
return setOfSubstrings;
}
char[] charArray = aString.toCharArray();
//starting from the first index, going through aString
for(int i = 0; i <= aString.length() - lengthOfSubstring; i++){
StringBuilder sb = new StringBuilder();
//add each substring of length (lengthOfSubstring) to the setOfSubstrings.
for(int j = 0; j < lengthOfSubstring; j++){
sb.append(charArray[i + j]);
}
setOfSubstrings.add(sb.toString());
}
return setOfSubstrings;
}
}
Test:
import assign1.StringCode;
import static org.junit.Assert.*;
import org.junit.Test;
public class StringCodeTest {
@Test
public void testStringIntersectNormalStrings(){
assertEquals(true, StringCode.stringIntersect("abcdickefg" , "zyxqdick", 4));
}
@Test
public void testStringIntersectNormalStrings2(){
assertEquals(false, StringCode.stringIntersect("abcdef", "zzzzzz", 1));
}
@Test
public void testEmptyString(){
assertEquals(false, StringCode.stringIntersect("abc", "", 2));
}
@Test
public void testLengthOfSubstringGreaterThanTheStringItself(){
assertEquals(false, StringCode.stringIntersect("abc", "xyz", 5));
}
@Test
public void testTwoEmptyStrings(){
assertEquals(false, StringCode.stringIntersect("", "", 2));
}
}
2 Answers 2
Don't add unnecessary pre-checks
Your method getSubstringsSet
starts with two early returns:
if(aString.length() < lengthOfSubstring){
return setOfSubstrings;
}
if(aString.length() == lengthOfSubstring){
setOfSubstrings.add(aString);
return setOfSubstrings;
}
You actually don't need them. The rest of the code handles both cases just fine and it doesn't add to clarity or improve performance. Early-returns should be used when the rest of the code can then assume a condition that was checked, making its code simpler; but it is not the case here.
Use built-in methods
To add the substrings, you are currently doing:
StringBuilder sb = new StringBuilder();
//add each substring of length (lengthOfSubstring) to the setOfSubstrings.
for(int j = 0; j < lengthOfSubstring; j++){
sb.append(charArray[i + j]);
}
setOfSubstrings.add(sb.toString());
So you're creating a new StringBuilder
and then appending each character. This is a complicated way of retrieving the substring
between two indices. It also allocates a new StringBuilder
each time which isn't really needed.
You can just have
setOfSubstrings.add(aString.substring(i, i + lengthOfSubstring));
In the same way, stringIntersect
is a lot more complicated than it should be.
First of all, you are using a label outerForLoop
and then using it to break out of the inner loop with break outerForLoop;
. This is generally not a good practice. What this shows is a missing method: what you really want is to make a method isIntersection(set1, set2)
that would determine if the two sets have elements in common.
However, you don't even need to write one, since it is already built-in: Collections#disjoint
. This method returns true
if the two given collections are disjoint, i.e. have no elements in common. Therefore, you can remove your double for loop and just have:
boolean sameSubstringInBothStrings = !Collections.disjoint(setOne, setTwo);
-
\$\begingroup\$ it would not be linear time anyway. Comparing two string is o(n). Comparing m strings (what is the main issue of this strategy) is o(nm). \$\endgroup\$rdllopes– rdllopes2016年06月03日 11:59:57 +00:00Commented Jun 3, 2016 at 11:59
-
\$\begingroup\$ @rdllopes Hmm are you sure? Building the two sets is linear and the disjoint call is also linear since
contains
is constant-time for sets. There could be an issue withsubstring
though. \$\endgroup\$Tunaki– Tunaki2016年06月03日 12:20:09 +00:00Commented Jun 3, 2016 at 12:20 -
\$\begingroup\$ Yes, the problem is with substring. Substring operation is O(n) (there is no hope here... substring is an arrayCopy , what is the fastest O(n) copy operation but yet O(n)) So building a set with all substrings is O(mn). \$\endgroup\$rdllopes– rdllopes2016年06月03日 13:22:44 +00:00Commented Jun 3, 2016 at 13:22
-
\$\begingroup\$ But also, hashCode calculation of a String is O(n) and same for equals. So Collections.disjoint, retains, etc rely on the hashCode and equals methods. Executing those operation for all elements is O(n^2). \$\endgroup\$rdllopes– rdllopes2016年06月03日 13:27:05 +00:00Commented Jun 3, 2016 at 13:27
-
1\$\begingroup\$ Yes. Len is arbitrary value. It could be considered constant value or part of complexity. I think the question is considering the first interpretation and you are right. \$\endgroup\$rdllopes– rdllopes2016年06月03日 14:18:07 +00:00Commented Jun 3, 2016 at 14:18
1
What comes to stringIntersect
, you don't need to use sameSubstringInBothStrings
. Instead, just return true
or false
as soon as you know the result of the method:
public static boolean stringIntersect(String firstString, String secondString, int lengthOfSubstring) {
boolean sameSubstringInBothStrings = false;
Set<String> setOne = StringCode.getSubstringsSet(firstString, lengthOfSubstring);
Set<String> setTwo = StringCode.getSubstringsSet(secondString, lengthOfSubstring);
//compare setOne and setTwo to find out if there is any matching elements
outerForLoop:
for(String aSubstringInSetOne : setOne){
for(String aSubstringInSetTwo : setTwo){
if(aSubstringInSetOne.equals(aSubstringInSetTwo)){
return true;
}
}
}
return false;
}
2 Profile when in doubt
I don't see why your Set
kicking preprocessing would be any better than a dumb brute force solution to the problem. I had this in mind...
import java.util.Random;
public class StringCodeV2 {
public static boolean stringIntersect(String firstString,
String secondString,
int k) {
checkSubstringSize(k);
final char[] firstStringChars = firstString.toCharArray();
final char[] secondStringChars = secondString.toCharArray();
for (int firstStringStartIndex = 0;
firstStringStartIndex < firstStringChars.length - k + 1;
firstStringStartIndex++) {
for (int secondStringStartIndex = 0;
secondStringStartIndex < secondStringChars.length - k + 1;
secondStringStartIndex++) {
int charsRead = 0;
while (firstStringStartIndex + charsRead <
firstStringChars.length
&&
secondStringStartIndex + charsRead <
secondStringChars.length) {
if (firstStringChars[firstStringStartIndex + charsRead] ==
secondStringChars[secondStringStartIndex + charsRead]) {
charsRead++;
if (charsRead == k) {
return true;
}
} else {
break;
}
}
}
}
return false;
}
private static void checkSubstringSize(int k) {
if (k < 1) {
throw new IllegalArgumentException(
"The length of the substring is less than 1.");
}
}
public static void main(String[] args) {
System.out.println("[STATUS] Warming up...");
warmup();
System.out.println("[STATUS] Warming up done!");
long seed = System.currentTimeMillis();
Random random = new Random(seed);
String a = getRandomString(5000, random);
String b = getRandomString(5000, random);
System.out.println("Seed = " + seed);
long startTime = System.nanoTime();
boolean intersect1 = StringCode.stringIntersect(a, b, 5);
long endTime = System.nanoTime();
System.out.println("StringCode.stringIntersect in " +
(endTime - startTime) / 1e6 + " milliseconds.");
startTime = System.nanoTime();
boolean intersect2 = StringCodeV2.stringIntersect(a, b, 5);
endTime = System.nanoTime();
System.out.println("StringCodeV2.stringIntersect in " +
(endTime - startTime) / 1e6 + " milliseconds.");
System.out.println("Algorithms agree: " + (intersect1 == intersect2));
}
private static void warmup() {
Random random = new Random();
for (int i = 0; i < 100; ++i) {
String a = getRandomString(1000, random);
String b = getRandomString(1000, random);
StringCode.stringIntersect(a, b, 300);
StringCodeV2.stringIntersect(a, b, 300);
}
}
private static String getRandomString(int size, Random random) {
StringBuilder sb = new StringBuilder(size);
for (int i = 0; i < size; ++i) {
sb.append((char)(random.nextInt(26) + 'A'));
}
return sb.toString();
}
}
... and it yields sometimes as optimistic peformance figures as the following:
[STATUS] Warming up... [STATUS] Warming up done! Seed = 1465063085270 StringCode.stringIntersect in 344.266806 milliseconds. StringCodeV2.stringIntersect in 17.754162 milliseconds. Algorithms agree: true
Hope that helps.
-
\$\begingroup\$ The complexity of the OP's algorithm is
O((n1+n2)*k)
wheren1
andn2
are the string lengths, the complexity of yours is something likeO(n1*n2*k)
orO(n1*n2)
(assuming most comparisons abort early). I guess with much much longer strings you'd be slower, despite the huge overhead of the OP's solution. I guess, rolling hash could beat both. \$\endgroup\$maaartinus– maaartinus2016年06月05日 04:07:47 +00:00Commented Jun 5, 2016 at 4:07 -
\$\begingroup\$ No. The worst-case complexity of the OP's algorithm is \$\mathcal{O}((n_1 - k)(n_2 - k)k)\$; so is the complexity of my algorithm. \$\endgroup\$coderodde– coderodde2016年06月05日 04:23:43 +00:00Commented Jun 5, 2016 at 4:23
-
\$\begingroup\$ You're right, the lower complexity holds for the improved version using sets correctly. \$\endgroup\$maaartinus– maaartinus2016年06月05日 13:15:16 +00:00Commented Jun 5, 2016 at 13:15