I am trying to make an algorithm that will find the index of a given substring (which I have labeled pattern
) in a given String
(which I have labeled text
). This method can be compared to String.indexOf(String str)
. In addition to general feedback, I'm curious what the time complexity of my method is, and would appreciate if someone can help me figure it out.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Set;
public class SubstringSearch {
public static void main(String[] args) {
String text = "This is a test String";
String pattern = "is a";
int index = search(pattern, text);
System.out.println(index);
}
private static int search(String pattern, String text) {
int patternLength = pattern.length();
int textLength = text.length();
Set<Integer> set = new HashSet<>(textLength);
List<Integer> addList = new ArrayList<>(textLength);
for (int i = 0; i < textLength; i++) {
set.add(0);
for (Integer index : set) {
if (pattern.charAt(index) == text.charAt(i)) {
addList.add(index + 1);
}
}
set.clear();
set.addAll(addList);
if (set.contains(patternLength)) {
return i - patternLength + 1;
}
}
throw new NoSuchElementException();
}
}
1 Answer 1
Nice implementation, my suggestions:
- Bug:
text="aa aaa"
andpattern="aaa"
returns 1 instead of 3. Whether to handle this edge case depends on your use case, but iftext
can be any string then it should be considered. - Input validation: if
pattern
is empty or null, a null pointer exception is thrown. You can fix it by adding a condition at the beginning of the method. - Naming: the method name
search
is too general. Consider a more specific one, likefirstIndexOf
or similar. The variable namesset
andaddList
also can be improved. - Arguments order: this is a personal opinion, but I would put the string being searched (
text
) before the string to search for (pattern
). - Exception if not found: for such a function is common to return
-1
instead of an exception when thepattern
is not found. In doubt, document the method. - Testing: this method deserves more than one test, possibly using JUnit.
- Complexity: the complexity seems to be \$O(textLenght*patternLength)\$. The outer for-loop iterates on
text
, while the inner for-loop won't do more thanpatternLength
iterations otherwisepattern.charAt(index)
will throw an exception. The time complexity ofString#indexOf
is \$O(n*m)\$ so the main difference to your solution regarding complexity (worst case) is probably the space complexity. Anyway, it's better to focus on the correct functionality first and performance later.
As a reference this is the native implementation:
static int indexOf(char[] source, int sourceOffset, int sourceCount,
char[] target, int targetOffset, int targetCount,
int fromIndex) {
if (fromIndex >= sourceCount) {
return (targetCount == 0 ? sourceCount : -1);
}
if (fromIndex < 0) {
fromIndex = 0;
}
if (targetCount == 0) {
return fromIndex;
}
char first = target[targetOffset];
int max = sourceOffset + (sourceCount - targetCount);
for (int i = sourceOffset + fromIndex; i <= max; i++) {
/* Look for first character. */
if (source[i] != first) {
while (++i <= max && source[i] != first);
}
/* Found first character, now look at the rest of v2 */
if (i <= max) {
int j = i + 1;
int end = j + targetCount - 1;
for (int k = targetOffset + 1; j < end && source[j]
== target[k]; j++, k++);
if (j == end) {
/* Found whole string. */
return i - sourceOffset;
}
}
}
return -1;
}
addList
norset
. Hint 2: the complexity is \$ \mathcal{O}(n^2) \$ \$\endgroup\$Set
and theList
? \$\endgroup\$pattern[i] == text[i] && pattern[i + pattern.length - 1] == text[i + text.length - 1]
using a for conditioni < pattern.length / 2
. It is less probable for human words to coincide in such a way more than 3 end-to-end comparisons with equal lenght. This is for single mathch not substrings, still you may extend the reasoning. Respect to the complexity of this new algorithm it could be in average \$O(n)\$ and in worst case \$O(n^2)\$ \$\endgroup\$