Substring search in Java

Question 1

I am trying to make an algorithm that will find the index of a given substring (which I have labeled pattern) in a given String (which I have labeled text). This method can be compared to String.indexOf(String str). In addition to general feedback, I'm curious what the time complexity of my method is, and would appreciate if someone can help me figure it out.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Set;

public class SubstringSearch {
    public static void main(String[] args) {
        String text = "This is a test String";
        String pattern = "is a";
        int index = search(pattern, text);
        System.out.println(index);
    }

    private static int search(String pattern, String text) {
        int patternLength = pattern.length();
        int textLength = text.length();
        Set<Integer> set = new HashSet<>(textLength);
        List<Integer> addList = new ArrayList<>(textLength);
        for (int i = 0; i < textLength; i++) {
            set.add(0);
            for (Integer index : set) {
                if (pattern.charAt(index) == text.charAt(i)) {
                    addList.add(index + 1);
                }
            }
            set.clear();
            set.addAll(addList);
            if (set.contains(patternLength)) {
                return i - patternLength + 1;
            }
        }
        throw new NoSuchElementException();
    }
}

Question 2

Hint 1: You need neither addList nor set. Hint 2: the complexity is \$\mathcal{O}(n^2)\$.
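One possible reading of the first hint is a plain two-loop scan that keeps no collections at all. The sketch below assumes that is what was meant; the class name NaiveSearch, the method name firstIndexOf, and returning -1 when nothing is found (as String.indexOf does, instead of the original exception) are illustrative choices, not part of the original post.

```java
final class NaiveSearch {
    // Returns the index of the first occurrence of pattern in text, or -1 if absent.
    static int firstIndexOf(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        // Try every start position that leaves room for the whole pattern.
        for (int start = 0; start <= n - m; start++) {
            int matched = 0;
            // Extend the match character by character until a mismatch or the pattern ends.
            while (matched < m && text.charAt(start + matched) == pattern.charAt(matched)) {
                matched++;
            }
            if (matched == m) {
                return start;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstIndexOf("This is a test String", "is a")); // prints 5
    }
}
```

With n the text length and m the pattern length, the inner loop runs at most m times per start position, so the worst case is on the order of n*m character comparisons and the extra space is constant.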

Question 3

@MiguelAvila I’ve done a bit of research and found the Knuth–Morris–Pratt algorithm. Is that what you were referring to, or is there a more trivial way to do it that still avoids using the Set and the List?
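For readers who have not seen it, a compact sketch of Knuth-Morris-Pratt follows. It is a textbook-style version, not code from this thread; the names KmpSearch, buildFailureTable and firstIndexOf are made up for illustration, and returning -1 for "not found" is an assumption borrowed from String.indexOf.

```java
final class KmpSearch {
    // failure[i] = length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of pattern[0..i].
    static int[] buildFailureTable(String pattern) {
        int[] failure = new int[pattern.length()];
        int k = 0; // length of the prefix currently being extended
        for (int i = 1; i < pattern.length(); i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) {
                k = failure[k - 1]; // fall back to the next shorter border
            }
            if (pattern.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            failure[i] = k;
        }
        return failure;
    }

    // Returns the index of the first occurrence of pattern in text, or -1 if absent.
    static int firstIndexOf(String text, String pattern) {
        if (pattern.isEmpty()) {
            return 0;
        }
        int[] failure = buildFailureTable(pattern);
        int k = 0; // number of pattern characters matched so far
        for (int i = 0; i < text.length(); i++) {
            while (k > 0 && text.charAt(i) != pattern.charAt(k)) {
                k = failure[k - 1]; // reuse the part of the match that is still valid
            }
            if (text.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            if (k == pattern.length()) {
                return i - k + 1;
            }
        }
        return -1;
    }
}
```

The text index i never moves backwards, which is where the linear bound in the text length plus the pattern length comes from. The only extra memory is the failure table, one int per pattern character.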

Question 4

You can do it with a statistical approach. Suppose two strings can be compared in average time \$O(1)\$ and worst-case time \$O(n)\$. How is that possible? Compare from both ends at once: check pattern[i] == text[i] && pattern[pattern.length - 1 - i] == text[text.length - 1 - i] in a for loop with the condition i < (pattern.length + 1) / 2, so that the middle character of an odd-length pattern is checked too. For different human words of equal length, it is unlikely that more than about 3 such end-to-end comparisons succeed. This is for a single match, not substrings, but you can extend the reasoning. As for the complexity of this new algorithm, it could be \$O(n)\$ on average and \$O(n^2)\$ in the worst case.
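A rough sketch of how that end-to-end comparison might be extended to substring search. Everything here is an assumption layered on the comment: the names EndToEndSearch, windowMatches and firstIndexOf are hypothetical, and the loop bound (m + 1) / 2 is used so that the middle character of an odd-length pattern is also compared.

```java
final class EndToEndSearch {
    // Checks whether pattern occurs in text at offset start, comparing the window
    // from both ends toward the middle so that a mismatch at either end is found early.
    static boolean windowMatches(String text, int start, String pattern) {
        int m = pattern.length();
        for (int i = 0; i < (m + 1) / 2; i++) {
            int front = i;
            int back = m - 1 - i;
            if (pattern.charAt(front) != text.charAt(start + front)
                    || pattern.charAt(back) != text.charAt(start + back)) {
                return false;
            }
        }
        return true;
    }

    // Returns the index of the first occurrence of pattern in text, or -1 if absent.
    static int firstIndexOf(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        for (int start = 0; start <= n - m; start++) {
            if (windowMatches(text, start, pattern)) {
                return start;
            }
        }
        return -1;
    }
}
```

This does not change the worst case: a text and pattern made of one repeated character still force a full comparison at almost every offset. It only changes which mismatches are detected first.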

Question 5

The Knuth-Morris-Pratt algorithm stems from a time when a character set on a computer had 256 characters. Back then, it was an interesting approach. But today, as we are working with Unicode (about 143,000 characters), KMP is outdated. Don't use it.

Question 6

Nice implementation, my suggestions:

Bug: text="aa aaa" and pattern="aaa" returns 1 instead of 3. Whether to handle this edge case depends on your use case, but if text can be any string, it should be handled.
Input validation: if pattern is null, a NullPointerException is thrown; if pattern is empty and text is not, pattern.charAt(0) throws a StringIndexOutOfBoundsException. You can fix both by adding a condition at the beginning of the method.
Naming: the method name search is too general. Consider a more specific one, like firstIndexOf or similar. The variable names set and addList can also be improved.
Arguments order: this is a personal opinion, but I would put the string being searched (text) before the string to search for (pattern).
Exception if not found: for such a function it is common to return -1 instead of throwing an exception when the pattern is not found. When in doubt, document the method.
Testing: this method deserves more than one test, possibly using JUnit.
Complexity: the complexity seems to be \$O(textLength*patternLength)\$. The outer for-loop iterates over text, while the inner for-loop won't do more than patternLength iterations; otherwise pattern.charAt(index) would throw an exception. The worst-case time complexity of String#indexOf is \$O(n*m)\$, so the main difference from your solution regarding (worst-case) complexity is probably the space complexity. Anyway, it's better to focus on correct functionality first and performance later.
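To make the points above concrete, here is one way the original approach could be revised while keeping its idea of tracking partial matches in a set. This is a sketch, not the only fix: the wrong result for "aa aaa" appears to come from addList never being cleared between iterations, so this version builds a fresh set on every step. The names SubstringSearchRevised, firstIndexOf, matchedLengths and next are illustrative, and returning 0 for an empty pattern is an assumption that mirrors String.indexOf.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

final class SubstringSearchRevised {
    // Returns the index of the first occurrence of pattern in text, or -1 if absent.
    static int firstIndexOf(String text, String pattern) {
        Objects.requireNonNull(text, "text");
        Objects.requireNonNull(pattern, "pattern");
        int patternLength = pattern.length();
        if (patternLength == 0) {
            return 0; // assumed convention, same as "abc".indexOf("")
        }
        // Each element is the length of a partial match that ends just before text[i].
        Set<Integer> matchedLengths = new HashSet<>();
        for (int i = 0; i < text.length(); i++) {
            matchedLengths.add(0); // a new match may start at every position
            Set<Integer> next = new HashSet<>(); // fresh on each step, so stale matches are dropped
            for (int matched : matchedLengths) {
                if (pattern.charAt(matched) == text.charAt(i)) {
                    next.add(matched + 1);
                }
            }
            if (next.contains(patternLength)) {
                return i - patternLength + 1;
            }
            matchedLengths = next;
        }
        return -1;
    }
}
```

A few checks worth turning into unit tests: firstIndexOf("aa aaa", "aaa") should give 3, firstIndexOf("This is a test String", "is a") should give 5, and a pattern longer than the text should give -1.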

As a reference, this is the implementation behind String#indexOf as found in the Java 8 JDK source (plain Java, not a native method):

static int indexOf(char[] source, int sourceOffset, int sourceCount,
        char[] target, int targetOffset, int targetCount,
        int fromIndex) {
    if (fromIndex >= sourceCount) {
        return (targetCount == 0 ? sourceCount : -1);
    }
    if (fromIndex < 0) {
        fromIndex = 0;
    }
    if (targetCount == 0) {
        return fromIndex;
    }

    char first = target[targetOffset];
    int max = sourceOffset + (sourceCount - targetCount);

    for (int i = sourceOffset + fromIndex; i <= max; i++) {
        /* Look for first character. */
        if (source[i] != first) {
            while (++i <= max && source[i] != first);
        }

        /* Found first character, now look at the rest of v2 */
        if (i <= max) {
            int j = i + 1;
            int end = j + targetCount - 1;
            for (int k = targetOffset + 1; j < end && source[j]
                    == target[k]; j++, k++);

            if (j == end) {
                /* Found whole string. */
                return i - sourceOffset;
            }
        }
    }
    return -1;
}
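The early-return branches at the top of that listing correspond to behaviour you can observe through the public String#indexOf methods. A small check, assuming a standard JDK; the class name IndexOfEdgeCases is only illustrative:

```java
final class IndexOfEdgeCases {
    public static void main(String[] args) {
        // Empty target: the result is fromIndex, capped at the source length.
        System.out.println("abc".indexOf(""));       // prints 0
        System.out.println("abc".indexOf("", 5));    // prints 3
        // Negative fromIndex is treated as 0.
        System.out.println("abc".indexOf("c", -4));  // prints 2
        // fromIndex at or past the end with a non-empty target: not found.
        System.out.println("abc".indexOf("a", 3));   // prints -1
        // Target longer than the source: not found.
        System.out.println("abc".indexOf("abcd"));   // prints -1
    }
}
```

A custom firstIndexOf would need to decide whether to follow the same conventions, in particular for the empty pattern.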

Marc · Accepted Answer · 2021-01-25 07:45:36Z

