A Sub-string Extractor Implementation with Given Vocabulary Words in Java

Question 1

This is a follow-up question for A Sub-string Extractor Implementation with Specific Keywords in Java. As AJNeufeld's answer mentioned, the test case getTargetString("Where the start is before here.", "start", "here"); will return "". I am going to fix this with the vocabulary word comparison technique in this version.

The basic rules are the same:

The leading/trailing spaces in output sub-string is removed.
If the given start keyword is an empty string, it means that the anchor is at the start of the input string. Otherwise, the first occurrence of the given start keyword is an start anchor. If there is no any occurrence of the given start keyword, the output is an empty string.
If the given end keyword is an empty string, it means that the anchor is at the end of the input string. Otherwise, the first occurrence of the given end keyword is an end anchor. If there is no any occurrence of the given end keyword, the output is an empty string.
If the location of start anchor is after than the location of end anchor, or a part of the first occurrence of the given start keyword and a part of the first occurrence of the given end keyword are overlapped, the output is an empty string.

The example input and output is listed as below.

Input String	Start Keyword	End Keyword	Output
"Where the start is before here."	"start"	"here"	"is before"
"Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible"	""(empty string)	""(empty string)	"Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible"
"Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible"	""(empty string)	"dependencies"	"Java is a class-based, object-oriented programming language that is designed to have as few implementation"
"Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible"	"Java"	""(empty string)	"is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible"
"Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible"	"Java"	"dependencies"	"is a class-based, object-oriented programming language that is designed to have as few implementation"
"Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible"	"dependencies"	""(empty string)	"as possible"
"Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible"	""(empty string)	"Java"	""(empty string)
"Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible"	"dependencies"	"Java"	""(empty string)
"Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible"	"ABC"	""	""(empty string)
"Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible"	""	"XYZ"	""(empty string)
"Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible"	"ABC"	"XYZ"	""(empty string)
"Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible"	"Java"	"Java"	""(empty string)

The experimental implementation

The experimental implementation is as below.

private static String getTargetString(String input, String startKeyword, String endKeyword)
{
 ArrayList<String> parts = new ArrayList<>(
 Arrays.asList(input.split(" ")));
 
 if (!startKeyword.isEmpty() && parts
 .stream()
 .map(n -> n.replaceAll("[^a-zA-Z]+",""))
 .collect(Collectors.toList())
 .contains(startKeyword) == false)
 return "";
 
 if (!endKeyword.isEmpty() && parts
 .stream()
 .map(n -> n.replaceAll("[^a-zA-Z]+",""))
 .collect(Collectors.toList())
 .contains(endKeyword) == false)
 return "";
 
 int startIndex = -1;
 int endIndex = parts.size();
 
 // searching startIndex
 int index = 0;
 for(String element : parts) {
 if (element.replaceAll("[^a-zA-Z]+","").equals(startKeyword))
 {
 startIndex = index;
 }
 index++;
 }
 
 // searching endIndex
 index = 0;
 for(String element : parts) {
 if (element.replaceAll("[^a-zA-Z]+","").equals(endKeyword))
 {
 endIndex = index;
 }
 index++;
 }
 
 // generating output string
 String output = "";
 index = 0;
 for(String element : parts) {
 if (index > startIndex && index < endIndex)
 {
 output += (element + ' ');
 }
 index++;
 }
 return output.trim();
}

Test cases

String testString1 = "Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible";
System.out.println(getTargetString(testString1, "", ""));
System.out.println(getTargetString(testString1, "", "dependencies"));
System.out.println(getTargetString(testString1, "Java", ""));
System.out.println(getTargetString(testString1, "Java", "dependencies"));
System.out.println(getTargetString(testString1, "dependencies", ""));
System.out.println(getTargetString(testString1, "", "Java"));
System.out.println(getTargetString(testString1, "dependencies", "Java"));
System.out.println(getTargetString(testString1, "ABC", ""));
System.out.println(getTargetString(testString1, "", "XYZ"));
System.out.println(getTargetString(testString1, "ABC", "XYZ"));
System.out.println(getTargetString(testString1, "Java", "Java"));
System.out.println(
 getTargetString("Where the start is before here.", "start", "here")
 );

The output of the above test cases.

Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible
Java is a class-based, object-oriented programming language that is designed to have as few implementation
is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible
is a class-based, object-oriented programming language that is designed to have as few implementation
as possible
is before

Coding Ground Link

All suggestions are welcome.

The summary information:

Which question it is a follow-up to?

A Sub-string Extractor Implementation with Specific Keywords in Java
What changes has been made in the code since last question?

Considering the mentioned holes, especially the case getTargetString("Where the start is before here.", "start", "here");, a possible fix is implemented here.
Why a new review is being asked for?

If there is any possible improvement or any other potential defect, please let me know.

Question 2

@TorbenPutkonen For some reason, I want to make the method getTargetString be able to return the origin input string by a kind of start / end keyword pattern. A way comes up in my mind is to set ""(empty string) as the symbol that representing the anchor is at the start / end of the input string.

Question 3

Edge case 1

String result = getTargetString("Java is class-based", "Java", "class-based");

I would expect the result to be "is", but it is empty. This is because replaceAll("[^a-zA-Z]+","") removes the dash in class-based so the keyword doesn't match.

Same for numbers:

String result = getTargetString("Java15 is out", "Java15", "out");
// result is ""

Edge Case 2

String result = getTargetString("hey - I am here", "hey", "here");

Here result is "- I am" instead of "I am". Is it expected? The reason is that replaceAll is not used in the last part of the method.

Splitting a string into words

In general, splitting a string into words is not a simple task because there are many edge cases. To reduce the complexity of your solution is important to define in your context what a "word" is and make some assumptions.

Programming to an interface

Try to use the most generic type to don't depend on subtypes. This concept is called Programming to an interface.

ArrayList<String> parts = new ArrayList<>(Arrays.asList(input.split(" ")));

Using List it's enough:

List<String> parts = Arrays.asList(input.split(" "));

Simplify

Instead of calling replaceAll four times, consider to parse the input at the beginning and reuse it across the method. It improves performance and readability.

List<String> parsedInput = parts.stream()
 .map(n -> n.replaceAll("[^a-zA-Z]+", ""))
 .collect(Collectors.toList());
if (!startKeyword.isEmpty() && !parsedInput.contains(startKeyword))
 return "";
//...

Simplify 2

If the input is parsed at the beginning, then this part:

int startIndex = -1;
int endIndex = parts.size();
// searching startIndex
int index = 0;
for(String element : parts) {
 if (element.replaceAll("[^a-zA-Z]+","").equals(startKeyword))
 {
 startIndex = index;
 }
 index++;
}
// searching endIndex
index = 0;
for(String element : parts) {
 if (element.replaceAll("[^a-zA-Z]+","").equals(endKeyword))
 {
 endIndex = index;
 }
 index++;
}

can become:

int startIndex = parsedInput.lastIndexOf(startKeyword);
int endIndex = endKeyword.isEmpty() ? parts.size() : parsedInput.lastIndexOf(endKeyword);

Generating the output

String output = "";
index = 0;
for(String element : parts) {
 if (index > startIndex && index < endIndex)
 {
 output += (element + ' ');
 }
 index++;
}
return output.trim();

Two suggestions about this part:

Use StringBuilder to generate output more efficiently
A regular for-loop seems to be a better fit since the start and end indices are known.

StringBuilder output = new StringBuilder();
for (int index = startIndex + 1; index < endIndex; index++) {
 output.append(parts.get(index)).append(' ');
}
return output.toString().trim();

Input validation

If input is null, the method throws NullPointerException. Would be better to throw an IllegalArgumentException with a friendlier message:

if (input == null) {
 throw new IllegalArgumentException("input cannot be null.");
}

startKeyword and endKeyword also need to be validated when null.

Testing

Testing with System.out.println should be very limited if not avoided completely. The reason is that testing using the console is slow and error-prone. The code is tested visually by us every time, and with many tests it becomes unfeasible.

A good practice is to create unit tests. In Java, you can use JUnit. For example:

public class StringUtilTest {
 @Test
 public void baseTest() {
 String actual = getTargetString("JUnit for testing", "JUnit", "testing");
 assertEquals("for", actual);
 }
 @Test
 public void ifStartKeywordNotContainedThenEmptyString(){
 String actual = getTargetString("One example", "X", "Y");
 assertEquals("", actual);
 }
 //... More tests
}

Alternative

Regex is very powerful, the whole method could be:

public static String getTargetString(String input, String startKeyword, String endKeyword){
 Pattern pattern = Pattern.compile("(?<="+startKeyword+")(.*)(?="+endKeyword+")");
 Matcher matcher = pattern.matcher(input);
 return matcher.find() ? matcher.group(1).trim() : "";
}

Question 4

You really should Pattern.quote the startKeyword and endKeyword to avoid pattern syntax errors. Also, you'd want to add \\b's to the regex to avoid "here" matching "Where".

Marc MarcMarc 5,7242 gold badges15 silver badges35 bronze badges · Accepted Answer · 2021-01-17 06:14:11Z

Edge case 1

String result = getTargetString("Java is class-based", "Java", "class-based");

I would expect the result to be "is", but it is empty. This is because replaceAll("[^a-zA-Z]+","") removes the dash in class-based so the keyword doesn't match.

Same for numbers:

String result = getTargetString("Java15 is out", "Java15", "out");
// result is ""

Edge Case 2

String result = getTargetString("hey - I am here", "hey", "here");

Here result is "- I am" instead of "I am". Is it expected? The reason is that replaceAll is not used in the last part of the method.

Splitting a string into words

In general, splitting a string into words is not a simple task because there are many edge cases. To reduce the complexity of your solution is important to define in your context what a "word" is and make some assumptions.

Programming to an interface

Try to use the most generic type to don't depend on subtypes. This concept is called Programming to an interface.

ArrayList<String> parts = new ArrayList<>(Arrays.asList(input.split(" ")));

Using List it's enough:

List<String> parts = Arrays.asList(input.split(" "));

Simplify

Instead of calling replaceAll four times, consider to parse the input at the beginning and reuse it across the method. It improves performance and readability.

List<String> parsedInput = parts.stream()
 .map(n -> n.replaceAll("[^a-zA-Z]+", ""))
 .collect(Collectors.toList());
if (!startKeyword.isEmpty() && !parsedInput.contains(startKeyword))
 return "";
//...

Simplify 2

If the input is parsed at the beginning, then this part:

int startIndex = -1;
int endIndex = parts.size();
// searching startIndex
int index = 0;
for(String element : parts) {
 if (element.replaceAll("[^a-zA-Z]+","").equals(startKeyword))
 {
 startIndex = index;
 }
 index++;
}
// searching endIndex
index = 0;
for(String element : parts) {
 if (element.replaceAll("[^a-zA-Z]+","").equals(endKeyword))
 {
 endIndex = index;
 }
 index++;
}

can become:

int startIndex = parsedInput.lastIndexOf(startKeyword);
int endIndex = endKeyword.isEmpty() ? parts.size() : parsedInput.lastIndexOf(endKeyword);

Generating the output

String output = "";
index = 0;
for(String element : parts) {
 if (index > startIndex && index < endIndex)
 {
 output += (element + ' ');
 }
 index++;
}
return output.trim();

Two suggestions about this part:

Use StringBuilder to generate output more efficiently
A regular for-loop seems to be a better fit since the start and end indices are known.

StringBuilder output = new StringBuilder();
for (int index = startIndex + 1; index < endIndex; index++) {
 output.append(parts.get(index)).append(' ');
}
return output.toString().trim();

Input validation

If input is null, the method throws NullPointerException. Would be better to throw an IllegalArgumentException with a friendlier message:

if (input == null) {
 throw new IllegalArgumentException("input cannot be null.");
}

startKeyword and endKeyword also need to be validated when null.

Testing

Testing with System.out.println should be very limited if not avoided completely. The reason is that testing using the console is slow and error-prone. The code is tested visually by us every time, and with many tests it becomes unfeasible.

A good practice is to create unit tests. In Java, you can use JUnit. For example:

public class StringUtilTest {
 @Test
 public void baseTest() {
 String actual = getTargetString("JUnit for testing", "JUnit", "testing");
 assertEquals("for", actual);
 }
 @Test
 public void ifStartKeywordNotContainedThenEmptyString(){
 String actual = getTargetString("One example", "X", "Y");
 assertEquals("", actual);
 }
 //... More tests
}

Alternative

Regex is very powerful, the whole method could be:

public static String getTargetString(String input, String startKeyword, String endKeyword){
 Pattern pattern = Pattern.compile("(?<="+startKeyword+")(.*)(?="+endKeyword+")");
 Matcher matcher = pattern.matcher(input);
 return matcher.find() ? matcher.group(1).trim() : "";
}

You really should Pattern.quote the startKeyword and endKeyword to avoid pattern syntax errors. Also, you'd want to add \\b's to the regex to avoid "here" matching "Where".

Stack Exchange Network

A Sub-string Extractor Implementation with Given Vocabulary Words in Java

1 Answer 1

Edge case 1

Edge Case 2

Splitting a string into words

Programming to an interface

Simplify

Simplify 2

Generating the output

Input validation

Testing

Alternative

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

A Sub-string Extractor Implementation with Given Vocabulary Words in Java

1 Answer 1

Edge case 1

Edge Case 2

Splitting a string into words

Programming to an interface

Simplify

Simplify 2

Generating the output

Input validation

Testing

Alternative

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions