This is a follow-up question for A Sub-string Extractor Implementation with Specific Keywords in Java. As AJNeufeld's answer mentioned, the test case getTargetString("Where the start is before here.", "start", "here");
will return ""
. I am going to fix this with the vocabulary word comparison technique in this version.
The basic rules are the same:
- The leading/trailing spaces in output sub-string is removed.
- If the given start keyword is an empty string, it means that the anchor is at the start of the input string. Otherwise, the first occurrence of the given start keyword is an start anchor. If there is no any occurrence of the given start keyword, the output is an empty string.
- If the given end keyword is an empty string, it means that the anchor is at the end of the input string. Otherwise, the first occurrence of the given end keyword is an end anchor. If there is no any occurrence of the given end keyword, the output is an empty string.
- If the location of start anchor is after than the location of end anchor, or a part of the first occurrence of the given start keyword and a part of the first occurrence of the given end keyword are overlapped, the output is an empty string.
The example input and output is listed as below.
Input String | Start Keyword | End Keyword | Output |
---|---|---|---|
"Where the start is before here." | "start" | "here" | "is before" |
"Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible" | ""(empty string) | ""(empty string) | "Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible" |
"Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible" | ""(empty string) | "dependencies" | "Java is a class-based, object-oriented programming language that is designed to have as few implementation" |
"Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible" | "Java" | ""(empty string) | "is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible" |
"Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible" | "Java" | "dependencies" | "is a class-based, object-oriented programming language that is designed to have as few implementation" |
"Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible" | "dependencies" | ""(empty string) | "as possible" |
"Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible" | ""(empty string) | "Java" | ""(empty string) |
"Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible" | "dependencies" | "Java" | ""(empty string) |
"Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible" | "ABC" | "" | ""(empty string) |
"Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible" | "" | "XYZ" | ""(empty string) |
"Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible" | "ABC" | "XYZ" | ""(empty string) |
"Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible" | "Java" | "Java" | ""(empty string) |
The experimental implementation
The experimental implementation is as below.
private static String getTargetString(String input, String startKeyword, String endKeyword)
{
ArrayList<String> parts = new ArrayList<>(
Arrays.asList(input.split(" ")));
if (!startKeyword.isEmpty() && parts
.stream()
.map(n -> n.replaceAll("[^a-zA-Z]+",""))
.collect(Collectors.toList())
.contains(startKeyword) == false)
return "";
if (!endKeyword.isEmpty() && parts
.stream()
.map(n -> n.replaceAll("[^a-zA-Z]+",""))
.collect(Collectors.toList())
.contains(endKeyword) == false)
return "";
int startIndex = -1;
int endIndex = parts.size();
// searching startIndex
int index = 0;
for(String element : parts) {
if (element.replaceAll("[^a-zA-Z]+","").equals(startKeyword))
{
startIndex = index;
}
index++;
}
// searching endIndex
index = 0;
for(String element : parts) {
if (element.replaceAll("[^a-zA-Z]+","").equals(endKeyword))
{
endIndex = index;
}
index++;
}
// generating output string
String output = "";
index = 0;
for(String element : parts) {
if (index > startIndex && index < endIndex)
{
output += (element + ' ');
}
index++;
}
return output.trim();
}
Test cases
String testString1 = "Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible";
System.out.println(getTargetString(testString1, "", ""));
System.out.println(getTargetString(testString1, "", "dependencies"));
System.out.println(getTargetString(testString1, "Java", ""));
System.out.println(getTargetString(testString1, "Java", "dependencies"));
System.out.println(getTargetString(testString1, "dependencies", ""));
System.out.println(getTargetString(testString1, "", "Java"));
System.out.println(getTargetString(testString1, "dependencies", "Java"));
System.out.println(getTargetString(testString1, "ABC", ""));
System.out.println(getTargetString(testString1, "", "XYZ"));
System.out.println(getTargetString(testString1, "ABC", "XYZ"));
System.out.println(getTargetString(testString1, "Java", "Java"));
System.out.println(
getTargetString("Where the start is before here.", "start", "here")
);
The output of the above test cases.
Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible
Java is a class-based, object-oriented programming language that is designed to have as few implementation
is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible
is a class-based, object-oriented programming language that is designed to have as few implementation
as possible
is before
All suggestions are welcome.
The summary information:
Which question it is a follow-up to?
A Sub-string Extractor Implementation with Specific Keywords in Java
What changes has been made in the code since last question?
Considering the mentioned holes, especially the case
getTargetString("Where the start is before here.", "start", "here");
, a possible fix is implemented here.Why a new review is being asked for?
If there is any possible improvement or any other potential defect, please let me know.
1 Answer 1
Edge case 1
String result = getTargetString("Java is class-based", "Java", "class-based");
I would expect the result to be "is", but it is empty. This is because replaceAll("[^a-zA-Z]+","")
removes the dash in class-based
so the keyword doesn't match.
Same for numbers:
String result = getTargetString("Java15 is out", "Java15", "out");
// result is ""
Edge Case 2
String result = getTargetString("hey - I am here", "hey", "here");
Here result
is "- I am" instead of "I am". Is it expected? The reason is that replaceAll
is not used in the last part of the method.
Splitting a string into words
In general, splitting a string into words is not a simple task because there are many edge cases. To reduce the complexity of your solution is important to define in your context what a "word" is and make some assumptions.
Programming to an interface
Try to use the most generic type to don't depend on subtypes. This concept is called Programming to an interface.
ArrayList<String> parts = new ArrayList<>(Arrays.asList(input.split(" ")));
Using List
it's enough:
List<String> parts = Arrays.asList(input.split(" "));
Simplify
Instead of calling replaceAll
four times, consider to parse the input at the beginning and reuse it across the method. It improves performance and readability.
List<String> parsedInput = parts.stream()
.map(n -> n.replaceAll("[^a-zA-Z]+", ""))
.collect(Collectors.toList());
if (!startKeyword.isEmpty() && !parsedInput.contains(startKeyword))
return "";
//...
Simplify 2
If the input is parsed at the beginning, then this part:
int startIndex = -1;
int endIndex = parts.size();
// searching startIndex
int index = 0;
for(String element : parts) {
if (element.replaceAll("[^a-zA-Z]+","").equals(startKeyword))
{
startIndex = index;
}
index++;
}
// searching endIndex
index = 0;
for(String element : parts) {
if (element.replaceAll("[^a-zA-Z]+","").equals(endKeyword))
{
endIndex = index;
}
index++;
}
can become:
int startIndex = parsedInput.lastIndexOf(startKeyword);
int endIndex = endKeyword.isEmpty() ? parts.size() : parsedInput.lastIndexOf(endKeyword);
Generating the output
String output = "";
index = 0;
for(String element : parts) {
if (index > startIndex && index < endIndex)
{
output += (element + ' ');
}
index++;
}
return output.trim();
Two suggestions about this part:
- Use
StringBuilder
to generateoutput
more efficiently - A regular for-loop seems to be a better fit since the start and end indices are known.
StringBuilder output = new StringBuilder();
for (int index = startIndex + 1; index < endIndex; index++) {
output.append(parts.get(index)).append(' ');
}
return output.toString().trim();
Input validation
If input
is null, the method throws NullPointerException
. Would be better to throw an IllegalArgumentException
with a friendlier message:
if (input == null) {
throw new IllegalArgumentException("input cannot be null.");
}
startKeyword
and endKeyword
also need to be validated when null
.
Testing
Testing with System.out.println
should be very limited if not avoided completely. The reason is that testing using the console is slow and error-prone. The code is tested visually by us every time, and with many tests it becomes unfeasible.
A good practice is to create unit tests. In Java, you can use JUnit. For example:
public class StringUtilTest {
@Test
public void baseTest() {
String actual = getTargetString("JUnit for testing", "JUnit", "testing");
assertEquals("for", actual);
}
@Test
public void ifStartKeywordNotContainedThenEmptyString(){
String actual = getTargetString("One example", "X", "Y");
assertEquals("", actual);
}
//... More tests
}
Alternative
Regex is very powerful, the whole method could be:
public static String getTargetString(String input, String startKeyword, String endKeyword){
Pattern pattern = Pattern.compile("(?<="+startKeyword+")(.*)(?="+endKeyword+")");
Matcher matcher = pattern.matcher(input);
return matcher.find() ? matcher.group(1).trim() : "";
}
-
1\$\begingroup\$ You really should
Pattern.quote
thestartKeyword
andendKeyword
to avoid pattern syntax errors. Also, you'd want to add\\b
's to the regex to avoid"here"
matching"Where"
. \$\endgroup\$AJNeufeld– AJNeufeld2021年01月17日 23:03:08 +00:00Commented Jan 17, 2021 at 23:03
getTargetString
be able to return the origin input string by a kind of start / end keyword pattern. A way comes up in my mind is to set ""(empty string) as the symbol that representing the anchor is at the start / end of the input string. \$\endgroup\$