Redacting everything in a string except a particular word

Question 1

I am working through the Coding Bat exercises for Java. Here is one I have just completed:

Given a string and a non-empty word string, return a version of the original String where all chars have been replaced by pluses ("+"), except for appearances of the word string which are preserved unchanged.

And here is my solution:

public String plusOut(String str, String word){
 String s = "";
 String n = str + " ";
 int i = 0;
 while(i < str.length()) {
 if (!(n.substring(i, i + word.length()).equals(word))) {
 s += "+";
 i++;
 } else {
 s += word;
 i += word.length();
 }
 }
 return s;
}

As you can see, I concatenated two spaces on the original string and used that to test the substrings. This was to avoid out of bounds exceptions.

What I want to do is optimise my solution, because I don't like how I added two spaces just to avoid errors. I want to change the if statement so that it tests the substrings of str rather than n, and get rid of n entirely. I have tried but I can't seem to find a way of doing it. How can I do this? What else needs optimising?

Question 2

Not a solution, but a hint is to check the word length before doing the substring. Your issue seems to be when i approaches the end of str, then you need to ensure that you don't attempt to substring past the end. Post back any attempts or success, and we can help push in the right direction.

Question 3

Manipulating Strings directly is a bad idea, especially when you are doing it repeatedly. Java strings are immutable, so every time you do s += ..., you are making a copy of the entire provisional result. Anytime you need to do a substantial amount of editing on a string, use a StringBuilder.

To look for the next occurrence of word, use String.indexOf() or StringBuilder.indexOf(). It's a lot easier than extracting substrings and checking for equality. When extracting the substring, your code crashes if word is longer than three characters. Using .indexOf() is one way to eliminate that problem.

Here is an example of a more efficient solution:

public String plusOut(String str, String word) {
 StringBuilder s = new StringBuilder(str);
 // Find substrings to redact, from index i (inclusive)
 // to index j (exclusive)
 for (int i = 0; i < s.length(); i += word.length()) {
 int j = s.indexOf(word, i);
 if (j < 0) {
 // word not found; redact the rest of the string
 j = s.length();
 }
 while (i < j) {
 s.setCharAt(i++, '+');
 }
 }
 return s.toString();
}

Question 4

Duly noted. Need to spend a lot of time with StringBuilder. Thanks for pointing me in the right direction.

Question 5

As @mklinker proposed, you could try to use the word lengths. I tried to come up with an example:

public String plusOut(String str, String word){
 String result = "";
 int s = str.length();
 int w = word.length();
 for (int i = 0; i < s; i++) {
 if (i <= s - w) {
 if (str.substring(i,i+w).equals(word)) {
 result += word;
 i += w-1;
 }
 else {
 result += "+";
 }
 }
 else {
 result += "+";
 }
 } 
 return result;
}

Question 6

I suggest rewriting the loop as

for (int i = 0; i < s; ) { if (i <= s - w && str.substring(i, i+w).equals(word)) { result += word; i += w; } else { result += "+"; i++; } }

. It's like the original code, except that the i <= s - w prerequisite acts to prevent .substring() from crashing. Moving i++ out of the loop header makes it clearer that i increases by the length of the newly appended suffix.

Question 7

Instead of incrementing i yourself, you could write the loop as while ((i = result.length()) < s) { ... }, which makes it clear that i is always the length of the result so far.

Question 8

At its core, this is a conditional string replacement problem, so it might make sense to use string replacement techniques.

public String plusOut(String str, String word) {
 StringBuffer result = new StringBuffer(str.length());
 Matcher m = Pattern.compile(Pattern.quote(word) + "|.").matcher(str);
 while (m.find()) {
 // Replace word with itself, and any other character with "+"
 m.appendReplacement(result, word.equals(m.group()) ? "0ドル" : "+");
 }
 return m.appendTail(result).toString();
}

At every point in str, it looks for either the word (which, if found, is replaced by itself) or a single character (which is replaced by a "+").

Pros

This solution is based on a standard recipe.
The thinking is at a higher level of abstraction, without headache-inducing indexes like i or the possibility of StringIndexOutOfBoundsExceptions. All of those details are taken care of by code within Matcher.

Cons

This solution is appealing to those who know regular expressions (which is really a separate skill from knowing Java), and probably befuddling to those who don't.
There are subtle pitfalls that could cause one to write this solution wrong:
- The alternation in the pattern has to be (?>word)|., not .|(?>word), due to the leftmost-longest rule of pattern matching.
- The word in the pattern has to be quoted in case word contains characters with special significance in regexes.
- The replacement text has to be "0ドル", not word, in case word contains a $ character.
It is less efficient than a well-implemented low-level StringBuilder-based solution. Still, it scales better than your original code (which is O(n²) due to inefficient string concatenation).

Question 9

200_success and DJanssens both make good points.

I want to add that, again, Regular Expressions are often useful for text manipulation. This work requires a complicated pattern if you want to keep the groups, but if you can ignore the groups, it is easier.

Note also, that StringBuilders are also more efficient than String concatenation.

Two traditional Java versions. One using a simple split, the other using a more complex one:

Simple split

public static String plusOut(String str, String word) {
 String[] parts = str.split(Pattern.quote(word));
 StringBuilder sb = new StringBuilder(str.length());
 for (int i = 0; i < parts.length; i++) {
 if (i > 0) {
 sb.append(word);
 }
 sb.append(parts[i].replaceAll(".", "+"));
 }
 return sb.toString();
}

Note the use of Pattern.quote(...) in case the input word is something like Boo(hoo)? ... we don't want the word to be interpreted as a regex pattern itself.

Complex split:

public static String plusOut(String str, String word) {
 String[] parts = str.split("(?=\\Q" + word + "\\E)|(?<=\\Q" + word + "\\E)");
 StringBuilder sb = new StringBuilder(str.length());
 for (int i = 0; i < parts.length; i++) {
 sb.append(i % 2 == 0 ? parts[i].replaceAll(".", "+") : parts[i]);
 }
 return sb.toString();
}

Java 8

In Java 8 I would consider the following:

public static String plusOut8(String str, String word) {
 return Stream.of(str.split("\\Q" + word + "\\E"))
 .map(fill -> fill.replaceAll(".", "+"))
 .collect(Collectors.joining(word));
}

See these all working in Ideone

Question 10

Solutions based on split() are fraught with peril. In fact, none of these three implementations is correct for plusOut("monkey-monkeymonkey", "monkey"), and they even break in different ways in Java 7 and 8, due to a compatibility-breaking change. Even without that change, the boundary conditions are difficult to get right when splitting.

200_success 146k22 gold badges191 silver badges481 bronze badges · Accepted Answer · 2015-04-08 22:52:23Z

Manipulating Strings directly is a bad idea, especially when you are doing it repeatedly. Java strings are immutable, so every time you do s += ..., you are making a copy of the entire provisional result. Anytime you need to do a substantial amount of editing on a string, use a StringBuilder.

To look for the next occurrence of word, use String.indexOf() or StringBuilder.indexOf(). It's a lot easier than extracting substrings and checking for equality. When extracting the substring, your code crashes if word is longer than three characters. Using .indexOf() is one way to eliminate that problem.

Here is an example of a more efficient solution:

public String plusOut(String str, String word) {
 StringBuilder s = new StringBuilder(str);
 // Find substrings to redact, from index i (inclusive)
 // to index j (exclusive)
 for (int i = 0; i < s.length(); i += word.length()) {
 int j = s.indexOf(word, i);
 if (j < 0) {
 // word not found; redact the rest of the string
 j = s.length();
 }
 while (i < j) {
 s.setCharAt(i++, '+');
 }
 }
 return s.toString();
}

Duly noted. Need to spend a lot of time with StringBuilder. Thanks for pointing me in the right direction.

Stack Exchange Network

Redacting everything in a string except a particular word

4 Answers 4

Pros

Cons

Simple split

Complex split:

Java 8

You must log in to answer this question.

Hot Network Questions

Redacting everything in a string except a particular word

4 Answers 4

Pros

Cons

Simple split

Complex split:

Java 8

You must log in to answer this question.

Related

Hot Network Questions