I have the following assignment that I succeeded in solving, but the code is very inefficient. I would appreciate if someone could show me a more efficient way, perhaps with substring. Note that I am not allowed to use imports or regexes or add more functions.
/**
* Separates a given string into tokens, which are the "words" that are
* separated by one or more occurrences of the given separator character.
* Returns the tokens as an array of String values.
*/
public static String[] tokenize (String str, char separator) {
// Removes all the occurrences of the separator at the beginning and end of str
String source = trim(str, separator);
String[] tokens = new String[charRunCount (source,separator)+1];
String tmp = ""; // a string in order to take a word, then run over this string
int j = 0;
int i = 0;
while (i < tokens.length) {
if ( source.charAt (j) != separator ) {
do {
tmp += source.charAt (j);
if ( j >= source.length () - 1 ) {
break;
}
else { // so that we math the source length
j++;
}
} while (source.charAt (j) != separator);
}
if ( source.charAt (j) == separator ) {
j++;
while (source.charAt (j) == separator) {
j++;
}
}
tokens[i] = tmp;// taking the token into place
tmp = ""; //resetting the token so we can begin anew
i++;
}
return tokens;
}
the charRunCount()
function:
public static int charRunCount(String str, char c){
char last = 0;
int counter = 0;
for (int i = 0; i < str.length(); i++) {
// whenever a run starts.
if (last != c && str.charAt(i) == c) {
counter++;
}
last = str.charAt(i);
}
return counter;
}
-
\$\begingroup\$ Welcome to Code Review! I changed the title so that it describes what the code does per site goals: "State what your code does in your title, not your main concerns about it.". Please check that I haven't misrepresented your code, and correct it if I have. \$\endgroup\$Toby Speight– Toby Speight2018年11月05日 16:49:35 +00:00Commented Nov 5, 2018 at 16:49
-
\$\begingroup\$ Does Java have string views? That is, a non-owning view onto a string? If it does, you should consider using them. You marked your question with the performance tag so using string views rather than allocating a string for each token will considerably improve performance. \$\endgroup\$Indiana Kernick– Indiana Kernick2018年11月06日 07:26:39 +00:00Commented Nov 6, 2018 at 7:26
4 Answers 4
tmp
is never a good name for a variable. In this case, you should call it token
instead, or perhaps word
. And you rightly complain that building strings using repeated +=
operations is inefficient, and correctly suggest that .substring()
would be better.
Logically, then, you need to find the starting and ending indexes of each token! So, let's define those helper functions (as private static
functions):
/**
* Considering str starting at startIndex, find the index at which the
* next token starts.
*
* @return The index of the start of a token (or str.length() if no more
* tokens).
*/
private static int start(String str, char sep, int startIndex) {
int i;
for (i = startIndex; i < str.length() && str.charAt(i) == sep; i++);
return i;
}
/**
* Considering str starting at startIndex, find the index at which the
* current token ends.
*
* @return The index just beyond the end of a token (the index of a
* sep character, or str.length() if this is the last token)
*/
private static int end(String str, char sep, int startIndex) {
assert(str.charAt(sep) != sep);
int i;
for (i = startIndex; i < str.length() && str.charAt(i) != sep; i++);
return i;
}
Then, we can use them in tokenize()
:
public static String[] tokenize(String str, char sep) {
int tokenCount = 0;
for (int s, e = 0; (s = start(str, sep, e)) < str.length(); e = end(str, sep, s)) {
tokenCount++;
}
String[] tokens = new String[tokenCount];
tokenCount = 0;
for (int s, e = 0; (s = start(str, sep, e)) < str.length(); ) {
tokens[tokenCount++] = str.substring(s, e = end(str, sep, s));
}
assert(tokens.length == tokenCount);
return tokens;
}
Notice that now you can take advantage of the helper functions to predetermine the size of the array. Also, a lot of the repetitiveness of your conditions and loops is eliminated.
All of the analysis is done using string indexes, so there is no string manipulation other than just the essential .substring()
calls — and even the trim()
call is gone!
-
\$\begingroup\$ thank you for your answer, I agree It's more efficient to separate the functions, however we are not allowed to do that in this assignment \$\endgroup\$Yuki1112– Yuki11122018年11月06日 09:36:13 +00:00Commented Nov 6, 2018 at 9:36
-
\$\begingroup\$ What, exactly, are you not allowed to do, and why didn't you state the restrictions up front in your question? Didn't you define a
charRunCount()
helper function in your own solution? \$\endgroup\$200_success– 200_success2018年11月06日 13:52:20 +00:00Commented Nov 6, 2018 at 13:52
Not sure of what you mean with "I am not allowed to use imports or regexes or add more functions." but if you can use String#indexOf
this can be greatly simplified:
public class Tokenizer {
private final char separator;
public Tokenizer(char separator) {
this.separator = separator;
}
public List<String> tokenize(String string) {
List<String> tokens = new LinkedList<>();
int start = 0, end = 0;
while ( start < string.length() && (end = string.indexOf(separator, start))>-1 ) {
tokens.add(string.substring(start, end));
start = end+1;
}
tokens.add(string.substring(start));
return tokens;
}
}
-
1\$\begingroup\$
List
would be imported fromjava.util
. \$\endgroup\$200_success– 200_success2018年11月07日 13:11:39 +00:00Commented Nov 7, 2018 at 13:11 -
\$\begingroup\$ Indeed. Then we should either loop twice to count the occurrence of "separator" and create a good sized array then loop to populate it. Or manage the increase of the array size by ourselves. \$\endgroup\$gervais.b– gervais.b2018年11月07日 22:03:39 +00:00Commented Nov 7, 2018 at 22:03
I'd like to suggest to do it in a a way, that
- doesn't have two loops that basically do the same thing
- is more Java-esque
Instead of storing the result in an array, which requires you to first count the potential tokens, store the tokens in a LinkedList
.
Normally one would then just return that list instead of an array, because in Java arrays are usually used as an implementation detail hidden inside objects. If you yet need to return an array, LinkedList
has a toArray
method, that copies the list into an array.
My solution would look like this:
public static String[] tokenize(String str, char separator) {
Collection<String> result = new LinkedList<>();
int len = str.length();
int pos = 0;
int start = 0;
while (pos < len) {
if (str.charAt(pos) != separator) {
pos++;
continue;
}
if (pos > start) {
result.add(str.substring(start, pos));
}
do {
pos++;
start = pos;
} while (pos < len && str.charAt(pos) == separator);
}
if (pos > start) {
result.add(str.substring(start, pos));
}
return result.toArray(new String[result.size()]);
}
EDIT: Here's a version without additional import :)
static class LinkedList {
private class Node {
Node(String value) {
this.value = value;
}
String value;
Node next;
}
private Node first = null;
private Node last = null;
private int size = 0;
public void add(String value) {
Node node = new Node(value);
size++;
if (first == null) {
first = last = node;
return;
}
last.next = node;
last = node;
}
public String[] toArray() {
String[] array = new String[size];
int i = 0;
Node current = first;
while (current != null) {
array[i] = current.value;
i++;
current = current.next;
}
return array;
}
}
public static String[] tokenize(String str, char separator) {
LinkedList result = new LinkedList();
int len = str.length();
int pos = 0;
int start = 0;
while (pos < len) {
if (str.charAt(pos) != separator) {
pos++;
continue;
}
if (pos > start) {
result.add(str.substring(start, pos));
}
do {
pos++;
start = pos;
} while (pos < len && str.charAt(pos) == separator);
}
if (pos > start) {
result.add(str.substring(start, pos));
}
return result.toArray();
}
-
1\$\begingroup\$
Collection
would be imported fromjava.util
. \$\endgroup\$200_success– 200_success2018年11月06日 13:53:28 +00:00Commented Nov 6, 2018 at 13:53 -
\$\begingroup\$ @200_success Ah, I missed that requirement. I can't use
LinkedList
then anyway. But in that case I'd just write a simple linked list class myself, which isn't against the rules :) \$\endgroup\$RoToRa– RoToRa2018年11月07日 07:38:07 +00:00Commented Nov 7, 2018 at 7:38
I found a simple way for beginners:
public static String[] tokenize (String str, char separator) {
// Removes all the occurrences of the separator at the beginning and end of str
String source = trim(str, separator);
// In the following statement, replace the 0 with the correct number of tokens, and complete the missing code.
String[] tokens = new String[charRunCount (source,separator)+1]; //since we start from 0.
int i = 0;
int j = 0;
int sourceLen = source.length ();
while (i < sourceLen) {
String tmp = "";
while (i < sourceLen && source.charAt (i) != separator) { // add each character in the source until you hit separator
tmp += source.charAt (i++);
}
if ( !(tmp.equals ("")) ) { //so that if i=separator don't go into the token
tokens[j++] = tmp;
}
i++;
}
return tokens;
}
Explore related questions
See similar questions with these tags.