Split a string into a list of tokens

Question 1

I have the following assignment that I succeeded in solving, but the code is very inefficient. I would appreciate if someone could show me a more efficient way, perhaps with substring. Note that I am not allowed to use imports or regexes or add more functions.

/**
 * Separates a given string into tokens, which are the "words" that are
 * separated by one or more occurrences of the given separator character.
 * Returns the tokens as an array of String values.
 */
public static String[] tokenize (String str, char separator) {
 // Removes all the occurrences of the separator at the beginning and end of str
 String source = trim(str, separator);
 String[] tokens = new String[charRunCount (source,separator)+1];
 String tmp = ""; // a string in order to take a word, then run over this string
 int j = 0;
 int i = 0;
 while (i < tokens.length) {
 if ( source.charAt (j) != separator ) {
 do {
 tmp += source.charAt (j);
 if ( j >= source.length () - 1 ) {
 break;
 }
 else { // so that we math the source length
 j++;
 }
 } while (source.charAt (j) != separator); 
 }
 if ( source.charAt (j) == separator ) {
 j++;
 while (source.charAt (j) == separator) {
 j++;
 }
 }
 tokens[i] = tmp;// taking the token into place
 tmp = ""; //resetting the token so we can begin anew
 i++;
 }
 return tokens;
}

the charRunCount() function:

public static int charRunCount(String str, char c){
 char last = 0;
 int counter = 0;
 for (int i = 0; i < str.length(); i++) {
 // whenever a run starts.
 if (last != c && str.charAt(i) == c) {
 counter++;
 }
 last = str.charAt(i);
 }
 return counter;
}

Question 2

Welcome to Code Review! I changed the title so that it describes what the code does per site goals: "State what your code does in your title, not your main concerns about it.". Please check that I haven't misrepresented your code, and correct it if I have.

Question 3

Does Java have string views? That is, a non-owning view onto a string? If it does, you should consider using them. You marked your question with the performance tag so using string views rather than allocating a string for each token will considerably improve performance.

Question 4

tmp is never a good name for a variable. In this case, you should call it token instead, or perhaps word. And you rightly complain that building strings using repeated += operations is inefficient, and correctly suggest that .substring() would be better.

Logically, then, you need to find the starting and ending indexes of each token! So, let's define those helper functions (as private static functions):

/**
 * Considering str starting at startIndex, find the index at which the 
 * next token starts.
 *
 * @return The index of the start of a token (or str.length() if no more
 * tokens).
 */
private static int start(String str, char sep, int startIndex) {
 int i;
 for (i = startIndex; i < str.length() && str.charAt(i) == sep; i++);
 return i;
}
/**
 * Considering str starting at startIndex, find the index at which the 
 * current token ends.
 *
 * @return The index just beyond the end of a token (the index of a
 * sep character, or str.length() if this is the last token)
 */
private static int end(String str, char sep, int startIndex) {
 assert(str.charAt(sep) != sep);
 int i;
 for (i = startIndex; i < str.length() && str.charAt(i) != sep; i++);
 return i;
}

Then, we can use them in tokenize():

public static String[] tokenize(String str, char sep) {
 int tokenCount = 0;
 for (int s, e = 0; (s = start(str, sep, e)) < str.length(); e = end(str, sep, s)) {
 tokenCount++;
 }
 String[] tokens = new String[tokenCount];
 tokenCount = 0;
 for (int s, e = 0; (s = start(str, sep, e)) < str.length(); ) {
 tokens[tokenCount++] = str.substring(s, e = end(str, sep, s));
 }
 assert(tokens.length == tokenCount);
 return tokens;
}

Notice that now you can take advantage of the helper functions to predetermine the size of the array. Also, a lot of the repetitiveness of your conditions and loops is eliminated.

All of the analysis is done using string indexes, so there is no string manipulation other than just the essential .substring() calls — and even the trim() call is gone!

Question 5

thank you for your answer, I agree It's more efficient to separate the functions, however we are not allowed to do that in this assignment

Question 6

What, exactly, are you not allowed to do, and why didn't you state the restrictions up front in your question? Didn't you define a charRunCount() helper function in your own solution?

Question 7

Not sure of what you mean with "I am not allowed to use imports or regexes or add more functions." but if you can use String#indexOf this can be greatly simplified:

public class Tokenizer {
 private final char separator;
 public Tokenizer(char separator) {
 this.separator = separator;
 }
 public List<String> tokenize(String string) {
 List<String> tokens = new LinkedList<>();
 int start = 0, end = 0;
 while ( start < string.length() && (end = string.indexOf(separator, start))>-1 ) {
 tokens.add(string.substring(start, end));
 start = end+1;
 }
 tokens.add(string.substring(start));
 return tokens;
 }
}

Question 8

List would be imported from java.util.

Question 9

Indeed. Then we should either loop twice to count the occurrence of "separator" and create a good sized array then loop to populate it. Or manage the increase of the array size by ourselves.

Question 10

I'd like to suggest to do it in a a way, that

doesn't have two loops that basically do the same thing
is more Java-esque

Instead of storing the result in an array, which requires you to first count the potential tokens, store the tokens in a LinkedList.

Normally one would then just return that list instead of an array, because in Java arrays are usually used as an implementation detail hidden inside objects. If you yet need to return an array, LinkedList has a toArray method, that copies the list into an array.

My solution would look like this:

public static String[] tokenize(String str, char separator) {
 Collection<String> result = new LinkedList<>();
 int len = str.length();
 int pos = 0;
 int start = 0; 
 while (pos < len) {
 if (str.charAt(pos) != separator) {
 pos++;
 continue;
 }
 if (pos > start) {
 result.add(str.substring(start, pos));
 }
 do {
 pos++;
 start = pos;
 } while (pos < len && str.charAt(pos) == separator);
 }
 if (pos > start) {
 result.add(str.substring(start, pos));
 }
 return result.toArray(new String[result.size()]);
}

EDIT: Here's a version without additional import :)

static class LinkedList {
 private class Node {
 Node(String value) {
 this.value = value;
 }
 String value;
 Node next;
 }
 private Node first = null;
 private Node last = null;
 private int size = 0;
 public void add(String value) {
 Node node = new Node(value);
 size++;
 if (first == null) {
 first = last = node;
 return;
 }
 last.next = node;
 last = node;
 }
 public String[] toArray() {
 String[] array = new String[size];
 int i = 0;
 Node current = first;
 while (current != null) {
 array[i] = current.value;
 i++;
 current = current.next;
 }
 return array;
 }
}
public static String[] tokenize(String str, char separator) {
 LinkedList result = new LinkedList();
 int len = str.length();
 int pos = 0;
 int start = 0;
 while (pos < len) {
 if (str.charAt(pos) != separator) {
 pos++;
 continue;
 }
 if (pos > start) {
 result.add(str.substring(start, pos));
 }
 do {
 pos++;
 start = pos;
 } while (pos < len && str.charAt(pos) == separator);
 }
 if (pos > start) {
 result.add(str.substring(start, pos));
 }
 return result.toArray();
}

Question 11

Collection would be imported from java.util.

Question 12

@200_success Ah, I missed that requirement. I can't use LinkedList then anyway. But in that case I'd just write a simple linked list class myself, which isn't against the rules :)

Question 13

I found a simple way for beginners:

public static String[] tokenize (String str, char separator) {
 // Removes all the occurrences of the separator at the beginning and end of str
 String source = trim(str, separator);
 // In the following statement, replace the 0 with the correct number of tokens, and complete the missing code.
 String[] tokens = new String[charRunCount (source,separator)+1]; //since we start from 0.
 int i = 0;
 int j = 0;
 int sourceLen = source.length ();
 while (i < sourceLen) {
 String tmp = "";
 while (i < sourceLen && source.charAt (i) != separator) { // add each character in the source until you hit separator
 tmp += source.charAt (i++);
 }
 if ( !(tmp.equals ("")) ) { //so that if i=separator don't go into the token
 tokens[j++] = tmp;
 }
 i++;
 }
 return tokens;
}

score 3 · Accepted Answer · 2018-11-06 07:43:23Z

tmp is never a good name for a variable. In this case, you should call it token instead, or perhaps word. And you rightly complain that building strings using repeated += operations is inefficient, and correctly suggest that .substring() would be better.

Logically, then, you need to find the starting and ending indexes of each token! So, let's define those helper functions (as private static functions):

/**
 * Considering str starting at startIndex, find the index at which the 
 * next token starts.
 *
 * @return The index of the start of a token (or str.length() if no more
 * tokens).
 */
private static int start(String str, char sep, int startIndex) {
 int i;
 for (i = startIndex; i < str.length() && str.charAt(i) == sep; i++);
 return i;
}
/**
 * Considering str starting at startIndex, find the index at which the 
 * current token ends.
 *
 * @return The index just beyond the end of a token (the index of a
 * sep character, or str.length() if this is the last token)
 */
private static int end(String str, char sep, int startIndex) {
 assert(str.charAt(sep) != sep);
 int i;
 for (i = startIndex; i < str.length() && str.charAt(i) != sep; i++);
 return i;
}

Then, we can use them in tokenize():

public static String[] tokenize(String str, char sep) {
 int tokenCount = 0;
 for (int s, e = 0; (s = start(str, sep, e)) < str.length(); e = end(str, sep, s)) {
 tokenCount++;
 }
 String[] tokens = new String[tokenCount];
 tokenCount = 0;
 for (int s, e = 0; (s = start(str, sep, e)) < str.length(); ) {
 tokens[tokenCount++] = str.substring(s, e = end(str, sep, s));
 }
 assert(tokens.length == tokenCount);
 return tokens;
}

Notice that now you can take advantage of the helper functions to predetermine the size of the array. Also, a lot of the repetitiveness of your conditions and loops is eliminated.

All of the analysis is done using string indexes, so there is no string manipulation other than just the essential .substring() calls — and even the trim() call is gone!

thank you for your answer, I agree It's more efficient to separate the functions, however we are not allowed to do that in this assignment
What, exactly, are you not allowed to do, and why didn't you state the restrictions up front in your question? Didn't you define a charRunCount() helper function in your own solution?

Stack Exchange Network

Split a string into a list of tokens

4 Answers 4

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Split a string into a list of tokens

4 Answers 4

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions