I need some tips on how I can improve the design of the following code. Everything works correctly in the program. It takes a string and shows each token, the only tokens being numbers and operators. The output shows all the tokens and what type of token they are.
```java
public class Tokenizer {
    int pos;
    char[] expression;

    Tokenizer(String expression) {
        this.expression = expression.toCharArray();
        this.pos = 0;
    }

    enum Type { OPERATOR, NUMBER, UNKNOWN }

    class Lexeme {
        String type, token;

        Lexeme(String type, String token) {
            this.type = type;
            this.token = token;
        }
    }

    Lexeme getToken() {
        StringBuilder token = new StringBuilder();
        boolean endOfToken = false;
        Type type = Type.UNKNOWN;
        while (!endOfToken && hasMoreToken())
        {
            while (expression[pos] == ' ')
                pos++;
            switch (expression[pos])
            {
                case '+':
                case '-':
                case '*':
                case '/':
                    if (type != Type.NUMBER)
                    {
                        type = Type.OPERATOR;
                        token.append(expression[pos]);
                        pos++;
                    }
                    endOfToken = true;
                    break;
                case ' ':
                    endOfToken = true;
                    pos++;
                    break;
                default:
                    if (Character.isDigit(expression[pos]) || expression[pos] == '.')
                    {
                        token.append(expression[pos]);
                        type = Type.NUMBER;
                    }
                    else
                    {
                        System.out.println("Systax error at position: " + pos);
                    }
                    pos++;
                    break;
            }
        }
        return new Lexeme(type.name().toLowerCase(), token.toString());
    }

    boolean hasMoreToken() {
        return pos < expression.length;
    }

    public static void main(String[] args) {
        // TODO code application logic here
        String expression = "12.3 / 5 - 22";
        Tokenizer tokenizer = new Tokenizer(expression);
        while (tokenizer.hasMoreToken())
        {
            Lexeme nextToken = tokenizer.getToken();
            System.out.print("Type: " + nextToken.type + "\tLexeme: " + nextToken.token + "\n");
        }
    }
}
```
This code looks quite nice, actually. If this was a homework assignment, you could almost turn it in right now. I only have some minor issues with the code, plus some bugs, and would then like to talk about alternatives to structure your lexer (the words tokenizer, lexer and scanner are all equivalent here).
- I don't see a single comment in your code, aside from that TODO.
- `hasMoreToken` should be `hasMoreTokens`. Notice the plural "s". This makes reading the code more pleasant.
- Each lexeme should have a `Type type`, not a `String type`. By using a string, you lose all type safety. (Apparently there is a name for this antipattern: Stringly Typed.)
- You should probably use `toString()` instead of `name()`. This should not happen before the type is actually printed out, i.e. only in the `main`.
- A seasoned Java programmer will use a lot more visibility modifiers. Especially `pos` and `expression` ought to be private, but I personally don't care.
- You have inconsistent formatting. You put the opening braces of class and method declarations on the same line, but you put them under the keyword for control flow statements. In Java, the braces usually open on the same line.
- It looks as if your switch was outdented one level. This happens because you didn't use a block for the preceding `while`. Always using blocks can make code easier to read.
- General style note: place blank lines between logically separate blocks of your code. Inside the `Lexeme` class, a blank line between the fields and the methods could improve readability.

This piece of code:
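Putting the first few of those points together, the `Lexeme` class might look like this. This is only a sketch of the suggestions above; the class name `LexemeDemo` and the getter names are mine, added so the example is self-contained:

```java
// Sketch: an enum-typed field instead of a String, private fields,
// and lowercasing only at print time.
public class LexemeDemo {
    enum Type { OPERATOR, NUMBER }

    static class Lexeme {
        private final Type type;     // Type instead of String: type-safe
        private final String token;

        Lexeme(Type type, String token) {
            this.type = type;
            this.token = token;
        }

        Type getType() { return type; }
        String getToken() { return token; }
    }

    public static void main(String[] args) {
        Lexeme l = new Lexeme(Type.NUMBER, "12.3");
        // toString()/toLowerCase() only when printing, not when constructing
        System.out.println("Type: " + l.getType().toString().toLowerCase()
                + "\tLexeme: " + l.getToken());
    }
}
```

With the enum field, a typo like `"nubmer"` becomes a compile error instead of a silent bug.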
```java
while(expression[pos] == ' ')
    pos++;
```
will attempt to access an array entry outside of the allowed range when the string ends with whitespace. Add a bounds check here.
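A minimal version of that bounds check, wrapped in a tiny demo class so it runs on its own (the class and method names are mine, for illustration):

```java
public class WhitespaceSkipDemo {
    // returns the first non-space index at or after pos,
    // or expression.length if only spaces remain
    static int skipWhitespace(char[] expression, int pos) {
        // checking pos first prevents ArrayIndexOutOfBoundsException
        // on trailing whitespace
        while (pos < expression.length && expression[pos] == ' ') {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        char[] expr = "12   ".toCharArray();
        // without the bounds check, this call would throw
        System.out.println(skipWhitespace(expr, 2)); // prints 5
    }
}
```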
On an architecture level, it would be nice to make your `Tokenizer` an `Iterable`. Heck, it implicitly already is a kind of `Iterator`. If we clean that up, we can do:
```java
for (Token token : new Tokenizer(expression)) {
    System.out.println(...);
}
```
We can turn it into an `Iterator` by renaming `hasMoreTokens` to `hasNext` and `getToken` to `next`. If you can't rename them, a wrapper method will do as well. We also need a dummy method `remove` that throws an `UnsupportedOperationException`. If we want to make it `Iterable`, we need to add an `iterator` method that returns the object itself, although this may be considered hackish.
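A minimal sketch of that adapter. To keep the example self-contained I've substituted a trivial split-on-whitespace stand-in for the real lexing logic, and the class name `IterableTokenizer` is mine:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class IterableTokenizer implements Iterable<String>, Iterator<String> {
    private final String[] tokens;   // stand-in for the real lexing state
    private int pos = 0;

    IterableTokenizer(String expression) {
        // trivial stand-in for getToken(): split on whitespace
        this.tokens = expression.trim().split("\\s+");
    }

    @Override
    public boolean hasNext() {       // was: hasMoreTokens()
        return pos < tokens.length;
    }

    @Override
    public String next() {           // was: getToken()
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return tokens[pos++];
    }

    @Override
    public void remove() {           // dummy, required before Java 8
        throw new UnsupportedOperationException();
    }

    @Override
    public Iterator<String> iterator() {
        return this;                 // returns the object itself (hackish)
    }

    public static void main(String[] args) {
        for (String token : new IterableTokenizer("12.3 / 5 - 22")) {
            System.out.println(token);
        }
    }
}
```

The hackish part: a proper `Iterable` can hand out fresh iterators repeatedly, while this one can only be iterated once because the iterator *is* the tokenizer.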
The State Machine, and Why I Don't Like It
In your `getToken` method, you have written a kind of state machine that performs the parsing. You have seemingly flattened it into a single state with one character lookahead, but it actually has three (plus two) different states. Here are all the transitions you have written:
```
# transitions that consume a char are written  A -- char  -> B
# lookahead transitions are written            A --(char)-> B

START --(not ' ')-> UNKNOWN
START -- ' ' -> START

# the operator cases
UNKNOWN -- '+'|'-'|'*'|'/'   -> OPERATOR END
NUMBER  --('+'|'-'|'*'|'/')  -> END

# the whitespace case
UNKNOWN  -- ' ' -> END  # unreachable
OPERATOR -- ' ' -> END  # unreachable
NUMBER   -- ' ' -> END

# the default digit case
UNKNOWN  -- '0'..'9'|'.' -> NUMBER
NUMBER   -- '0'..'9'|'.' -> NUMBER
OPERATOR -- '0'..'9'|'.' -> NUMBER  # unreachable

# the default error case
UNKNOWN  -- not '0'..'9'|'.' -> UNKNOWN
NUMBER   -- not '0'..'9'|'.' -> NUMBER
OPERATOR -- not '0'..'9'|'.' -> OPERATOR  # unreachable
```
There are a few issues with this.

- You can't output good error messages, because you don't have a good idea of where you are, and because you skip over any junk in the string, allowing `12foobar3` to be recognized as `123`. Any extensions to the lexer would therefore be backwards-incompatible.
- Extending this kind of lexer gets difficult as the number of token types grows, because all possible transitions have to be considered for each type, and all transitions happen in that one switch.
- It obscures the actual flow of control.
- You would consider `...` to be a number.
- I don't feel comfortable with consuming the char in the whitespace case. It is logically separate from tokenizing a number, and therefore should be separated in your code.
- The existence of an `UNKNOWN` type is ridiculous. This is a case where having a `null` type might be preferable, and complaining if it is still null at the end of the method.
It might be better to make your states more explicit as actual states in your code. Also note that single-character lookahead is sufficient for your grammar, but would have difficulties recognizing operators like `<=`.
Another thing worth considering would be to put the body of your `while (hasMoreTokens())` loop into a separate method – this would remove the need for the `endOfToken` flag, as you could simply `return`. First, optimize for simplicity and readability, only second for performance.
Together, I might write code like this (untested sketch, no final code):
```java
...

boolean illegalState = false;

public boolean hasNext() {
    return pos < expression.length && !illegalState;
}

public Lexeme next() throws java.text.ParseException {
    if (!hasNext()) {
        throw new NoSuchElementException();
    }

    // this can be removed if we do it in the constructor,
    // as we remove all WS *after* each token.
    skipWhitespace();

    // dispatch actual tokenization based on single character lookahead
    Lexeme lexeme;
    char c = expression[pos];
    if (c == '+' || c == '-' || c == '*' || c == '/') {
        pos++;
        lexeme = new Lexeme(Type.OPERATOR, Character.toString(c));
    } else if ('0' <= c && c <= '9') {
        lexeme = tokenizeNumber();
    } else {
        illegalState = true;
        throw new java.text.ParseException("Expected operator or number", pos);
    }

    // clean up after ourselves, so that hasNext() doesn't recognize
    // trailing whitespace as another token:
    skipWhitespace();
    return lexeme;
}

private void skipWhitespace() {
    while (pos < expression.length && expression[pos] == ' ') {
        pos++;
    }
}

private Lexeme tokenizeNumber() throws java.text.ParseException {
    StringBuilder token = new StringBuilder();
    // effectively, this method recognizes the language specified by the regex
    //     /[0-9]+(?:\.[0-9]+)?/
    // (PCRE style)

    // lex the integer part
    int posBeforeInt = pos;
    while (pos < expression.length && '0' <= expression[pos] && expression[pos] <= '9') {
        token.append(expression[pos++]);
    }
    if (pos == posBeforeInt) {
        illegalState = true;
        throw new java.text.ParseException("Every number must have an integer part", pos);
    }

    // optional: lex decimal (the bounds check guards against a number
    // at the very end of the input)
    if (pos < expression.length && expression[pos] == '.') {
        token.append(expression[pos++]);
        int posBeforeDecimal = pos;
        while (pos < expression.length && '0' <= expression[pos] && expression[pos] <= '9') {
            token.append(expression[pos++]);
        }
        if (pos == posBeforeDecimal) {
            illegalState = true;
            throw new java.text.ParseException("Number has decimal point, but no decimal digits", pos);
        }
    }
    return new Lexeme(Type.NUMBER, token.toString());
}
```
You may have noticed that this lexer is much more similar to a top-down parser now, which I find easier to reason about. There is no fundamental difference between a lexer and a parser; both transform a sequence of X into Y by giving it structure. (Lexer: chars → tokens, Parser: tokens → parse tree.)
An abstraction possibility that opens up is the concept of a character class – a `Set` of characters with common properties. E.g. `'0' <= expression[pos] && expression[pos] <= '9'` ought to be written as `digits.contains(expression[pos])`. Also, there are dozens more whitespace characters than just the `\x20` space character.
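A quick sketch of that character-class idea (the class and method names are mine; `Set.of` requires Java 9+):

```java
import java.util.Set;

// A "character class": a Set of characters with a common property,
// queried with contains() instead of hand-written range checks.
public class CharClassDemo {
    private static final Set<Character> DIGITS =
            Set.of('0', '1', '2', '3', '4', '5', '6', '7', '8', '9');

    static boolean isDigit(char c) {
        return DIGITS.contains(c);   // reads better than '0' <= c && c <= '9'
    }

    public static void main(String[] args) {
        System.out.println(isDigit('7'));   // true
        System.out.println(isDigit('x'));   // false
        // the standard library already knows about the many whitespace
        // characters beyond '\u0020':
        System.out.println(Character.isWhitespace('\t'));   // true
    }
}
```

For whitespace in particular, `Character.isWhitespace` is effectively a predefined character class, so you rarely need to build that set yourself.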