I need some tips on how I can improve the design of the following code. Everything works correctly in the program. It takes a string and shows each token, the only tokens being numbers and operators. The output shows all the tokens and what type of token they are.
```java
public class Tokenizer {
    int pos;
    char[] expression;

    Tokenizer(String expression) {
        this.expression = expression.toCharArray();
        this.pos = 0;
    }

    enum Type { OPERATOR, NUMBER, UNKNOWN }

    class Lexeme {
        String type, token;

        Lexeme(String type, String token) {
            this.type = type;
            this.token = token;
        }
    }

    Lexeme getToken() {
        StringBuilder token = new StringBuilder();
        boolean endOfToken = false;
        Type type = Type.UNKNOWN;
        while (!endOfToken && hasMoreToken())
        {
            while (expression[pos] == ' ')
                pos++;
            switch (expression[pos])
            {
                case '+':
                case '-':
                case '*':
                case '/':
                    if (type != Type.NUMBER)
                    {
                        type = Type.OPERATOR;
                        token.append(expression[pos]);
                        pos++;
                    }
                    endOfToken = true;
                    break;
                case ' ':
                    endOfToken = true;
                    pos++;
                    break;
                default:
                    if (Character.isDigit(expression[pos]) || expression[pos] == '.')
                    {
                        token.append(expression[pos]);
                        type = Type.NUMBER;
                    }
                    else
                    {
                        System.out.println("Systax error at position: " + pos);
                    }
                    pos++;
                    break;
            }
        }
        return new Lexeme(type.name().toLowerCase(), token.toString());
    }

    boolean hasMoreToken() {
        return pos < expression.length;
    }

    public static void main(String[] args) {
        // TODO code application logic here
        String expression = "12.3 / 5 - 22";
        Tokenizer tokenizer = new Tokenizer(expression);
        while (tokenizer.hasMoreToken())
        {
            Lexeme nextToken = tokenizer.getToken();
            System.out.print("Type: " + nextToken.type + "\tLexeme: " + nextToken.token + "\n");
        }
    }
}
```
This code looks quite nice, actually. If this was a homework assignment, you could almost turn it in right now. I only have some minor issues with the code, plus some bugs, and would then like to talk about alternatives to structure your lexer (the words tokenizer, lexer and scanner are all equivalent here).
- I don't see a single comment in your code, aside from that TODO.
- `hasMoreToken` should be `hasMoreTokens`. Notice the plural "s". This makes reading the code more pleasant.
- Each lexeme should have a `Type type`, not a `String type`. By using a string, you lose all type safety. (Apparently there is a name for this antipattern: Stringly Typed.)
- You should probably use `toString()` instead of `name()`. This should not happen before the type is actually printed out, i.e. only in the `main`.
- A seasoned Java programmer will use a lot more visibility modifiers. Especially `pos` and `expression` ought to be private, but I personally don't care.
- You have inconsistent formatting. You put the opening braces of class and method declarations on the same line, but you put them under the keyword for control flow statements. In Java, the braces usually open on the same line.
- It looks as if your switch was outdented one level. This happens because you didn't use a block for the preceding `while`. Always using blocks can make code easier to read.
- General style note: place blank lines between logically separate blocks of your code. Inside the `Lexeme` class, a blank line between the fields and the methods could improve readability.

This piece of code:
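Putting the first few of those points together, the `Lexeme` class might look like this. This is only a sketch of the suggestions above; the class name `LexemeDemo` and the getter names are mine, added so the example is self-contained:

```java
// Sketch: an enum-typed field instead of a String, private fields,
// and lowercasing only at print time.
public class LexemeDemo {
    enum Type { OPERATOR, NUMBER }

    static class Lexeme {
        private final Type type;     // Type instead of String: type-safe
        private final String token;

        Lexeme(Type type, String token) {
            this.type = type;
            this.token = token;
        }

        Type getType() { return type; }
        String getToken() { return token; }
    }

    public static void main(String[] args) {
        Lexeme l = new Lexeme(Type.NUMBER, "12.3");
        // toString()/toLowerCase() only when printing, not when constructing
        System.out.println("Type: " + l.getType().toString().toLowerCase()
                + "\tLexeme: " + l.getToken());
    }
}
```

With the enum field, a typo like `"nubmer"` becomes a compile error instead of a silent bug.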
```java
while(expression[pos] == ' ')
    pos++;
```
will attempt to access an array entry outside of the allowed range when the string ends with whitespace. Add a bounds check here.
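A minimal version of that bounds check, wrapped in a tiny demo class so it runs on its own (the class and method names are mine, for illustration):

```java
public class WhitespaceSkipDemo {
    // returns the first non-space index at or after pos,
    // or expression.length if only spaces remain
    static int skipWhitespace(char[] expression, int pos) {
        // checking pos first prevents ArrayIndexOutOfBoundsException
        // on trailing whitespace
        while (pos < expression.length && expression[pos] == ' ') {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        char[] expr = "12   ".toCharArray();
        // without the bounds check, this call would throw
        System.out.println(skipWhitespace(expr, 2)); // prints 5
    }
}
```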
On an architecture level, it would be nice to make your `Tokenizer` an `Iterable`. Heck, it implicitly already is a kind of `Iterator`. If we clean that up, we can do:
```java
for (Token token : new Tokenizer(expression)) {
    System.out.println(...);
}
```
We can turn it into an `Iterator` by renaming `hasMoreTokens` to `hasNext` and `getToken` to `next`. If you can't rename them, a wrapper method will do as well. We also need a dummy method `remove` that throws an `UnsupportedOperationException`. If we want to make it `Iterable`, we need to add an `iterator` method that returns the object itself, although this may be considered hackish.
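A minimal sketch of that adapter. To keep the example self-contained I've substituted a trivial split-on-whitespace stand-in for the real lexing logic, and the class name `IterableTokenizer` is mine:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class IterableTokenizer implements Iterable<String>, Iterator<String> {
    private final String[] tokens;   // stand-in for the real lexing state
    private int pos = 0;

    IterableTokenizer(String expression) {
        // trivial stand-in for getToken(): split on whitespace
        this.tokens = expression.trim().split("\\s+");
    }

    @Override
    public boolean hasNext() {       // was: hasMoreTokens()
        return pos < tokens.length;
    }

    @Override
    public String next() {           // was: getToken()
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return tokens[pos++];
    }

    @Override
    public void remove() {           // dummy, required before Java 8
        throw new UnsupportedOperationException();
    }

    @Override
    public Iterator<String> iterator() {
        return this;                 // returns the object itself (hackish)
    }

    public static void main(String[] args) {
        for (String token : new IterableTokenizer("12.3 / 5 - 22")) {
            System.out.println(token);
        }
    }
}
```

The hackish part: a proper `Iterable` can hand out fresh iterators repeatedly, while this one can only be iterated once because the iterator *is* the tokenizer.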
The State Machine, and Why I Don't Like It
In your `getToken` method, you have written a kind of state machine that performs the parsing. You have seemingly flattened it into a single state with one character lookahead, but it actually has three (plus two) different states. Here are all the transitions you have written:
```
# transitions that consume a char are written  A -- char  -> B
# lookahead transitions are written            A --(char)-> B

START --(not ' ')-> UNKNOWN
START -- ' ' -> START

# the operator cases
UNKNOWN -- '+'|'-'|'*'|'/'   -> OPERATOR END
NUMBER  --('+'|'-'|'*'|'/')  -> END

# the whitespace case
UNKNOWN  -- ' ' -> END  # unreachable
OPERATOR -- ' ' -> END  # unreachable
NUMBER   -- ' ' -> END

# the default digit case
UNKNOWN  -- '0'..'9'|'.' -> NUMBER
NUMBER   -- '0'..'9'|'.' -> NUMBER
OPERATOR -- '0'..'9'|'.' -> NUMBER  # unreachable

# the default error case
UNKNOWN  -- not '0'..'9'|'.' -> UNKNOWN
NUMBER   -- not '0'..'9'|'.' -> NUMBER
OPERATOR -- not '0'..'9'|'.' -> OPERATOR  # unreachable
```
There are a few issues with this.

- You can't output good error messages, because you don't have a good idea of where you are, and because you skip over any junk in the string, allowing `12foobar3` to be recognized as `123`. Any extensions to the lexer would therefore be backwards-incompatible.
- Extending this kind of lexer gets difficult as the number of token types grows, because all possible transitions have to be considered for each type, and all transitions happen in that one switch.
- It obscures the actual flow of control.
- You would consider `...` to be a number.
- I don't feel comfortable with consuming the char in the whitespace case. It is logically separate from tokenizing a number, and therefore should be separated in your code.
- The existence of an `UNKNOWN` type is ridiculous. This is a case where having a `null` type might be preferable, and complaining if it is still null at the end of the method.
It might be better to make your states more explicit as actual states in your code. Also note that single-character lookahead is sufficient for your grammar, but would have difficulties recognizing operators like `<=`.
Another thing worth considering would be to put the body of your `while (hasMoreTokens())` loop into a separate method – this would remove the need for the `endOfToken` flag, as you could simply `return`. First, optimize for simplicity and readability, only second for performance.
Together, I might write code like this (untested sketch, no final code):
```java
...

boolean illegalState = false;

public boolean hasNext() {
    return pos < expression.length && !illegalState;
}

public Lexeme next() throws java.text.ParseException {
    if (!hasNext()) {
        throw new NoSuchElementException();
    }

    // this can be removed if we do it in the constructor,
    // as we remove all WS *after* each token.
    skipWhitespace();

    // dispatch actual tokenization based on single character lookahead
    Lexeme lexeme;
    char c = expression[pos];
    if (c == '+' || c == '-' || c == '*' || c == '/') {
        pos++;
        lexeme = new Lexeme(Type.OPERATOR, Character.toString(c));
    } else if ('0' <= c && c <= '9') {
        lexeme = tokenizeNumber();
    } else {
        illegalState = true;
        throw new java.text.ParseException("Expected operator or number", pos);
    }

    // clean up after ourselves, so that hasNext() doesn't recognize
    // trailing whitespace as another token:
    skipWhitespace();
    return lexeme;
}

private void skipWhitespace() {
    while (pos < expression.length && expression[pos] == ' ') {
        pos++;
    }
}

private Lexeme tokenizeNumber() throws java.text.ParseException {
    StringBuilder token = new StringBuilder();
    // effectively, this method recognizes the language specified by the regex
    //     /[0-9]+(?:\.[0-9]+)?/
    // (PCRE style)

    // lex the integer part
    int posBeforeInt = pos;
    while (pos < expression.length && '0' <= expression[pos] && expression[pos] <= '9') {
        token.append(expression[pos++]);
    }
    if (pos == posBeforeInt) {
        illegalState = true;
        throw new java.text.ParseException("Every number must have an integer part", pos);
    }

    // optional: lex decimal (the bounds check guards against a number
    // at the very end of the input)
    if (pos < expression.length && expression[pos] == '.') {
        token.append(expression[pos++]);
        int posBeforeDecimal = pos;
        while (pos < expression.length && '0' <= expression[pos] && expression[pos] <= '9') {
            token.append(expression[pos++]);
        }
        if (pos == posBeforeDecimal) {
            illegalState = true;
            throw new java.text.ParseException("Number has decimal point, but no decimal digits", pos);
        }
    }
    return new Lexeme(Type.NUMBER, token.toString());
}
```
You may have noticed that this lexer is much more similar to a top-down parser now, which I find easier to reason about. There is no fundamental difference between a lexer and a parser; both transform a sequence of X into Y by giving it structure. (Lexer: chars → tokens, Parser: tokens → parse tree.)
An abstraction possibility that opens up is the concept of a character class – a `Set` of characters with common properties. E.g. `'0' <= expression[pos] && expression[pos] <= '9'` ought to be written as `digits.contains(expression[pos])`. Also, there are dozens more whitespace characters than just the `\x20` space character.
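A quick sketch of that character-class idea (the class and method names are mine; `Set.of` requires Java 9+):

```java
import java.util.Set;

// A "character class": a Set of characters with a common property,
// queried with contains() instead of hand-written range checks.
public class CharClassDemo {
    private static final Set<Character> DIGITS =
            Set.of('0', '1', '2', '3', '4', '5', '6', '7', '8', '9');

    static boolean isDigit(char c) {
        return DIGITS.contains(c);   // reads better than '0' <= c && c <= '9'
    }

    public static void main(String[] args) {
        System.out.println(isDigit('7'));   // true
        System.out.println(isDigit('x'));   // false
        // the standard library already knows about the many whitespace
        // characters beyond '\u0020':
        System.out.println(Character.isWhitespace('\t'));   // true
    }
}
```

For whitespace in particular, `Character.isWhitespace` is effectively a predefined character class, so you rarely need to build that set yourself.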