Nitpick
private boolean exausthed = false;
Pedantic, but this should be spelled exhausted.
Class vs. Object
private Set<Character> blankChars = new HashSet<Character>();
And then later in the constructor
blankChars.add('\r');
blankChars.add('\n');
blankChars.add((char) 8);
blankChars.add((char) 9);
blankChars.add((char) 11);
blankChars.add((char) 12);
blankChars.add((char) 32);
This creates a separate set for each instance of the class, even though every instance holds the same values. Instead, make it static:
private static Set<Character> blankChars = new HashSet<Character>();
static {
blankChars.add('\r');
blankChars.add('\n');
blankChars.add((char) 8);
blankChars.add((char) 9);
blankChars.add((char) 11);
blankChars.add((char) 12);
blankChars.add((char) 32);
}
Now we just have one copy for all instances of the class. The static block handles initialization.
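As a variation on the static block, and assuming Java 9 or later is available, the same set could be built in one expression and made unmodifiable at the same time. This is only a sketch: the class name BlankChars and the helper isBlank are hypothetical and not part of your code.

```java
import java.util.Set;

class BlankChars {
    // One shared, unmodifiable set for the whole class (Set.of needs Java 9+).
    private static final Set<Character> BLANK_CHARS =
            Set.of('\r', '\n', (char) 8, (char) 9, (char) 11, (char) 12, (char) 32);

    // Hypothetical helper, added only to make the sketch usable on its own.
    static boolean isBlank(char c) {
        return BLANK_CHARS.contains(c);
    }

    public static void main(String[] args) {
        System.out.println(isBlank(' '));  // true
        System.out.println(isBlank('a'));  // false
    }
}
```

Marking the field final also prevents it from being reassigned by accident.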
Iterator
while (!lexer.isExausthed()) {
System.out.printf("%-18s %s\n", lexer.currentToken(), lexer.currentLexema());
lexer.moveAhead();
}
This pattern almost matches an iterator, albeit with different names. Consider
while (lexer.hasNext()) {
System.out.printf("%-18s %s\n", lexer.currentToken(), lexer.currentLexema());
lexer.next();
}
Now the names match, but the behavior is still slightly off: an iterator's next() both advances and returns the element.
while (lexer.hasNext()) {
System.out.printf("%-18s %s\n", lexer.next(), lexer.currentLexema());
}
And you'd drop the moveAhead call from the constructor, as it is no longer necessary to prime the pump that way. Actually implementing Iterable<Token> would allow you to say
for (Token token : lexer) {
System.out.printf("%-18s %s\n", token, lexer.currentLexema());
}
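To show the shape of the Iterable approach, here is a minimal sketch. It is not your lexer: IterableLexer is a hypothetical stand-in that yields plain String lexemes and only recognizes runs of word characters, so anything else (such as =) is skipped. The point is the hasNext/next contract, where hasNext looks ahead without consuming and next returns the element and advances.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the reviewed lexer; it only finds runs of word characters.
class IterableLexer implements Iterable<String> {
    private static final Pattern LEXEME = Pattern.compile("\\w+");
    private final String input;

    IterableLexer(String input) {
        this.input = input;
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private final Matcher matcher = LEXEME.matcher(input);
            private String pending;  // found by hasNext() but not yet returned
            private boolean done;    // set once the input is exhausted

            @Override
            public boolean hasNext() {
                if (pending == null && !done) {
                    if (matcher.find()) {
                        pending = matcher.group();
                    } else {
                        done = true;
                    }
                }
                return pending != null;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String result = pending;
                pending = null;
                return result;
            }
        };
    }

    public static void main(String[] args) {
        for (String lexeme : new IterableLexer("let letter = 42")) {
            System.out.println(lexeme);
        }
    }
}
```

Note that no priming call is needed in the constructor; the first hasNext() does the look-ahead.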
Build for unit testing
private boolean findNextToken() {
This would be difficult to unit test.
static boolean findNextToken(StringBuilder input) {
This would be easier. Its visibility is less restricted and it is possible to call it as
findNextToken(new StringBuilder("foo"));
It doesn't rely on object state.
Performance
The thing that looks non-performant to me is
for (Token t : Token.values()) {
int end = t.endOfMatch(input.toString());
if (end != -1) {
token = t;
lexema = input.substring(0, end);
input.delete(0, end);
return true;
}
}
Rather than trying every token in turn, consider building a data structure that goes the other way, from a leading character to the tokens that could start with it.
for (Token t : possibleTokens.get(input.charAt(0))) {
So for a given character, what tokens could possibly match? For example, if the first character is a "w", then the token might be while or it might be an identifier. It's not going to be a comma or relational operator. So that drops twenty-five comparisons down to just two. And in many cases, there would be only one.
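One way such a possibleTokens map might be built is sketched below. Everything here is an assumption for illustration: the trimmed-down Token enum, the firstChar field (a one-character regular expression describing how each token may begin), the class name TokenIndex, and the restriction to ASCII. Your real enum would need an equivalent piece of information for each of its members.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TokenIndex {
    // Hypothetical, trimmed-down token set; each entry states which
    // characters it can start with as a one-character regular expression.
    enum Token {
        WHILE("w"), COMMA(","), NUMBER("\\d"), IDENTIFIER("[A-Za-z_]");

        final String firstChar;

        Token(String firstChar) {
            this.firstChar = firstChar;
        }
    }

    // Maps a leading character to the tokens that could begin with it, in enum order.
    static final Map<Character, List<Token>> POSSIBLE_TOKENS = new HashMap<>();

    static {
        // Only ASCII is indexed, which is an assumption of this sketch.
        for (char c = 0; c < 128; c++) {
            for (Token t : Token.values()) {
                if (String.valueOf(c).matches(t.firstChar)) {
                    POSSIBLE_TOKENS.computeIfAbsent(c, k -> new ArrayList<>()).add(t);
                }
            }
        }
    }

    // Returns the candidates for a character, or an empty list if none can match.
    static List<Token> candidates(char c) {
        return POSSIBLE_TOKENS.getOrDefault(c, List.of());
    }

    public static void main(String[] args) {
        System.out.println(candidates('w'));  // [WHILE, IDENTIFIER]
        System.out.println(candidates(','));  // [COMMA]
    }
}
```

Because the lists keep enum order, keywords are still tried before IDENTIFIER, so the existing priority between tokens is preserved.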
Bug?
I didn't try it, but I think that you might mishandle an identifier like
let letter = "a";
I think that this would get tokenized as
let
let
ter
=
"a"
;
Presumably you'd prefer that the second lexema be letter.
You may want to make "let" be "let\\b" instead. That should force it not to match if the word continues. You should also do this for other keywords, although this would be a less common problem for while or else.
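The effect of the word boundary can be checked in isolation. In this sketch I'm assuming that endOfMatch returns the end index of a match anchored at the start of the input, or -1 if there is none; the class name KeywordBoundary and this standalone version of the method are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordBoundary {
    // Assumed behavior of endOfMatch: end index of a match anchored at the
    // start of the input, or -1 when the pattern does not match there.
    static int endOfMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        System.out.println(endOfMatch("let", "letter = \"a\";"));     // 3: matches the prefix of letter
        System.out.println(endOfMatch("let\\b", "letter = \"a\";"));  // -1: the word continues
        System.out.println(endOfMatch("let\\b", "let letter"));       // 3: a real keyword still matches
    }
}
```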
Are numbers valid identifiers?
IDENTIFIER ("\\w+");
This would match, e.g. 9f. But you'll never see that happen, because the 9 gets parsed as a number first.
If that's an intentional behavior, you should comment it. Even better, unit test it. Then you are protected from a later edit putting IDENTIFIER before NUMBER.
Personally, I'd prefer that the regular expression only accept what it is supposed to accept. So if an identifier must start with a letter or underscore, the regular expression should capture that.
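As a sketch of what that could look like, assuming the rule really is "a letter or underscore, followed by any number of word characters" (I don't know your language's actual rule), the pattern below rejects 9f on its own merits rather than relying on enum order. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class IdentifierPattern {
    // Assumed rule: starts with a letter or underscore,
    // then zero or more word characters.
    static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_]\\w*");

    static boolean isIdentifier(String s) {
        return IDENTIFIER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("letter"));  // true
        System.out.println(isIdentifier("_tmp1"));   // true
        System.out.println(isIdentifier("9f"));      // false
    }
}
```

A unit test over a few such inputs would also document the intended behavior.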