
edited May 23, 2017 at 12:40

Community Bot


answered Aug 15, 2016 at 4:35

mdfst13


Nitpick

 private boolean exausthed = false;

Pedantic, but this should be spelled exhausted. The same goes for the isExausthed method.

Class vs. Object

 private Set<Character> blankChars = new HashSet<Character>();

And then later in the constructor

 blankChars.add('\r');
 blankChars.add('\n');
 blankChars.add((char) 8);
 blankChars.add((char) 9);
 blankChars.add((char) 11);
 blankChars.add((char) 12);
 blankChars.add((char) 32);

This creates a separate set for each instance of the class, even though every instance holds the same values. Instead, make it static:

 private static Set<Character> blankChars = new HashSet<Character>();
 static {
     blankChars.add('\r');
     blankChars.add('\n');
     blankChars.add((char) 8);
     blankChars.add((char) 9);
     blankChars.add((char) 11);
     blankChars.add((char) 12);
     blankChars.add((char) 32);
 }

Now we just have one copy for all instances of the class. The static block handles initialization.
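As a variation, here is a minimal sketch that also makes the shared set final and unmodifiable, so no instance can change it by accident. The class name BlankChars and the helper isBlank are my own names for illustration, not part of the reviewed code, and the diamond operator here assumes Java 8 or later.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical holder class; in the reviewed code this would live in the lexer.
class BlankChars {
    // One copy for the whole class, wrapped so it cannot be modified after initialization.
    private static final Set<Character> BLANK_CHARS = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList(
                    '\r', '\n', (char) 8, (char) 9, (char) 11, (char) 12, (char) 32)));

    // Hypothetical helper: true if c is one of the characters listed above.
    static boolean isBlank(char c) {
        return BLANK_CHARS.contains(c);
    }

    public static void main(String[] args) {
        System.out.println(isBlank(' '));  // prints "true"
        System.out.println(isBlank('x'));  // prints "false"
    }
}
```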

Iterator

 while (!lexer.isExausthed()) {
     System.out.printf("%-18s %s\n", lexer.currentToken(), lexer.currentLexema());
     lexer.moveAhead();
 }

This pattern almost matches an iterator, albeit with different names. Consider

 while (lexer.hasNext()) {
     System.out.printf("%-18s %s\n", lexer.currentToken(), lexer.currentLexema());
     lexer.next();
 }

Now the names match, but the behavior is slightly off: an iterator's next() should both advance and return the element. So

 while (lexer.hasNext()) {
     System.out.printf("%-18s %s\n", lexer.next(), lexer.currentLexema());
 }

You could then drop the moveAhead call from the constructor, as it is no longer necessary to prime the pump that way. Actually implementing Iterable<Token> would also allow you to write

 for (Token token : lexer) {
     System.out.printf("%-18s %s\n", token, lexer.currentLexema());
 }
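To show the shape of the Iterable contract, here is a rough sketch using a deliberately simplified stand-in. WordLexer is hypothetical: it just yields whitespace-separated lexemes as strings rather than your Token values, but the hasNext/next behavior is what the for-each loop relies on.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical, simplified lexer: yields whitespace-separated lexemes.
class WordLexer implements Iterable<String> {
    private final String input;

    WordLexer(String input) {
        this.input = input;
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            // Start at the first non-blank character, so no priming call is needed.
            private int position = skipBlanks(0);

            @Override
            public boolean hasNext() {
                return position < input.length();
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                int start = position;
                while (position < input.length()
                        && !Character.isWhitespace(input.charAt(position))) {
                    position++;
                }
                String lexema = input.substring(start, position);
                position = skipBlanks(position);  // advance past trailing blanks
                return lexema;
            }
        };
    }

    // Returns the index of the first non-whitespace character at or after from.
    private int skipBlanks(int from) {
        while (from < input.length() && Character.isWhitespace(input.charAt(from))) {
            from++;
        }
        return from;
    }

    // Convenience for demonstration: collect every lexema into a list.
    static List<String> lexemes(String input) {
        List<String> result = new ArrayList<>();
        for (String lexema : new WordLexer(input)) {
            result.add(lexema);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(lexemes("let letter = 1"));  // prints "[let, letter, =, 1]"
    }
}
```

Note that next() both returns the current element and advances, which is why the constructor no longer has to move ahead first.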

Build for unit testing

 private boolean findNextToken() {

This would be difficult to unit test.

 static boolean findNextToken(StringBuilder input) {

This would be easier. It is package-private rather than private, so a test in the same package can call it directly:

 findNextToken(new StringBuilder("foo"));

It doesn't rely on object state.
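As a rough illustration of why that helps, here is a sketch with a hypothetical stand-in, findNextNumber, that only consumes a leading run of digits. It is not your findNextToken, but it has the same shape: static, package-private, and working only on the StringBuilder it is given.

```java
// Hypothetical class and method names, for illustration only.
class TokenFinder {
    // Consumes a leading run of digits from input; returns false if there is none.
    static boolean findNextNumber(StringBuilder input) {
        int end = 0;
        while (end < input.length() && Character.isDigit(input.charAt(end))) {
            end++;
        }
        if (end == 0) {
            return false;  // nothing matched, input is left untouched
        }
        input.delete(0, end);
        return true;
    }

    public static void main(String[] args) {
        StringBuilder input = new StringBuilder("42+x");
        System.out.println(findNextNumber(input) + " " + input);  // prints "true +x"
    }
}
```

A test only has to build an input, call the method, and check the return value and what is left in the StringBuilder. No lexer instance is needed.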

Performance

The part that looks slow to me is

 for (Token t : Token.values()) {
     int end = t.endOfMatch(input.toString());

     if (end != -1) {
         token = t;
         lexema = input.substring(0, end);
         input.delete(0, end);
         return true;
     }
 }

Every call tries each token in turn and also converts the whole StringBuilder to a String once per token. Rather than trying each token, consider building a data structure that goes the other way.

 for (Token t : possibleTokens.get(input.charAt(0))) {

So for a given character, what tokens could possibly match? For example, if the first character is a "w", then the token might be while or it might be an identifier. It's not going to be a comma or relational operator. So that drops twenty-five comparisons down to just two. And in many cases, there would be only one.
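One way to build such a lookup, as a sketch only: the reduced Token enum, its firstChars field, and the candidates method below are all assumptions of mine. The idea is that each token declares which characters it can start with, and a static block inverts that into a map keyed by first character.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical wrapper class for the sketch.
class FirstCharIndex {
    // Hypothetical, much-reduced token set; each token lists the characters it can start with.
    enum Token {
        WHILE("w"),
        LET("l"),
        COMMA(","),
        IDENTIFIER("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_");

        final String firstChars;

        Token(String firstChars) {
            this.firstChars = firstChars;
        }
    }

    // Maps a first character to the tokens that could begin with it, in declaration order.
    private static final Map<Character, List<Token>> POSSIBLE_TOKENS = new HashMap<>();
    static {
        for (Token t : Token.values()) {
            for (char c : t.firstChars.toCharArray()) {
                POSSIBLE_TOKENS.computeIfAbsent(c, k -> new ArrayList<>()).add(t);
            }
        }
    }

    // Returns the candidate tokens for a first character, or an empty list if there are none.
    static List<Token> candidates(char first) {
        return POSSIBLE_TOKENS.getOrDefault(first, Collections.emptyList());
    }

    public static void main(String[] args) {
        System.out.println(candidates('w'));  // prints "[WHILE, IDENTIFIER]"
        System.out.println(candidates(','));  // prints "[COMMA]"
    }
}
```

Because the lists keep declaration order, keywords are still tried before IDENTIFIER, the same priority the loop over Token.values() gives you.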

Bug?

I didn't try it, but I think that you might mishandle an identifier like

let letter = "a";

I think that this would get tokenized as

let
let
ter
=
"a"
;

Presumably you'd prefer that the second lexema be letter.

You may want to change the pattern "let" to "let\\b" instead. The word boundary \b should force it not to match if the word continues. You should also do this for other keywords, although this would be a less common problem for while or else.
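A quick way to check this without running the whole lexer, assuming endOfMatch returns the end index of a match anchored at the start of the input, or -1 if there is none. The static helper below is my own stand-in for your method.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for the check.
class KeywordBoundary {
    // Returns the end index of a match anchored at the start of input, or -1 if none.
    static int endOfMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        System.out.println(endOfMatch("let", "letter = \"a\";"));     // prints "3"
        System.out.println(endOfMatch("let\\b", "letter = \"a\";"));  // prints "-1"
        System.out.println(endOfMatch("let\\b", "let x"));            // prints "3"
    }
}
```

The first call shows the suspected problem: the plain pattern matches the first three characters of letter. With \b the keyword only matches when it stands alone.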

Are numbers valid identifiers?

 IDENTIFIER ("\\w+");

This would match, e.g., 9f. But you'll never see that happen, as the 9 will get consumed as a number first.

If that's an intentional behavior, you should comment it. Even better, unit test it. Then you are protected from a later edit putting IDENTIFIER before NUMBER.

Personally, I'd prefer that the regular expression only accept what it is supposed to accept. So if an identifier must start with a letter or underscore, the regular expression should capture that.
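For example, assuming identifiers must start with an ASCII letter or underscore (that rule is my assumption, your language may differ), the pattern could be tightened as in this sketch:

```java
import java.util.regex.Pattern;

// Hypothetical class name, for illustration only.
class IdentifierPattern {
    // A letter or underscore first, then any number of word characters.
    static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_]\\w*");

    // True only if the whole string is a valid identifier under the assumed rule.
    static boolean isIdentifier(String s) {
        return IDENTIFIER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("letter"));  // prints "true"
        System.out.println(isIdentifier("9f"));      // prints "false"
    }
}
```

With this version the result no longer depends on NUMBER being declared before IDENTIFIER.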
