I recently created a lexical analyser in Java, but I don't think its performance is very good.
The code works, but when I debugged the program, it took around 100 milliseconds for only two tokens...
Can you read my code and give me some tips about performance?
Lexer.java:
package me.minkizz.minlang;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

public class Lexer {

    private StringBuilder input = new StringBuilder();
    private Token token;
    private String lexema;
    private boolean exhausted;
    private String errorMessage = "";
    private static Set<Character> blankChars = new HashSet<Character>();

    static {
        blankChars.add('\r');
        blankChars.add('\n');
        blankChars.add((char) 8);
        blankChars.add((char) 9);
        blankChars.add((char) 11);
        blankChars.add((char) 12);
        blankChars.add((char) 32);
    }

    public Lexer(String filePath) {
        try (Stream<String> st = Files.lines(Paths.get(filePath))) {
            st.forEach(input::append);
        } catch (IOException ex) {
            exhausted = true;
            errorMessage = "Could not read file: " + filePath;
            return;
        }
        moveAhead();
    }

    public void moveAhead() {
        if (exhausted) {
            return;
        }
        if (input.length() == 0) {
            exhausted = true;
            return;
        }
        ignoreWhiteSpaces();
        if (findNextToken()) {
            return;
        }
        exhausted = true;
        if (input.length() > 0) {
            errorMessage = "Unexpected symbol: '" + input.charAt(0) + "'";
        }
    }

    private void ignoreWhiteSpaces() {
        int charsToDelete = 0;
        while (blankChars.contains(input.charAt(charsToDelete))) {
            charsToDelete++;
        }
        if (charsToDelete > 0) {
            input.delete(0, charsToDelete);
        }
    }

    private boolean findNextToken() {
        for (Token t : Token.values()) {
            int end = t.endOfMatch(input.toString());
            if (end != -1) {
                token = t;
                lexema = input.substring(0, end);
                input.delete(0, end);
                return true;
            }
        }
        return false;
    }

    public Token currentToken() {
        return token;
    }

    public String currentLexema() {
        return lexema;
    }

    public boolean isSuccessful() {
        return errorMessage.isEmpty();
    }

    public String errorMessage() {
        return errorMessage;
    }

    public boolean isExhausted() {
        return exhausted;
    }
}
Token.java:
package me.minkizz.minlang;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public enum Token {

    PRINT_KEYWORD("print\\b"), PRINTLN_KEYWORD("println\\b"), OPEN_PARENTHESIS("\\("), CLOSE_PARENTHESIS("\\)"),
    STRING("\"[^\"]+\""), NUMBER("\\d+(\\.\\d+)?");

    private final Pattern pattern;

    Token(String regex) {
        pattern = Pattern.compile("^" + regex);
    }

    int endOfMatch(String s) {
        Matcher m = pattern.matcher(s);
        if (m.find()) {
            return m.end();
        }
        return -1;
    }
}
Main.java:
package me.minkizz.minlang;

public class Main {

    public static void main(String[] args) {
        new Main();
    }

    public Main() {
        long start = System.nanoTime();
        Interpreter.execute("C:\\Users\\leodu\\OneDrive\\Bureau\\minlang.txt");
        long end = System.nanoTime();
        System.out
                .println("Program executed in " + (end - start) + "ns (" + Math.round((end - start) / 1000000) + "ms)");
    }
}
Interpreter.java:
package me.minkizz.minlang;

public class Interpreter {

    private static Token previousToken;

    public static void execute(String fileName) {
        Lexer lexer = new Lexer(fileName);
        while (!lexer.isExhausted()) {
            Token token = lexer.currentToken();
            String lexema = lexer.currentLexema();
            if (previousToken != null) {
                if (token == Token.STRING || token == Token.NUMBER) {
                    if (previousToken == Token.PRINT_KEYWORD) {
                        System.out.print(lexema);
                    } else if (previousToken == Token.PRINTLN_KEYWORD) {
                        System.out.println(lexema);
                    }
                }
            }
            previousToken = token;
            lexer.moveAhead();
        }
    }
}
Example input:
print "a"
print "b"
- Never trust sub-second timings. Seriously, micro-benchmarking JIT compiled & improved execution is easy to get wrong; use a framework. (greybeard, May 24, 2019)
- Can you provide a generator for relevant input or some other access? (greybeard, May 24, 2019)
- Can you specify what language this code is designed to handle? (200_success, May 24, 2019)
- @200_success, it's a custom programming language... And it's not the problem, I'm talking about Lexer/Token here. (Minkizz, May 25, 2019)
- Please edit the question to include some example input. Normally you should not edit the question after an answer has been posted, but in this case, adding an example input file does not invalidate any answer, so that's ok. (Roland Illig, May 26, 2019)
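(The "use a framework" advice above refers to a benchmarking harness such as JMH, which handles warm-up and repeated measurement. A minimal, hypothetical benchmark for this lexer could look like the sketch below; the class name, method name and the relative file path are placeholders, not part of the original post.)

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;

public class LexerBenchmark {

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public Token lexFirstToken() {
        // Construct the lexer and lex the first token; returning the result keeps the
        // JIT from eliminating the work as dead code. JMH handles the warm-up iterations.
        Lexer lexer = new Lexer("minlang.txt");
        return lexer.currentToken();
    }
}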
2 Answers
Here are my comments:

1. ignoreWhiteSpaces(): instead of looping on individual chars, it can be replaced with a regex that finds the first char not in the list.
2. Deleting from the StringBuilder is unnecessary. Matcher has find(int start).
3. Once you adopt point 2, you don't need the StringBuilder at all. You can read the whole input at once, using Files.readAllBytes() (which probably performs better than reading one line at a time), and just keep an index pointer that moves along the input. So, for example, ignoreWhiteSpaces() will return the index of the first non-whitespace char after the index pointer (see the sketch after this list).
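A minimal sketch of that index-pointer approach, assuming the whole file fits in memory, that Token exposes its compiled Pattern through a pattern() accessor, and that the "^" prefix is dropped from the patterns (the class name IndexLexer and that accessor are my own, not part of the original code):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IndexLexer {

    // \s covers \t, \n, \x0B, \f, \r and space; the original blankChars set also
    // contained backspace (char 8), which would need to be added explicitly.
    private static final Pattern NON_BLANK = Pattern.compile("\\S");

    private final String input; // whole file, read once
    private int pos;            // index pointer into the input
    private Token token;
    private String lexeme;

    public IndexLexer(String filePath) throws IOException {
        this.input = new String(Files.readAllBytes(Paths.get(filePath)));
    }

    // Point 1: a regex finds the first non-blank char; the pointer just jumps there.
    private void skipWhitespace() {
        Matcher m = NON_BLANK.matcher(input);
        pos = m.find(pos) ? m.start() : input.length();
    }

    // Points 2 and 3: nothing is deleted, only the pointer advances.
    public boolean moveAhead() {
        skipWhitespace();
        if (pos >= input.length()) {
            return false; // exhausted
        }
        for (Token t : Token.values()) {
            Matcher m = t.pattern().matcher(input);
            // find(int start) searches from the pointer; accept the match only if it
            // begins exactly there, so no input is skipped silently.
            if (m.find(pos) && m.start() == pos) {
                token = t;
                lexeme = input.substring(pos, m.end());
                pos = m.end();
                return true;
            }
        }
        return false; // unexpected symbol at pos
    }
}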
- Instead of find use lookingAt in order not to skip unmatched "garbage".
- Keywords/identifiers: first match something like a word (identifier), and then check for keywords (Map<String, Keyword or Token>).
- Whitespace could be a pattern too, \\s*, but that is not necessarily faster. More compact though. Or use Character.isWhitespace.
- The usage of StringBuilder serves no purpose. Rather than deleting, maintain a position. One can use that for lookingAt (see the sketch after this list).
- Files.lines is fine, but then turn it into a Stream<Token> immediately. Mind that the default encoding is UTF-8 (but IMHO that is the preferable, international, encoding). The line read is stripped of the terminating line break (like \n or \r\n), which might have to be taken into account for a whitespace-sensitive grammar: st.forEach(line -> input.append(line).append('\n'));