I recently created a lexical analyser in Java, but I don't think its performance is very good.
The code works, but when I debugged the program, it took around 100 milliseconds for only two tokens...
Can you read my code and give me some tips about performance?
Lexer.java:
package me.minkizz.minlang;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

public class Lexer {

    private StringBuilder input = new StringBuilder();
    private Token token;
    private String lexema;
    private boolean exhausted;
    private String errorMessage = "";
    private static Set<Character> blankChars = new HashSet<Character>();

    static {
        blankChars.add('\r');
        blankChars.add('\n');
        blankChars.add((char) 8);
        blankChars.add((char) 9);
        blankChars.add((char) 11);
        blankChars.add((char) 12);
        blankChars.add((char) 32);
    }

    public Lexer(String filePath) {
        try (Stream<String> st = Files.lines(Paths.get(filePath))) {
            st.forEach(input::append);
        } catch (IOException ex) {
            exhausted = true;
            errorMessage = "Could not read file: " + filePath;
            return;
        }
        moveAhead();
    }

    public void moveAhead() {
        if (exhausted) {
            return;
        }
        if (input.length() == 0) {
            exhausted = true;
            return;
        }
        ignoreWhiteSpaces();
        if (findNextToken()) {
            return;
        }
        exhausted = true;
        if (input.length() > 0) {
            errorMessage = "Unexpected symbol: '" + input.charAt(0) + "'";
        }
    }

    private void ignoreWhiteSpaces() {
        int charsToDelete = 0;
        while (blankChars.contains(input.charAt(charsToDelete))) {
            charsToDelete++;
        }
        if (charsToDelete > 0) {
            input.delete(0, charsToDelete);
        }
    }

    private boolean findNextToken() {
        for (Token t : Token.values()) {
            int end = t.endOfMatch(input.toString());
            if (end != -1) {
                token = t;
                lexema = input.substring(0, end);
                input.delete(0, end);
                return true;
            }
        }
        return false;
    }

    public Token currentToken() {
        return token;
    }

    public String currentLexema() {
        return lexema;
    }

    public boolean isSuccessful() {
        return errorMessage.isEmpty();
    }

    public String errorMessage() {
        return errorMessage;
    }

    public boolean isExhausted() {
        return exhausted;
    }
}
Token.java:
package me.minkizz.minlang;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public enum Token {

    PRINT_KEYWORD("print\\b"), PRINTLN_KEYWORD("println\\b"), OPEN_PARENTHESIS("\\("), CLOSE_PARENTHESIS("\\)"),
    STRING("\"[^\"]+\""), NUMBER("\\d+(\\.\\d+)?");

    private final Pattern pattern;

    Token(String regex) {
        pattern = Pattern.compile("^" + regex);
    }

    int endOfMatch(String s) {
        Matcher m = pattern.matcher(s);
        if (m.find()) {
            return m.end();
        }
        return -1;
    }
}
Main.java:
package me.minkizz.minlang;

public class Main {

    public static void main(String[] args) {
        new Main();
    }

    public Main() {
        long start = System.nanoTime();
        Interpreter.execute("C:\\Users\\leodu\\OneDrive\\Bureau\\minlang.txt");
        long end = System.nanoTime();
        System.out
                .println("Program executed in " + (end - start) + "ns (" + Math.round((end - start) / 1000000) + "ms)");
    }
}
Interpreter.java:
package me.minkizz.minlang;

public class Interpreter {

    private static Token previousToken;

    public static void execute(String fileName) {
        Lexer lexer = new Lexer(fileName);
        while (!lexer.isExhausted()) {
            Token token = lexer.currentToken();
            String lexema = lexer.currentLexema();
            if (previousToken != null) {
                if (token == Token.STRING || token == Token.NUMBER) {
                    if (previousToken == Token.PRINT_KEYWORD) {
                        System.out.print(lexema);
                    } else if (previousToken == Token.PRINTLN_KEYWORD) {
                        System.out.println(lexema);
                    }
                }
            }
            previousToken = token;
            lexer.moveAhead();
        }
    }
}
Example input:
print "a"
print "b"
- Never trust sub-second timings. Seriously, micro-benchmarking JIT compiled & improved execution is easy to get wrong; use a framework. (greybeard, May 24, 2019)
- Can you provide a generator for relevant input or some other access? (greybeard, May 24, 2019)
- Can you specify what language this code is designed to handle? (200_success, May 24, 2019)
- @200_success, it's a custom programming language... And it's not the problem, I'm talking about Lexer/Token here. (Minkizz, May 25, 2019)
- Please edit the question to include some example input. Normally you should not edit the question after an answer has been posted, but in this case, adding an example input file does not invalidate any answer, so that's ok. (Roland Illig, May 26, 2019)
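(The "use a framework" advice above refers to a benchmarking harness such as JMH, which handles warm-up and repeated measurement. A minimal, hypothetical benchmark for this lexer could look like the sketch below; the class name, method name and the relative file path are placeholders, not part of the original post.)

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;

public class LexerBenchmark {

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public Token lexFirstToken() {
        // Construct the lexer and lex the first token; returning the result keeps the
        // JIT from eliminating the work as dead code. JMH handles the warm-up iterations.
        Lexer lexer = new Lexer("minlang.txt");
        return lexer.currentToken();
    }
}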
2 Answers
Here are my comments:

1. ignoreWhiteSpaces(): instead of looping on individual chars, it can be replaced with a regex that finds the first char not in the list.
2. Deleting from the StringBuilder is unnecessary. Matcher has find(int start).
3. Once you adopt point 2, you don't need the StringBuilder at all. You can read the whole input at once, using Files.readAllBytes() (which probably performs better than reading one line at a time), and just keep an index pointer that moves along the input. So, for example, ignoreWhiteSpaces() will return the index of the first non-whitespace char after the index pointer (see the sketch after this list).
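A minimal sketch of that index-pointer approach, assuming the whole file fits in memory, that Token exposes its compiled Pattern through a pattern() accessor, and that the "^" prefix is dropped from the patterns (the class name IndexLexer and that accessor are my own, not part of the original code):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IndexLexer {

    // \s covers \t, \n, \x0B, \f, \r and space; the original blankChars set also
    // contained backspace (char 8), which would need to be added explicitly.
    private static final Pattern NON_BLANK = Pattern.compile("\\S");

    private final String input; // whole file, read once
    private int pos;            // index pointer into the input
    private Token token;
    private String lexeme;

    public IndexLexer(String filePath) throws IOException {
        this.input = new String(Files.readAllBytes(Paths.get(filePath)));
    }

    // Point 1: a regex finds the first non-blank char; the pointer just jumps there.
    private void skipWhitespace() {
        Matcher m = NON_BLANK.matcher(input);
        pos = m.find(pos) ? m.start() : input.length();
    }

    // Points 2 and 3: nothing is deleted, only the pointer advances.
    public boolean moveAhead() {
        skipWhitespace();
        if (pos >= input.length()) {
            return false; // exhausted
        }
        for (Token t : Token.values()) {
            Matcher m = t.pattern().matcher(input);
            // find(int start) searches from the pointer; accept the match only if it
            // begins exactly there, so no input is skipped silently.
            if (m.find(pos) && m.start() == pos) {
                token = t;
                lexeme = input.substring(pos, m.end());
                pos = m.end();
                return true;
            }
        }
        return false; // unexpected symbol at pos
    }
}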
- Instead of find use lookingAt in order not to skip unmatched "garbage".
- Keywords/identifiers: first match something like a word (identifier), and then check for keywords (Map<String, Keyword or Token>).
- Whitespace could be a pattern too, \\s*, but that is not necessarily faster. More compact though. Or use Character.isWhitespace.
- The usage of StringBuilder serves no purpose. Rather than deleting, maintain a position. One can use that for lookingAt (see the sketch after this list).
- Files.lines is fine, but then turn it into a Stream<Token> immediately. Mind that the default encoding is UTF-8 (but IMHO that is the preferable, international, encoding). The line read is stripped of the terminating line break (like \n or \r\n), which might have to be taken into account for a whitespace-sensitive grammar: st.forEach(line -> input.append(line).append('\n'));