Debugging Python ANTLR4 Grammar

Question 1

I'm having an issue with my ANTLR4 grammar not parsing a string correctly. I'm more interested in learning how to debug this kind of problem than in fixing this specific case. How can I generate any type of debug information? I want to know what the parser is "thinking" as it parses the string.

The grammar can be found here: https://github.com/Metrink/metrink-fe/blob/master/metrink.g4

I'm using the simple test string: -1d metric('blah', 'blah', 'blah')

I get the following error: 1:2 missing TIME_INDICATOR at 'd'

The grammar defines TIME_INDICATOR as [shmd] so I'm not sure how it's missing a TIME_INDICATOR at the character d when that is one of the possible tokens. What am I missing here?

I'm using Python3 generated from ANTLR4.

Comment

The link is broken! The correct file name is Metrink.g4 (with a capital M).

Answer

What I usually do is first dump the tokens to see if the actual tokens the parser expects are created.

You can do that with a small test class like this (easily ported to Python):

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Token;

public class Main {

    static void test(String input) {
        // Run only the lexer and buffer every token it produces.
        metrinkLexer lexer = new metrinkLexer(new ANTLRInputStream(input));
        CommonTokenStream tokenStream = new CommonTokenStream(lexer);
        tokenStream.fill();

        System.out.printf("input: `%s`\n", input);

        // Print each token's symbolic name and text, skipping EOF.
        for (Token token : tokenStream.getTokens()) {
            if (token.getType() != Token.EOF) {
                System.out.printf(" %-20s %s\n", metrinkLexer.VOCABULARY.getSymbolicName(token.getType()), token.getText());
            }
        }

        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        test("-1d metric('blah', 'blah', 'blah')");
    }
}

If you run the code above, the following will get printed to your console:

input: `-1d metric('blah', 'blah', 'blah')`
 MINUS -
 INTEGER_LITERAL 1
 IDENTIFIER d
 METRIC metric
 LPAREN (
 STRING_LITERAL 'blah'
 COMMA ,
 STRING_LITERAL 'blah'
 COMMA ,
 STRING_LITERAL 'blah'
 RPAREN )

As you can see, the d is being tokenized as an IDENTIFIER instead of a TIME_INDICATOR. This is because the IDENTIFIER rule is defined before your TIME_INDICATOR rule. The lexer does not "listen" to what the parser might need: it simply matches as many characters as possible, and if two or more rules match the same number of characters, the rule defined first "wins".
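The matching rule described above can be illustrated with a toy tokenizer. This is only a sketch of the "longest match, then first rule wins" behaviour, not ANTLR's actual implementation; the rule names mirror the grammar, but the regular expressions are simplified assumptions rather than copies of the definitions in the grammar file.

```python
import re

# Toy rules in definition order; the names mirror the grammar, but the
# patterns are simplified assumptions, not copied from the grammar file.
RULES = [
    ("MINUS", r"-"),
    ("INTEGER_LITERAL", r"[0-9]+"),
    ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("TIME_INDICATOR", r"[shmd]"),
]


def tokenize(text, rules):
    """Longest match wins; on a tie, the rule listed first wins."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():  # skip whitespace between tokens
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, regex in compiled:
            match = regex.match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("-1d", RULES))
# [('MINUS', '-'), ('INTEGER_LITERAL', '1'), ('IDENTIFIER', 'd')]
```

Moving TIME_INDICATOR ahead of IDENTIFIER in this toy list makes d come out as TIME_INDICATOR, but then every lone s, h, m or d becomes a TIME_INDICATOR regardless of where it appears in the input.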

So, d can be tokenized either as a TIME_INDICATOR or as an IDENTIFIER. If this is dependent on context, I suggest you tokenize it as an IDENTIFIER (and remove TIME_INDICATOR) and create a parser rule like this:

relative_time_literal:
 MINUS? INTEGER_LITERAL time_indicator;
time_indicator:
 {_input.LT(1).getText().matches("[shmd]")}? IDENTIFIER;

The { ... }? is called a semantic predicate; see the Stack Overflow question "Semantic predicates in ANTLR4?".
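Note that the predicate above is written for the Java target: matches is java.lang.String.matches, which only succeeds when the whole string matches the pattern. Since the question uses the Python3 target, the code inside { ... }? would have to be Python instead. The sketch below shows the same check as a plain function; is_time_indicator is a hypothetical helper name, not part of ANTLR.

```python
import re

# Hypothetical helper: mirrors Java's String.matches("[shmd]"), which only
# succeeds when the whole string matches, hence fullmatch rather than match.
_TIME_INDICATOR = re.compile(r"[shmd]")


def is_time_indicator(text):
    """Return True if text is exactly one of s, h, m or d."""
    return _TIME_INDICATOR.fullmatch(text) is not None


for candidate in ("d", "m", "day", "x"):
    print(candidate, is_time_indicator(candidate))
```

In a Python3-target grammar, the predicate would presumably read something like {self._input.LT(1).text in ('s', 'h', 'm', 'd')}? IDENTIFIER, but that form is an assumption here and has not been tested against this grammar.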

Also, FALSE and TRUE will need to be placed before the IDENTIFIER rule.

EDIT April 6 2024

Petr Pivonka wrote:

PAY ATTENTION! The note "easily ported to Python" needs to be explained! [...]

In Python that could look like this:

import antlr4
from metrinkLexer import metrinkLexer


def test(source):
    # Run only the lexer and buffer every token it produces.
    lexer = metrinkLexer(antlr4.InputStream(source))
    token_stream = antlr4.CommonTokenStream(lexer)
    token_stream.fill()
    print(f"input: {source}")
    # Skip EOF (type -1) and print each token's symbolic name and text.
    for token in [t for t in token_stream.tokens if t.type != -1]:
        print(f" {lexer.symbolicNames[token.type].ljust(20)}{token.text}")


if __name__ == '__main__':
    test("-1d metric('blah', 'blah', 'blah')")

which will print:

input: -1d metric('blah', 'blah', 'blah')
 MINUS -
 INTEGER_LITERAL 1
 IDENTIFIER d
 METRIC metric
 LPAREN (
 STRING_LITERAL 'blah'
 COMMA ,
 STRING_LITERAL 'blah'
 COMMA ,
 STRING_LITERAL 'blah'
 RPAREN )

In other words:

metrinkLexer.VOCABULARY.getSymbolicName(type) becomes metrinkLexer.symbolicNames[type]
token.getType() becomes token.type
token.getText() becomes token.text
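Once the tokens can be dumped, the comparison with what the parser expects can also be automated. The following is a rough sketch: first_mismatch is a hypothetical helper, and the token objects and type numbers are stand-ins (only the .type and .text attributes from the mapping above are assumed) so that it runs without a generated lexer.

```python
from types import SimpleNamespace


def first_mismatch(tokens, symbolic_names, expected_names):
    """Return (index, expected, actual) for the first differing token name,
    or None when the names line up. EOF tokens (type -1) are ignored."""
    actual = [symbolic_names[t.type] for t in tokens if t.type != -1]
    for index, (want, got) in enumerate(zip(expected_names, actual)):
        if want != got:
            return index, want, got
    if len(actual) != len(expected_names):
        # One list is a prefix of the other; report where they diverge.
        index = min(len(actual), len(expected_names))
        want = expected_names[index] if index < len(expected_names) else None
        got = actual[index] if index < len(actual) else None
        return index, want, got
    return None


# Stand-in data so the sketch runs without a generated lexer; the type
# numbers are made up for illustration.
names = ["<INVALID>", "MINUS", "INTEGER_LITERAL", "IDENTIFIER", "TIME_INDICATOR"]
tokens = [
    SimpleNamespace(type=1, text="-"),
    SimpleNamespace(type=2, text="1"),
    SimpleNamespace(type=3, text="d"),
    SimpleNamespace(type=-1, text="<EOF>"),
]
print(first_mismatch(tokens, names, ["MINUS", "INTEGER_LITERAL", "TIME_INDICATOR"]))
# (2, 'TIME_INDICATOR', 'IDENTIFIER')
```

With a real lexer, the same call would take token_stream.tokens and lexer.symbolicNames from the Python example above.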

Petr Pivonka wrote:

It took me one day of labouring before I found that Java and Python are completely different and that the Python one is just an empty shell.

That is not true. Everything in the Java API is also in the Python API (and also in C#, JavaScript, TypeScript, etc).

Comment

I updated my code to include printing of tokens: github.com/Metrink/metrink-fe/commit/…

Comment

PAY ATTENTION! The note "easily ported to Python" needs to be explained! In fact, it is not possible at all to convert the Java-targeted code above to Python, even now in 2024. The reason is that the Python runtime is absolutely derelict. Only the class names are the same, but the attributes are completely different and most of them are even missing in the Python runtime ... forget Vocabulary, forget TokenNames, and many, many others ... they simply do not exist in the Python runtime. It took me one day of labouring before I found that Java and Python are completely different and that the Python one is just an empty shell.

Comment

@PetrPivonka I added a Python version. I think it's pretty close to the Java version. If you are not too familiar with ANTLR, I'd be hesitant with conclusions like "not possible" and "is just empty shell". As you can see from my Python example, this is not true. Next time, instead of comments like yours, I suggest you create a question here on SO and explain what you are trying to do. There are many people here willing to explain things.

Bart Kiers · Accepted Answer · 2016-05-03 18:31:01Z

