Writing a Recursive Descendent Parser for my grammar

Question 1

I've got to build an expression using a Recursive Descendent Parser Builder.

Here's my grammar :

----RULES----
<cond> → <termb> [OR <termb>]*
<termb>→<factb>[AND <factb>]*
<factb>→<expr> RELOP <expr> | NOT <factb> | OPAR <cond> CPAR
<expr> → [PLUS | MINUS] <term> [(PLUS <term>) | (MINUS <term>)]*
<term> → <termp> [(MULT <termp>) | (DIV <termp>) | (REM <termp>)]*
<termp> → <fact> [POWER <fact>]*
<fact> → ID | NUM | OPAR1 <expr> CPAR1
----TERMINALS----
ID → ("A" | ... | "Z" | "a" | ...| "z") [("A"| ... | "Z" | "a" | ...| "z" | "0" | ... | "9")]*
NUM → ("0" | ... | "9") [("0" | ... | "9")]*
OPAR → "("
CPAR → ")"
OPAR1 → "["
CPAR1 → "]"
RELOP → EQ | NEQ | GT | GE | LT | LE
EQ → "= ="
NEQ → "!="
GT → ">"
GE → ">="
LT → "<"
LE → "<="
POWER → "^"
DIV → "/"
REM → "%"
MULT → "*"
MINUS → "−"
PLUS → "+"
AND → "and" or "&&"
OR → "or" or "||"
NOT → "not" or "!"

Now, I've got a Composite structure with one class for each terminal and I'm trying to build it by following the rules of the above grammar.

The idea is that I need a method for each grammar rule and each of this methods has to build its part of the expression tree.

While it works with a simpler grammar (for example a grammar that allows just boolean expressions), I've got some problems with this one.

The main problem comes from the expr rule which forces me to work even with the unary versions of my + and - by allowing expressions such +3-4. This requires me to find out when the operand has to be considered unary and when binary!

Here's the code of my Builder class. Please note that something is in Italian but I've explained EVERYTHING, even my problems, using the comments. Also note that there is one line in which i used a pseudocode, so this one is NOT compilable!

package espressioniCondizionali;
import espressioniCondizionali.espressioni.*;
/**
 *
 * @author Stefano
 */
public class Builder6 {
 /**
 * This one is used just to parse the input String.
 * It has a prossimoSimbolo() method (something like a nextSymbol()) that returns a String with the next Symbol in the String
 */
 private GestoreInput gestoreInput;
 /**
 * Espressione is the Composite structure that represents and expression
 */
 private Espressione root = null;
 private String symbol = null;
 public Builder6(GestoreInput gestoreInput) {
 this.gestoreInput = gestoreInput;
 }
 public Espressione build() {
 buildCond();
 return root;
 }
 private void buildCond() {
 buildTermb();
 //I'm unsing a class named Simboli that holds every terminal symbol of my grammar. Symbol names are the same defined in the grammar.
 while (symbol.equalsIgnoreCase(Simboli.OR1) || symbol.equalsIgnoreCase(Simboli.OR2)) {
 Or or = new Or();
 or.setLeft(root);
 buildTermb();
 or.setRight(root);
 root = or;
 }
 }
 private void buildTermb() {
 buildFactb();
 while (symbol.equalsIgnoreCase(Simboli.AND1) || symbol.equalsIgnoreCase(Simboli.AND2)) {
 And and = new And();
 and.setLeft(root);
 buildFactb();
 and.setRight(root);
 root = and;
 }
 }
 private void buildFactb() {
 buildExpr();
 if (symbol.equalsIgnoreCase(Simboli.EQ) || symbol.equalsIgnoreCase(Simboli.NEQ) || symbol.equalsIgnoreCase(Simboli.GT) || symbol.equalsIgnoreCase(Simboli.LT) || symbol.equalsIgnoreCase(Simboli.LE) || symbol.equalsIgnoreCase(Simboli.GE)) {
 Operatore op = null;
 switch (symbol) {
 case Simboli.EQ: {
 op = new Eq();
 break;
 }
 case Simboli.NEQ: {
 op = new Neq();
 break;
 }
 case Simboli.GT: {
 op = new Gt();
 break;
 }
 case Simboli.GE: {
 op = new Ge();
 break;
 }
 case Simboli.LT: {
 op = new Lt();
 break;
 }
 case Simboli.LE: {
 op = new Le();
 break;
 }
 default: {
 //Symbol not recognized, abort!
 throw new RuntimeException("\"" + symbol + "\" non è un simbolo valido.");
 }
 }
 op.setLeft(root);
 buildExpr();
 op.setRight(root);
 root = op;
 } else if (symbol.equalsIgnoreCase(Simboli.NOT1) || symbol.equals(Simboli.NOT2)) {
 Not not = new Not();
 buildFactb();
 not.setRight(root);
 root = not;
 } else if (symbol.equalsIgnoreCase(Simboli.OPAR)) {
 buildCond();
 if (!symbol.equalsIgnoreCase(Simboli.CPAR)) { //If there's no closgin square bracket it means that our expression is not well formed
 throw new RuntimeException("Espressione non valida. Attesa una \")\", trovato \"" + symbol + "\"");
 }
 }
 }
 private void buildExpr() {
 Operatore op = null;
 if (symbol != null) { //Let's check if our expression starts with a + or a -
 if (symbol.equalsIgnoreCase(Simboli.PLUS)) {
 op = new Plus();
 } else if (symbol.equalsIgnoreCase(Simboli.MINUS)) {
 op = new Minus();
 }
 }
 buildTerm();
 //If our op is still null, we didn't found a + or a - so our operand will be a binary one
 //If op != null, our + or - is unary and we've got to manage it! A unary operand doesn't have a left son!
 if (op != null) {
 op.setRight(root);
 root = op;
 }
 //Since we can have something like -3+2+s-x we've got to check it
 while (symbol.equalsIgnoreCase(Simboli.PLUS) || symbol.equalsIgnoreCase(Simboli.MINUS)) {
 Operatore op1 = null; //We define a new Operatore that will be a + or a -
 switch (symbol) {
 case Simboli.PLUS: {
 op1 = new Plus();
 break;
 }
 case Simboli.MINUS: {
 op1 = new Minus();
 break;
 }
 }
 /*
 * Here comes a BIG problem. We used the first if/else to check if
 * our unary operand was at the beginning of the string, but now
 * we've got to see if our current operand is either binary or
 * unary! That's because we can have both a==-1+d and a==3-1+d. In
 * the first case, the - is unary, while is binary in the second.
 * So, how do we choose it?
 * 
 * EXAMPLE : (-a>2 || -b-12>0)
 * This one will be evaluated to -a > 2 || -12 > 0 that's clearly wrong!
 * -b is missing before -12. That's because the -12 is used as unary 
 * and so it won't have a left child (the -b part)
 */
 //--PSEUDOCODE
 if (op1 is not unary) {
 op1.setLeft(root);
 }
 //--END PSEUDOCODE
 //CURRENT IMPLEMENTATION OF THE PSEUDOCODE PART
 if (root != null && (root.getClass().equals(Num.class) || root.getClass().equals(Id.class))) { //It is binary if the previous value is a constant or a variable but not if it is an operand! 
 op1.setLeft(root);
 }
 //END OF CURRENT IMPLEMENTATION OF THE PSEUDOCODE PART
 //Setting the right child must be done in both cases
 buildTerm();
 op1.setRight(root);
 root = op1;
 }
 }
 private void buildTerm() {
 buildTermp();
 while (symbol.equalsIgnoreCase(Simboli.MULT) || symbol.equalsIgnoreCase(Simboli.DIV) || symbol.equalsIgnoreCase(Simboli.REM)) {
 Operatore op = null;
 switch (symbol) {
 case Simboli.MULT: {
 op = new Mult();
 break;
 }
 case Simboli.DIV: {
 op = new Div();
 break;
 }
 case Simboli.REM: {
 op = new Rem();
 break;
 }
 }
 op.setLeft(root);
 buildTermp();
 op.setRight(root);
 root = op;
 }
 }
 private void buildTermp() {
 buildFact();
 while (symbol.equalsIgnoreCase(Simboli.POWER)) {
 Power p = new Power();
 p.setLeft(root);
 buildFact();
 p.setRight(root);
 root = p;
 }
 }
 private void buildFact() {
 symbol = gestoreInput.prossimoSimbolo();
 if (symbol.equalsIgnoreCase(Simboli.OPAR1)) { //Sottoespressione 
 buildExpr();
 if (!symbol.equalsIgnoreCase(Simboli.CPAR1)) { //Se non c'è una parentesi chiusa vuol dire che l'espressione non valida
 throw new RuntimeException("Espressione non valida. Attesa una \"]\", trovato \"" + symbol + "\"");
 }
 } else if (symbol.matches("[A-Za-z][A-Za-z | 0-9]*")) { //Nome di una variabile!
 root = new Id(symbol);
 symbol = gestoreInput.prossimoSimbolo();
 } else if (symbol.matches("[0-9]*")) { //E' una costante!
 root = new Num(symbol);
 symbol = gestoreInput.prossimoSimbolo();
 }
 }
}

Known problems :

(a<=b && c>1) || a==4 evaluates to a <= b && c > 1

a==[-4] evaluates to a == a - 4

-4+3>c-2 evaluates to +3 > c - 2

Probably there are some more errors, but those are the most common.

So here are my questions :

First of all, do you think that this code has some logical issues? I mean, does it do what the grammar says or I did something really wrong? How would you fix the expr method to make it work with both unary and binary + or - operands?

If my code is totally wrong, is there someone that can help me writing a working version? As you can see in the class name, this is the sixth implementation that I wrote, and I really don't know what to do next!

Thanks.

Question 2

To answer these questions are almost a project on its own. My suggestion is to use a series of test-cases to test both the acceptance and rejection of statements.

Question 3

Maybe my questions are not well written, but I already know what doesn't work and why, I just don't know how to fix it :)

Question 4

Step 1: Think about the grammar in a different way

I think you are having problems implementing your recursive-descent parser because your grammar is very complex. The arbitrary iteration represented by the [ ... ]* forms is hard to wrap your head around. Try thinking about it this way instead:

<cond> → <termb> <cond-tail>
<cond-tail> → OR <termb> <cond-tail> | EPSILON
<termb> → <factb> <termb-tail>
<termb-tail> → AND <factb> <termb-tail> | EPSILON
<factb> → <expr> RELOP <expr> | NOT <factb> | OPAR <cond> CPAR
<expr> → <unary-op> <term> <expr-tail>
<unary-op> → PLUS | MINUS | EPSILON
<expr-tail> → ((PLUS <term>) | (MINUS <term>)) <expr-tail> | EPSILON
<term> → <termp> <term-tail>
<term-tail> → ((MULT <termp>) | (DIV <termp>) | (REM <termp>)) <term-tail> | EPSILON
<termp> → <fact> <termp-tail>
<termp-tail> → POWER <fact> <termp-tail> | EPSILON
<fact> → ID | NUM | OPAR1 <expr> CPAR1
EPSILON → ""

This grammar is the same as the one you posted, but I split the [ ... ]* rules off into their own non-terminals. In order to do that I added the EPSILON rule, which allows a non-terminal to match on "" (the empty string). This is basically the same thing as your [ ... ] rules, which may or may not actually match something. The epsilon rule is just a final alternative that says "stop trying to match now." For example, if you are currently parsing for a cond-tail then you would first check for an OR on the input. If you don't see an OR then you'd move on to the next alternative, which is EPSILON. That means there are no more OR conditions in this conditional chain, so this cond-tail matched on the empty string.

You can further simplify the grammar by grouping the operators together in expr-tail and term-tail:

<cond> → <termb> <cond-tail>
<cond-tail> → OR <termb> <cond-tail> | EPSILON
<termb> → <factb> <termb-tail>
<termb-tail> → AND <factb> <termb-tail> | EPSILON
<factb> → <expr> RELOP <expr> | NOT <factb> | OPAR <cond> CPAR
<expr> → <unary-op> <term> <expr-tail>
<unary-op> → PLUS | MINUS | EPSILON
<expr-tail> → (PLUS | MINUS) <term> <expr-tail> | EPSILON
<term> → <termp> <term-tail>
<term-tail> → (MULT | DIV | REM) <termp> <term-tail> | EPSILON
<termp> → <fact> <termp-tail>
<termp-tail> → POWER <fact> <termp-tail> | EPSILON
<fact> → ID | NUM | OPAR1 <expr> CPAR1
EPSILON → ""

Now you should go back and refactor your solution to add new methods for each of the new rules. (You can choose to leave the new rules' code in the original rules instead, but at least clearly mark the different sections with comments to help you keep track of which parts of the code are doing what.)

Now it should be very obvious in which cases + and - are unary operators because they'll be parsed by your new buildUnaryOp method (corresponding to the new unary-op rule in the grammar).

As for the new *-tail rules, you can choose to implement them strictly recursively, or you can implement them with a loop within the function body if you're worried about blowing out your stack on long expressions.

Step 2: Fix your state tracking

This is your main problem right here:

/**
 * Espressione is the Composite structure that represents and expression
 */
private Espressione root = null;
private String symbol = null;

You are using instance-level fields to store the state of your recursive-descent parser. I've never seen a recursive-descent parser whose methods have a return-type of void—at least not if it's actually building an Abstract Syntax Tree. The example on wikipedia has void return-types, but that's because all it's doing is checking the input and throwing away the results, not actually building an AST from the input.

Since you're storing your state in class-level fields, the field is getting overwritten in each sibling-call to buildExpr() and you're losing your data. That's my theory anyway. You can test it by making a long string of exprs like "1+2+3+4+5". It should end up dropping all of the terms on the front.

I would recommend building the parser functionally (not using any .set*() methods), where each build*() method returns the parsed node from the AST. That would involve changing all of your constructors to accept the child nodes. You can still use mutation instead if that's more clear to you though. Here's a refactored buildCond:

private Espressione buildCond() {
 Espressione leftHandSide = buildTermb();
 if (symbol.equalsIgnoreCase(Simboli.OR1) || symbol.equalsIgnoreCase(Simboli.OR2)) {
 Or or = new Or();
 or.setLeft(leftHandSide);
 or.setRight(buildTermb());
 return or;
 }
 return leftHandSide;
 }

Here's what the new build() would look like:

public Espressione build() {
 return buildCond();
}

If you re-work the rest of your code to build the tree in this manner (rather than using the instance-level fields to hold your state) then it should work much better.

Good luck!

Question 5

I really appreciate your solution, but I think that I can't make it because the grammar is part of the assignment and I can't change it :\ But, again, your answer is really good, thanks!

Question 6

The grammar I posted is exactly the same as the one you posted, just written in a different way. I'm trying to give you a different way to think about it. No compilers TA I know would mark you down for adding "helper methods" to make your code more clear. buildUnaryOp is just a "helper method" that helps decompose a very large method into smaller methods. Isn't that what they teach you to do in your program design classes?

Question 7

I know that the grammar is basically the same, but I think that I'm forced to follow the pattern that my teacher gave. This means that I can't add new methods and do this kind of refactoring. Your solution would be the right one if this wasn't an assignment and if I would not have to follow what my teacher said! :( So, to summarize, I need to do it without rewriting grammar rules and having exactly one method for each rule of the given grammar!

Question 8

I've updated my answer. It seems like Step 1 didn't help you much, but Step 2 should solve your problems. Storing state in instance-level fields (non-static members of your class) during recursion is pretty much always a bad idea. I'm pretty sure that's what's causing you to lose pieces of your expected results.

Question 9

Just to clear up any possible misunderstanding with my last comment, storing recursive state in static members would be just as bad if not worse. I wasn't advocating that you add the static keyword anywhere.

DaoWen 33.1k6 gold badges77 silver badges102 bronze badges · Accepted Answer · 2012-08-22 15:11:05Z

Step 1: Think about the grammar in a different way

I think you are having problems implementing your recursive-descent parser because your grammar is very complex. The arbitrary iteration represented by the [ ... ]* forms is hard to wrap your head around. Try thinking about it this way instead:

<cond> → <termb> <cond-tail>
<cond-tail> → OR <termb> <cond-tail> | EPSILON
<termb> → <factb> <termb-tail>
<termb-tail> → AND <factb> <termb-tail> | EPSILON
<factb> → <expr> RELOP <expr> | NOT <factb> | OPAR <cond> CPAR
<expr> → <unary-op> <term> <expr-tail>
<unary-op> → PLUS | MINUS | EPSILON
<expr-tail> → ((PLUS <term>) | (MINUS <term>)) <expr-tail> | EPSILON
<term> → <termp> <term-tail>
<term-tail> → ((MULT <termp>) | (DIV <termp>) | (REM <termp>)) <term-tail> | EPSILON
<termp> → <fact> <termp-tail>
<termp-tail> → POWER <fact> <termp-tail> | EPSILON
<fact> → ID | NUM | OPAR1 <expr> CPAR1
EPSILON → ""

This grammar is the same as the one you posted, but I split the [ ... ]* rules off into their own non-terminals. In order to do that I added the EPSILON rule, which allows a non-terminal to match on "" (the empty string). This is basically the same thing as your [ ... ] rules, which may or may not actually match something. The epsilon rule is just a final alternative that says "stop trying to match now." For example, if you are currently parsing for a cond-tail then you would first check for an OR on the input. If you don't see an OR then you'd move on to the next alternative, which is EPSILON. That means there are no more OR conditions in this conditional chain, so this cond-tail matched on the empty string.

You can further simplify the grammar by grouping the operators together in expr-tail and term-tail:

<cond> → <termb> <cond-tail>
<cond-tail> → OR <termb> <cond-tail> | EPSILON
<termb> → <factb> <termb-tail>
<termb-tail> → AND <factb> <termb-tail> | EPSILON
<factb> → <expr> RELOP <expr> | NOT <factb> | OPAR <cond> CPAR
<expr> → <unary-op> <term> <expr-tail>
<unary-op> → PLUS | MINUS | EPSILON
<expr-tail> → (PLUS | MINUS) <term> <expr-tail> | EPSILON
<term> → <termp> <term-tail>
<term-tail> → (MULT | DIV | REM) <termp> <term-tail> | EPSILON
<termp> → <fact> <termp-tail>
<termp-tail> → POWER <fact> <termp-tail> | EPSILON
<fact> → ID | NUM | OPAR1 <expr> CPAR1
EPSILON → ""

Now you should go back and refactor your solution to add new methods for each of the new rules. (You can choose to leave the new rules' code in the original rules instead, but at least clearly mark the different sections with comments to help you keep track of which parts of the code are doing what.)

Now it should be very obvious in which cases + and - are unary operators because they'll be parsed by your new buildUnaryOp method (corresponding to the new unary-op rule in the grammar).

As for the new *-tail rules, you can choose to implement them strictly recursively, or you can implement them with a loop within the function body if you're worried about blowing out your stack on long expressions.

Step 2: Fix your state tracking

This is your main problem right here:

/**
 * Espressione is the Composite structure that represents and expression
 */
private Espressione root = null;
private String symbol = null;

You are using instance-level fields to store the state of your recursive-descent parser. I've never seen a recursive-descent parser whose methods have a return-type of void—at least not if it's actually building an Abstract Syntax Tree. The example on wikipedia has void return-types, but that's because all it's doing is checking the input and throwing away the results, not actually building an AST from the input.

Since you're storing your state in class-level fields, the field is getting overwritten in each sibling-call to buildExpr() and you're losing your data. That's my theory anyway. You can test it by making a long string of exprs like "1+2+3+4+5". It should end up dropping all of the terms on the front.

I would recommend building the parser functionally (not using any .set*() methods), where each build*() method returns the parsed node from the AST. That would involve changing all of your constructors to accept the child nodes. You can still use mutation instead if that's more clear to you though. Here's a refactored buildCond:

private Espressione buildCond() {
 Espressione leftHandSide = buildTermb();
 if (symbol.equalsIgnoreCase(Simboli.OR1) || symbol.equalsIgnoreCase(Simboli.OR2)) {
 Or or = new Or();
 or.setLeft(leftHandSide);
 or.setRight(buildTermb());
 return or;
 }
 return leftHandSide;
 }

Here's what the new build() would look like:

public Espressione build() {
 return buildCond();
}

If you re-work the rest of your code to build the tree in this manner (rather than using the instance-level fields to hold your state) then it should work much better.

Good luck!

I really appreciate your solution, but I think that I can't make it because the grammar is part of the assignment and I can't change it :\ But, again, your answer is really good, thanks!
The grammar I posted is exactly the same as the one you posted, just written in a different way. I'm trying to give you a different way to think about it. No compilers TA I know would mark you down for adding "helper methods" to make your code more clear. buildUnaryOp is just a "helper method" that helps decompose a very large method into smaller methods. Isn't that what they teach you to do in your program design classes?
I know that the grammar is basically the same, but I think that I'm forced to follow the pattern that my teacher gave. This means that I can't add new methods and do this kind of refactoring. Your solution would be the right one if this wasn't an assignment and if I would not have to follow what my teacher said! :( So, to summarize, I need to do it without rewriting grammar rules and having exactly one method for each rule of the given grammar!
I've updated my answer. It seems like Step 1 didn't help you much, but Step 2 should solve your problems. Storing state in instance-level fields (non-static members of your class) during recursion is pretty much always a bad idea. I'm pretty sure that's what's causing you to lose pieces of your expected results.
Just to clear up any possible misunderstanding with my last comment, storing recursive state in static members would be just as bad if not worse. I wasn't advocating that you add the static keyword anywhere.

CollectivesTM on Stack Overflow

Writing a Recursive Descendent Parser for my grammar

1 Answer 1

Step 1: Think about the grammar in a different way

Step 2: Fix your state tracking

5 Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

CollectivesTM on Stack Overflow

1 Answer 1

Step 1: Think about the grammar in a different way

Step 2: Fix your state tracking

5 Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related