I've got to build an expression using a Recursive Descendent Parser Builder.
Here's my grammar :
----RULES----
<cond> → <termb> [OR <termb>]*
<termb>→<factb>[AND <factb>]*
<factb>→<expr> RELOP <expr> | NOT <factb> | OPAR <cond> CPAR
<expr> → [PLUS | MINUS] <term> [(PLUS <term>) | (MINUS <term>)]*
<term> → <termp> [(MULT <termp>) | (DIV <termp>) | (REM <termp>)]*
<termp> → <fact> [POWER <fact>]*
<fact> → ID | NUM | OPAR1 <expr> CPAR1
----TERMINALS----
ID → ("A" | ... | "Z" | "a" | ...| "z") [("A"| ... | "Z" | "a" | ...| "z" | "0" | ... | "9")]*
NUM → ("0" | ... | "9") [("0" | ... | "9")]*
OPAR → "("
CPAR → ")"
OPAR1 → "["
CPAR1 → "]"
RELOP → EQ | NEQ | GT | GE | LT | LE
EQ → "= ="
NEQ → "!="
GT → ">"
GE → ">="
LT → "<"
LE → "<="
POWER → "^"
DIV → "/"
REM → "%"
MULT → "*"
MINUS → "−"
PLUS → "+"
AND → "and" or "&&"
OR → "or" or "||"
NOT → "not" or "!"
Now, I've got a Composite structure with one class for each terminal and I'm trying to build it by following the rules of the above grammar.
The idea is that I need a method for each grammar rule and each of this methods has to build its part of the expression tree.
While it works with a simpler grammar (for example a grammar that allows just boolean expressions), I've got some problems with this one.
The main problem comes from the expr rule which forces me to work even with the unary versions of my + and - by allowing expressions such +3-4. This requires me to find out when the operand has to be considered unary and when binary!
Here's the code of my Builder class. Please note that something is in Italian but I've explained EVERYTHING, even my problems, using the comments. Also note that there is one line in which i used a pseudocode, so this one is NOT compilable!
package espressioniCondizionali;
import espressioniCondizionali.espressioni.*;
/**
*
* @author Stefano
*/
public class Builder6 {
/**
* This one is used just to parse the input String.
* It has a prossimoSimbolo() method (something like a nextSymbol()) that returns a String with the next Symbol in the String
*/
private GestoreInput gestoreInput;
/**
* Espressione is the Composite structure that represents and expression
*/
private Espressione root = null;
private String symbol = null;
public Builder6(GestoreInput gestoreInput) {
this.gestoreInput = gestoreInput;
}
public Espressione build() {
buildCond();
return root;
}
private void buildCond() {
buildTermb();
//I'm unsing a class named Simboli that holds every terminal symbol of my grammar. Symbol names are the same defined in the grammar.
while (symbol.equalsIgnoreCase(Simboli.OR1) || symbol.equalsIgnoreCase(Simboli.OR2)) {
Or or = new Or();
or.setLeft(root);
buildTermb();
or.setRight(root);
root = or;
}
}
private void buildTermb() {
buildFactb();
while (symbol.equalsIgnoreCase(Simboli.AND1) || symbol.equalsIgnoreCase(Simboli.AND2)) {
And and = new And();
and.setLeft(root);
buildFactb();
and.setRight(root);
root = and;
}
}
private void buildFactb() {
buildExpr();
if (symbol.equalsIgnoreCase(Simboli.EQ) || symbol.equalsIgnoreCase(Simboli.NEQ) || symbol.equalsIgnoreCase(Simboli.GT) || symbol.equalsIgnoreCase(Simboli.LT) || symbol.equalsIgnoreCase(Simboli.LE) || symbol.equalsIgnoreCase(Simboli.GE)) {
Operatore op = null;
switch (symbol) {
case Simboli.EQ: {
op = new Eq();
break;
}
case Simboli.NEQ: {
op = new Neq();
break;
}
case Simboli.GT: {
op = new Gt();
break;
}
case Simboli.GE: {
op = new Ge();
break;
}
case Simboli.LT: {
op = new Lt();
break;
}
case Simboli.LE: {
op = new Le();
break;
}
default: {
//Symbol not recognized, abort!
throw new RuntimeException("\"" + symbol + "\" non è un simbolo valido.");
}
}
op.setLeft(root);
buildExpr();
op.setRight(root);
root = op;
} else if (symbol.equalsIgnoreCase(Simboli.NOT1) || symbol.equals(Simboli.NOT2)) {
Not not = new Not();
buildFactb();
not.setRight(root);
root = not;
} else if (symbol.equalsIgnoreCase(Simboli.OPAR)) {
buildCond();
if (!symbol.equalsIgnoreCase(Simboli.CPAR)) { //If there's no closgin square bracket it means that our expression is not well formed
throw new RuntimeException("Espressione non valida. Attesa una \")\", trovato \"" + symbol + "\"");
}
}
}
private void buildExpr() {
Operatore op = null;
if (symbol != null) { //Let's check if our expression starts with a + or a -
if (symbol.equalsIgnoreCase(Simboli.PLUS)) {
op = new Plus();
} else if (symbol.equalsIgnoreCase(Simboli.MINUS)) {
op = new Minus();
}
}
buildTerm();
//If our op is still null, we didn't found a + or a - so our operand will be a binary one
//If op != null, our + or - is unary and we've got to manage it! A unary operand doesn't have a left son!
if (op != null) {
op.setRight(root);
root = op;
}
//Since we can have something like -3+2+s-x we've got to check it
while (symbol.equalsIgnoreCase(Simboli.PLUS) || symbol.equalsIgnoreCase(Simboli.MINUS)) {
Operatore op1 = null; //We define a new Operatore that will be a + or a -
switch (symbol) {
case Simboli.PLUS: {
op1 = new Plus();
break;
}
case Simboli.MINUS: {
op1 = new Minus();
break;
}
}
/*
* Here comes a BIG problem. We used the first if/else to check if
* our unary operand was at the beginning of the string, but now
* we've got to see if our current operand is either binary or
* unary! That's because we can have both a==-1+d and a==3-1+d. In
* the first case, the - is unary, while is binary in the second.
* So, how do we choose it?
*
* EXAMPLE : (-a>2 || -b-12>0)
* This one will be evaluated to -a > 2 || -12 > 0 that's clearly wrong!
* -b is missing before -12. That's because the -12 is used as unary
* and so it won't have a left child (the -b part)
*/
//--PSEUDOCODE
if (op1 is not unary) {
op1.setLeft(root);
}
//--END PSEUDOCODE
//CURRENT IMPLEMENTATION OF THE PSEUDOCODE PART
if (root != null && (root.getClass().equals(Num.class) || root.getClass().equals(Id.class))) { //It is binary if the previous value is a constant or a variable but not if it is an operand!
op1.setLeft(root);
}
//END OF CURRENT IMPLEMENTATION OF THE PSEUDOCODE PART
//Setting the right child must be done in both cases
buildTerm();
op1.setRight(root);
root = op1;
}
}
private void buildTerm() {
buildTermp();
while (symbol.equalsIgnoreCase(Simboli.MULT) || symbol.equalsIgnoreCase(Simboli.DIV) || symbol.equalsIgnoreCase(Simboli.REM)) {
Operatore op = null;
switch (symbol) {
case Simboli.MULT: {
op = new Mult();
break;
}
case Simboli.DIV: {
op = new Div();
break;
}
case Simboli.REM: {
op = new Rem();
break;
}
}
op.setLeft(root);
buildTermp();
op.setRight(root);
root = op;
}
}
private void buildTermp() {
buildFact();
while (symbol.equalsIgnoreCase(Simboli.POWER)) {
Power p = new Power();
p.setLeft(root);
buildFact();
p.setRight(root);
root = p;
}
}
private void buildFact() {
symbol = gestoreInput.prossimoSimbolo();
if (symbol.equalsIgnoreCase(Simboli.OPAR1)) { //Sottoespressione
buildExpr();
if (!symbol.equalsIgnoreCase(Simboli.CPAR1)) { //Se non c'è una parentesi chiusa vuol dire che l'espressione non valida
throw new RuntimeException("Espressione non valida. Attesa una \"]\", trovato \"" + symbol + "\"");
}
} else if (symbol.matches("[A-Za-z][A-Za-z | 0-9]*")) { //Nome di una variabile!
root = new Id(symbol);
symbol = gestoreInput.prossimoSimbolo();
} else if (symbol.matches("[0-9]*")) { //E' una costante!
root = new Num(symbol);
symbol = gestoreInput.prossimoSimbolo();
}
}
}
Known problems :
(a<=b && c>1) || a==4 evaluates to a <= b && c > 1
a==[-4] evaluates to a == a - 4
-4+3>c-2 evaluates to +3 > c - 2
Probably there are some more errors, but those are the most common.
So here are my questions :
First of all, do you think that this code has some logical issues? I mean, does it do what the grammar says or I did something really wrong?
How would you fix the expr method to make it work with both unary and binary + or - operands?
If my code is totally wrong, is there someone that can help me writing a working version? As you can see in the class name, this is the sixth implementation that I wrote, and I really don't know what to do next!
Thanks.
-
To answer these questions are almost a project on its own. My suggestion is to use a series of test-cases to test both the acceptance and rejection of statements.Jaco Van Niekerk– Jaco Van Niekerk2012年08月22日 14:25:32 +00:00Commented Aug 22, 2012 at 14:25
-
Maybe my questions are not well written, but I already know what doesn't work and why, I just don't know how to fix it :)StepTNT– StepTNT2012年08月22日 14:30:22 +00:00Commented Aug 22, 2012 at 14:30
1 Answer 1
Step 1: Think about the grammar in a different way
I think you are having problems implementing your recursive-descent parser because your grammar is very complex. The arbitrary iteration represented by the [ ... ]* forms is hard to wrap your head around. Try thinking about it this way instead:
<cond> → <termb> <cond-tail>
<cond-tail> → OR <termb> <cond-tail> | EPSILON
<termb> → <factb> <termb-tail>
<termb-tail> → AND <factb> <termb-tail> | EPSILON
<factb> → <expr> RELOP <expr> | NOT <factb> | OPAR <cond> CPAR
<expr> → <unary-op> <term> <expr-tail>
<unary-op> → PLUS | MINUS | EPSILON
<expr-tail> → ((PLUS <term>) | (MINUS <term>)) <expr-tail> | EPSILON
<term> → <termp> <term-tail>
<term-tail> → ((MULT <termp>) | (DIV <termp>) | (REM <termp>)) <term-tail> | EPSILON
<termp> → <fact> <termp-tail>
<termp-tail> → POWER <fact> <termp-tail> | EPSILON
<fact> → ID | NUM | OPAR1 <expr> CPAR1
EPSILON → ""
This grammar is the same as the one you posted, but I split the [ ... ]* rules off into their own non-terminals. In order to do that I added the EPSILON rule, which allows a non-terminal to match on "" (the empty string). This is basically the same thing as your [ ... ] rules, which may or may not actually match something. The epsilon rule is just a final alternative that says "stop trying to match now." For example, if you are currently parsing for a cond-tail then you would first check for an OR on the input. If you don't see an OR then you'd move on to the next alternative, which is EPSILON. That means there are no more OR conditions in this conditional chain, so this cond-tail matched on the empty string.
You can further simplify the grammar by grouping the operators together in expr-tail and term-tail:
<cond> → <termb> <cond-tail>
<cond-tail> → OR <termb> <cond-tail> | EPSILON
<termb> → <factb> <termb-tail>
<termb-tail> → AND <factb> <termb-tail> | EPSILON
<factb> → <expr> RELOP <expr> | NOT <factb> | OPAR <cond> CPAR
<expr> → <unary-op> <term> <expr-tail>
<unary-op> → PLUS | MINUS | EPSILON
<expr-tail> → (PLUS | MINUS) <term> <expr-tail> | EPSILON
<term> → <termp> <term-tail>
<term-tail> → (MULT | DIV | REM) <termp> <term-tail> | EPSILON
<termp> → <fact> <termp-tail>
<termp-tail> → POWER <fact> <termp-tail> | EPSILON
<fact> → ID | NUM | OPAR1 <expr> CPAR1
EPSILON → ""
Now you should go back and refactor your solution to add new methods for each of the new rules. (You can choose to leave the new rules' code in the original rules instead, but at least clearly mark the different sections with comments to help you keep track of which parts of the code are doing what.)
Now it should be very obvious in which cases + and - are unary operators because they'll be parsed by your new buildUnaryOp method (corresponding to the new unary-op rule in the grammar).
As for the new *-tail rules, you can choose to implement them strictly recursively, or you can implement them with a loop within the function body if you're worried about blowing out your stack on long expressions.
Step 2: Fix your state tracking
This is your main problem right here:
/**
* Espressione is the Composite structure that represents and expression
*/
private Espressione root = null;
private String symbol = null;
You are using instance-level fields to store the state of your recursive-descent parser. I've never seen a recursive-descent parser whose methods have a return-type of void—at least not if it's actually building an Abstract Syntax Tree. The example on wikipedia has void return-types, but that's because all it's doing is checking the input and throwing away the results, not actually building an AST from the input.
Since you're storing your state in class-level fields, the field is getting overwritten in each sibling-call to buildExpr() and you're losing your data. That's my theory anyway. You can test it by making a long string of exprs like "1+2+3+4+5". It should end up dropping all of the terms on the front.
I would recommend building the parser functionally (not using any .set*() methods), where each build*() method returns the parsed node from the AST. That would involve changing all of your constructors to accept the child nodes. You can still use mutation instead if that's more clear to you though. Here's a refactored buildCond:
private Espressione buildCond() {
Espressione leftHandSide = buildTermb();
if (symbol.equalsIgnoreCase(Simboli.OR1) || symbol.equalsIgnoreCase(Simboli.OR2)) {
Or or = new Or();
or.setLeft(leftHandSide);
or.setRight(buildTermb());
return or;
}
return leftHandSide;
}
Here's what the new build() would look like:
public Espressione build() {
return buildCond();
}
If you re-work the rest of your code to build the tree in this manner (rather than using the instance-level fields to hold your state) then it should work much better.
Good luck!
5 Comments
buildUnaryOp is just a "helper method" that helps decompose a very large method into smaller methods. Isn't that what they teach you to do in your program design classes?static keyword anywhere.Explore related questions
See similar questions with these tags.