A header-only linear-time C++11 PEG parser generator supporting left-recursion and grammar ambiguity
I've rewritten my original parser generator to a header-only library which uses templates and functionals for better type safety and clarity. The generated parser creates an abstract syntax tree which can be evaluated effectively using functionals and a visitor pattern. The parser memorizes intermediate steps resulting in guaranteed linear parsing time (squared in worst-case if the grammar includes left-recursion). My goal is to create a general-purpose C++ parser generator with focus on usability. So far documentation is missing but I believe the usage should become more or less clear from the following example.
This code creates a simple calculator using left-recursion and C++11 functionals:
#include <iostream>
#include <unordered_map>
#include <cmath>
#include "parser/parser.h"
using namespace std;
using namespace lars;
int main(int argc, char ** argv){
parsing_expression_grammar_builder<double> g;
using expression = expression<double>;
unordered_map<string,double> variables;
g["Expression"] << "Set | Sum" << [ ](expression e){ e.value() = e[0].get_value(); };
g["Set" ] << "Name '=' Sum" << [&](expression e){ variables[e[0].string()] = e[1].get_value(); };
g["Sum" ] << "Add | Subtract | Product" << [ ](expression e){ e.value() = e[0].get_value(); };
g["Add" ] << "Sum '+' Product" << [ ](expression e){ e.value() = e[0].get_value() + e[1].get_value(); };
g["Subtract" ] << "Sum '-' Product" << [ ](expression e){ e.value() = e[0].get_value() - e[1].get_value(); };
g["Product" ] << "Multiply | Divide | Exponent" << [ ](expression e){ e.value() = e[0].get_value(); };
g["Multiply" ] << "Product '*' Exponent" << [ ](expression e){ e.value() = e[0].get_value() * e[1].get_value(); };
g["Divide" ] << "Product '/' Exponent" << [ ](expression e){ e.value() = e[0].get_value() / e[1].get_value(); };
g["Exponent" ] << "Power | Atomic" << [ ](expression e){ e.value() = e[0].get_value(); };
g["Power" ] << "Atomic '^' Exponent" << [ ](expression e){ e.value() = pow(e[0].get_value(),e[1].get_value()); };
g["Atomic" ] << "Number | Brackets | Variable" << [ ](expression e){ e.value() = e[0].get_value(); };
g["Brackets" ] << "'(' Sum ')'" << [ ](expression e){ e.value() = e[0].get_value(); };
g["Number" ] << "'-'? [0-9]+ ('.' [0-9]+)?" << [ ](expression e){ e.value() = stod(e.string()); };
g["Variable" ] << "Name" << [&](expression e){ e.value() = variables[e[0].string()]; };
g["Name" ] << "[a-zA-Z] [a-zA-Z0-9]*" ;
g.set_starting_rule("Expression");
g["Whitespace"] << "[ \t]";
g.set_separator_rule("Whitespace");
auto p = g.get_parser();
while (true) {
string str;
cout << "> ";
getline(cin,str);
if(str == "q" || str == "quit")break;
try {
auto e = p.parse(str);
cout << str << " = " << *e.evaluate() << endl;
}
catch (parser<double>::error e){
cout << " ";
for(auto i UNUSED :range(e.begin_position().character))cout << " ";
for(auto i UNUSED :range(e.length()))cout << "~";
cout << "^\n";
cout << e.error_message() << " while parsing " << e.rule_name() << endl;
}
}
return 0;
}
This code creates the same without left-recursion and using a visitor pattern:
#include <iostream>
#include <unordered_map>
#include <cmath>
#include "parser/parser.h"
using namespace std;
using namespace lars;
class math_visitor{
double value;
unordered_map<string,double> variables;
public:
double get_value(){
return value;
}
double get_value(expression<math_visitor> e){
e.accept(this);
return get_value();
}
void visit_number(expression<math_visitor> e){
value = stod(e.string());
}
void visit_set_variable(expression<math_visitor> e){
variables[e[0].string()] = get_value(e[1]);
}
void visit_variable(expression<math_visitor> e){
value = variables[e[0].string()];
}
void visit_left_binary_operator_list (expression<math_visitor> e){
double lhs = get_value(e[0]);
for(auto i:range((e.size()-1)/2)*2+1){
double rhs = get_value(e[i+1]);
if(e[i].character()=='+'){ lhs = lhs + rhs; }
else if(e[i].character()=='-'){ lhs = lhs - rhs; }
else if(e[i].character()=='*'){ lhs = lhs * rhs; }
else if(e[i].character()=='/'){ lhs = lhs / rhs; }
else throw "undefined operator";
}
value = lhs;
}
void visit_exponent(expression<math_visitor> e){
if(e.size() == 1) e[0].accept();
else value = pow(get_value(e[0]), get_value(e[1]));
}
};
int main(int argc, char ** argv){
parsing_expression_grammar_builder<math_visitor> g;
using expression = expression<math_visitor>;
g["Expression"] << "Set | Sum" << [](expression e){ e[0].accept(); };
g["Set" ] << "Name '=' Sum" << [](expression e){ e[0].visitor().visit_set_variable(e); };
g["Sum" ] << "Product (AddSub Product)*" << [](expression e){ e.visitor().visit_left_binary_operator_list(e); };
g["AddSub" ] << "[+-]" ;
g["Product" ] << "Exponent (MulDiv Exponent)*" << [](expression e){ e.visitor().visit_left_binary_operator_list(e); };
g["MulDiv" ] << "[*/]" ;
g["Exponent" ] << "Atomic (('^' | '**') Exponent) | Atomic" << [](expression e){ e.visitor().visit_exponent(e); };
g["Atomic" ] << "Number | Brackets | Variable" << [](expression e){ e[0].accept(); };
g["Brackets" ] << "'(' Sum ')'" << [](expression e){ e[0].accept(); };
g["Number" ] << "'-'? [0-9]+ ('.' [0-9]+)?" << [](expression e){ e.visitor().visit_number(e); };
g["Variable" ] << "Name" << [](expression e){ e.visitor().visit_variable(e); };
g["Name" ] << "[a-zA-Z] [a-zA-Z]*" ;
g.set_starting_rule("Expression");
g["Whitespace"] << "[ \t]";
g.set_separator_rule("Whitespace");
auto p = g.get_parser();
math_visitor visitor;
while (true) {
string str;
cout << "> ";
getline(cin,str);
if(str == "q" || str == "quit")break;
cout << " -> ";
try { p.parse(str).accept(&visitor); cout << visitor.get_value(); }
catch (parser<math_visitor>::error e){ cout << e.error_message(); }
cout << endl;
}
return 0;
}
Some languages (including C) use ambiguous grammars where an expression such as x*y
obviously needs to be parsed differently if x is a) a variable or b) a type. To resolve this I introduced grammar filters which are functionals (expression)->bool
called immediately after matching a production rule.
For example, a word in the following grammar will only register as a type if there is a type named as the word and otherwise as a variable:
unordered_set<string> types;
g["Expression"] << "Type | Variable";
g["Type"] << "Name" >> [&](expression e)->bool{ return types.find(e[0].string()) != types.end(); };
g["Variable"] << "Name" ;
g["Name"] << "[a-zA-Z] [a-zA-Z0-9]*";
What I would like to know is: Is it clear what is happening here based on the examples? Can they or other syntax be improved? Do you think the project is useful?
The full source code is available at GitHib.
2 Answers 2
You have "data" embedded in code in the function visit_left_binary_operator_list
. You could use an std::unordered_map<char, double(*)(double, double)>
to help separate the data and the actual algorithm:
void visit_left_binary_operator_list (expression<math_visitor> e){
static const std::unordered_map<char, double(*)(double, double)> binary_ops = {
{ '+', [](double x, double y) { return x + y; } },
{ '-', [](double x, double y) { return x - y; } },
{ '*', [](double x, double y) { return x * y; } },
{ '/', [](double x, double y) { return x / y; } }
};
double lhs = get_value(e[0]);
for(auto i:range((e.size()-1)/2)*2+1){
auto op = binary_ops.find(e[i].character());
if (op == binary_ops.end()) {
throw "undefined operator";
}
double rhs = get_value(e[i+1]);
lhs = op->second(lhs, rhs);
}
value = lhs;
}
You may also want to find a suitable name for the expression \$ ((x-1)/2)*2+1 \$. Currently, it is not really readable, but it feels like an operation common enough in the wild to need a proper name.
Instead of having a special case for exponent, wouldn't it be better to have a visit_right_binary_operator_list
as well? It would be a first step to some more generic handling of operators. Also, it would be more generic if operators could be std::string
instances instead of char
instances to handle multi-character operators.
To answer your question: while I don't have a full understanding of lexers and parsers, what your code does is clear to me. The only thing I had a problem with at first was the value()
/get_value()
whose naming can be a little bit ambiguous. But once you know visitors, regex and lambdas, you quickly have a good understanding of what is happening.
-
\$\begingroup\$ +1 Great idea with the operator map! I should probably make something like
range(begin,end,increment_size)
to improve readability. For consistency maybe a right operator list would make more sense. I just omitted it here to keep the example short and illustrate that they don't require left-recusion. But I'll probably go for the more consistent way. I usedchar
here because I wanted to show this function in the example but maybestring
would show off the posibilites more ;-) . The visitor pattern was the part I was most worried about, I think I will write a tutorial to clarify. Thanks! \$\endgroup\$Lars Melchior– Lars Melchior2014年12月18日 11:55:49 +00:00Commented Dec 18, 2014 at 11:55 -
\$\begingroup\$ @Lars I understand your choices. When I wrote my first expression parser (really ugly), I only used one-character operators and only implemented what I needed right now without thinking of the future. Using
std::string
should be cleaner now that we have the user-defined literal""s
anyway :) \$\endgroup\$Morwenn– Morwenn2014年12月18日 12:06:06 +00:00Commented Dec 18, 2014 at 12:06
Maybe this should be a comment but why is it header-only?
More especially, why is the grammar (e.g. strings such as "Set | Sum"
) embedded in the header software, instead of being in a text/grammar file? I'm used to seeing the grammar in a file that's separate from the parser software.
My second comment is that you don't seem to have a separate lexxer and parser?
Thirdly it's not obvious at first glance what statements like ...
[](expression e){ e.visitor().visit_left_binary_operator_list(e); }
... are about. Yet apparently (since they're mixed-in to the grammar) I'd need to understand/write statements like that, if I want to use the parser with a new grammar (i.e. if I want to define a new grammar)?
-
1\$\begingroup\$ I believe you misunderstood what I mean with header-only. The parser generator library itself is header only, meaning you only have to include the header to use it in your project. You can then define the grammar where you want, usually in implementation files. Secondly, you are right, there is no separate lexing step required as you can simply define the tokens in context and mix them with the language. Thirdly, I'm not sure I understand your question, but you don't need to use c++11 lambda functions (normal function pointers would also work). \$\endgroup\$Lars Melchior– Lars Melchior2014年12月18日 11:20:56 +00:00Commented Dec 18, 2014 at 11:20
using namespace std
going to use it back in my code too. \$\endgroup\$