A header-only linear-time C++11 PEG parser generator supporting left-recursion and grammar ambiguity

Question 1

I've rewritten my original parser generator to a header-only library which uses templates and functionals for better type safety and clarity. The generated parser creates an abstract syntax tree which can be evaluated effectively using functionals and a visitor pattern. The parser memorizes intermediate steps resulting in guaranteed linear parsing time (squared in worst-case if the grammar includes left-recursion). My goal is to create a general-purpose C++ parser generator with focus on usability. So far documentation is missing but I believe the usage should become more or less clear from the following example.

This code creates a simple calculator using left-recursion and C++11 functionals:

#include <iostream>
#include <unordered_map>
#include <cmath>
#include "parser/parser.h"
using namespace std;
using namespace lars;
int main(int argc, char ** argv){
 parsing_expression_grammar_builder<double> g;
 using expression = expression<double>;
 unordered_map<string,double> variables;
 g["Expression"] << "Set | Sum" << [ ](expression e){ e.value() = e[0].get_value(); };
 g["Set" ] << "Name '=' Sum" << [&](expression e){ variables[e[0].string()] = e[1].get_value(); };
 g["Sum" ] << "Add | Subtract | Product" << [ ](expression e){ e.value() = e[0].get_value(); };
 g["Add" ] << "Sum '+' Product" << [ ](expression e){ e.value() = e[0].get_value() + e[1].get_value(); };
 g["Subtract" ] << "Sum '-' Product" << [ ](expression e){ e.value() = e[0].get_value() - e[1].get_value(); };
 g["Product" ] << "Multiply | Divide | Exponent" << [ ](expression e){ e.value() = e[0].get_value(); };
 g["Multiply" ] << "Product '*' Exponent" << [ ](expression e){ e.value() = e[0].get_value() * e[1].get_value(); };
 g["Divide" ] << "Product '/' Exponent" << [ ](expression e){ e.value() = e[0].get_value() / e[1].get_value(); };
 g["Exponent" ] << "Power | Atomic" << [ ](expression e){ e.value() = e[0].get_value(); };
 g["Power" ] << "Atomic '^' Exponent" << [ ](expression e){ e.value() = pow(e[0].get_value(),e[1].get_value()); };
 g["Atomic" ] << "Number | Brackets | Variable" << [ ](expression e){ e.value() = e[0].get_value(); };
 g["Brackets" ] << "'(' Sum ')'" << [ ](expression e){ e.value() = e[0].get_value(); };
 g["Number" ] << "'-'? [0-9]+ ('.' [0-9]+)?" << [ ](expression e){ e.value() = stod(e.string()); };
 g["Variable" ] << "Name" << [&](expression e){ e.value() = variables[e[0].string()]; };
 g["Name" ] << "[a-zA-Z] [a-zA-Z0-9]*" ;
 g.set_starting_rule("Expression");
 g["Whitespace"] << "[ \t]";
 g.set_separator_rule("Whitespace");
 auto p = g.get_parser();
 while (true) {
 string str;
 cout << "> ";
 getline(cin,str);
 if(str == "q" || str == "quit")break;
 try {
 auto e = p.parse(str);
 cout << str << " = " << *e.evaluate() << endl;
 }
 catch (parser<double>::error e){
 cout << " ";
 for(auto i UNUSED :range(e.begin_position().character))cout << " ";
 for(auto i UNUSED :range(e.length()))cout << "~";
 cout << "^\n";
 cout << e.error_message() << " while parsing " << e.rule_name() << endl;
 }
 }
 return 0;
}

This code creates the same without left-recursion and using a visitor pattern:

#include <iostream>
#include <unordered_map>
#include <cmath>
#include "parser/parser.h"
using namespace std;
using namespace lars;
class math_visitor{
 double value;
 unordered_map<string,double> variables;
public:
 double get_value(){
 return value;
 }
 double get_value(expression<math_visitor> e){
 e.accept(this);
 return get_value();
 }
 void visit_number(expression<math_visitor> e){
 value = stod(e.string());
 }
 void visit_set_variable(expression<math_visitor> e){
 variables[e[0].string()] = get_value(e[1]);
 }
 void visit_variable(expression<math_visitor> e){
 value = variables[e[0].string()];
 }
 void visit_left_binary_operator_list (expression<math_visitor> e){
 double lhs = get_value(e[0]);
 for(auto i:range((e.size()-1)/2)*2+1){
 double rhs = get_value(e[i+1]);
 if(e[i].character()=='+'){ lhs = lhs + rhs; }
 else if(e[i].character()=='-'){ lhs = lhs - rhs; }
 else if(e[i].character()=='*'){ lhs = lhs * rhs; }
 else if(e[i].character()=='/'){ lhs = lhs / rhs; }
 else throw "undefined operator";
 }
 value = lhs;
 }
 void visit_exponent(expression<math_visitor> e){
 if(e.size() == 1) e[0].accept();
 else value = pow(get_value(e[0]), get_value(e[1]));
 }
};
int main(int argc, char ** argv){
 parsing_expression_grammar_builder<math_visitor> g;
 using expression = expression<math_visitor>;
 g["Expression"] << "Set | Sum" << [](expression e){ e[0].accept(); };
 g["Set" ] << "Name '=' Sum" << [](expression e){ e[0].visitor().visit_set_variable(e); };
 g["Sum" ] << "Product (AddSub Product)*" << [](expression e){ e.visitor().visit_left_binary_operator_list(e); };
 g["AddSub" ] << "[+-]" ;
 g["Product" ] << "Exponent (MulDiv Exponent)*" << [](expression e){ e.visitor().visit_left_binary_operator_list(e); };
 g["MulDiv" ] << "[*/]" ;
 g["Exponent" ] << "Atomic (('^' | '**') Exponent) | Atomic" << [](expression e){ e.visitor().visit_exponent(e); };
 g["Atomic" ] << "Number | Brackets | Variable" << [](expression e){ e[0].accept(); };
 g["Brackets" ] << "'(' Sum ')'" << [](expression e){ e[0].accept(); };
 g["Number" ] << "'-'? [0-9]+ ('.' [0-9]+)?" << [](expression e){ e.visitor().visit_number(e); };
 g["Variable" ] << "Name" << [](expression e){ e.visitor().visit_variable(e); };
 g["Name" ] << "[a-zA-Z] [a-zA-Z]*" ;
 g.set_starting_rule("Expression");
 g["Whitespace"] << "[ \t]";
 g.set_separator_rule("Whitespace");
 auto p = g.get_parser();
 math_visitor visitor;
 while (true) {
 string str;
 cout << "> ";
 getline(cin,str);
 if(str == "q" || str == "quit")break;
 cout << " -> ";
 try { p.parse(str).accept(&visitor); cout << visitor.get_value(); }
 catch (parser<math_visitor>::error e){ cout << e.error_message(); }
 cout << endl;
 }
 return 0;
}

Some languages (including C) use ambiguous grammars where an expression such as x*y obviously needs to be parsed differently if x is a) a variable or b) a type. To resolve this I introduced grammar filters which are functionals (expression)->bool called immediately after matching a production rule.

For example, a word in the following grammar will only register as a type if there is a type named as the word and otherwise as a variable:

 unordered_set<string> types;
 g["Expression"] << "Type | Variable";
 g["Type"] << "Name" >> [&](expression e)->bool{ return types.find(e[0].string()) != types.end(); };
 g["Variable"] << "Name" ;
 g["Name"] << "[a-zA-Z] [a-zA-Z0-9]*";

What I would like to know is: Is it clear what is happening here based on the examples? Can they or other syntax be improved? Do you think the project is useful?

The full source code is available at GitHib.

Question 2

Much easier to use Lex(Flex)/Yacc(Bison) to do this.

Question 3

IMHO Flex/Bison is something very different and a lot harder to learn and work with. First of all this is just a c++ header you include. In Flex/Bison you mix languages and need separate tools for generating. Secondly, PEG grammars differ greatly from Flex/Bison grammars since they don't require lexing. This makes them a lot more intuitive to construct. I am aware that this particular example is trivial in Bison, however in most cases using Flex actually becomes painfully more complicated (e.g. lexing a String with escaped characters).

Question 4

really i like using namespace std going to use it back in my code too.

Question 5

@MORTAL except here it is used in implementation files (that don't get included elsewhere) so if you know the consequences then that is fine.

Question 6

Just to be clear, are you asking for a review of the code you've posted here or the code at GitHub? If the latter, the question really needs to include the code to be reviewed.

Question 7

You have "data" embedded in code in the function visit_left_binary_operator_list. You could use an std::unordered_map<char, double(*)(double, double)> to help separate the data and the actual algorithm:

void visit_left_binary_operator_list (expression<math_visitor> e){
 static const std::unordered_map<char, double(*)(double, double)> binary_ops = {
 { '+', [](double x, double y) { return x + y; } },
 { '-', [](double x, double y) { return x - y; } },
 { '*', [](double x, double y) { return x * y; } },
 { '/', [](double x, double y) { return x / y; } }
 };
 double lhs = get_value(e[0]);
 for(auto i:range((e.size()-1)/2)*2+1){
 auto op = binary_ops.find(e[i].character());
 if (op == binary_ops.end()) {
 throw "undefined operator";
 }
 double rhs = get_value(e[i+1]);
 lhs = op->second(lhs, rhs);
 }
 value = lhs;
}

You may also want to find a suitable name for the expression \$ ((x-1)/2)*2+1 \$. Currently, it is not really readable, but it feels like an operation common enough in the wild to need a proper name.

Instead of having a special case for exponent, wouldn't it be better to have a visit_right_binary_operator_list as well? It would be a first step to some more generic handling of operators. Also, it would be more generic if operators could be std::string instances instead of char instances to handle multi-character operators.

To answer your question: while I don't have a full understanding of lexers and parsers, what your code does is clear to me. The only thing I had a problem with at first was the value()/get_value() whose naming can be a little bit ambiguous. But once you know visitors, regex and lambdas, you quickly have a good understanding of what is happening.

Question 8

+1 Great idea with the operator map! I should probably make something like range(begin,end,increment_size) to improve readability. For consistency maybe a right operator list would make more sense. I just omitted it here to keep the example short and illustrate that they don't require left-recusion. But I'll probably go for the more consistent way. I used char here because I wanted to show this function in the example but maybe string would show off the posibilites more ;-) . The visitor pattern was the part I was most worried about, I think I will write a tutorial to clarify. Thanks!

Question 9

@Lars I understand your choices. When I wrote my first expression parser (really ugly), I only used one-character operators and only implemented what I needed right now without thinking of the future. Using std::string should be cleaner now that we have the user-defined literal ""s anyway :)

Question 10

Maybe this should be a comment but why is it header-only?

More especially, why is the grammar (e.g. strings such as "Set | Sum") embedded in the header software, instead of being in a text/grammar file? I'm used to seeing the grammar in a file that's separate from the parser software.

My second comment is that you don't seem to have a separate lexxer and parser?

Thirdly it's not obvious at first glance what statements like ...

[](expression e){ e.visitor().visit_left_binary_operator_list(e); }

... are about. Yet apparently (since they're mixed-in to the grammar) I'd need to understand/write statements like that, if I want to use the parser with a new grammar (i.e. if I want to define a new grammar)?

Question 11

I believe you misunderstood what I mean with header-only. The parser generator library itself is header only, meaning you only have to include the header to use it in your project. You can then define the grammar where you want, usually in implementation files. Secondly, you are right, there is no separate lexing step required as you can simply define the tokens in context and mix them with the language. Thirdly, I'm not sure I understand your question, but you don't need to use c++11 lambda functions (normal function pointers would also work).

Morwenn Morwenn 20.2k3 gold badges69 silver badges132 bronze badges · Accepted Answer · 2014-12-18 11:34:20Z

You have "data" embedded in code in the function visit_left_binary_operator_list. You could use an std::unordered_map<char, double(*)(double, double)> to help separate the data and the actual algorithm:

void visit_left_binary_operator_list (expression<math_visitor> e){
 static const std::unordered_map<char, double(*)(double, double)> binary_ops = {
 { '+', [](double x, double y) { return x + y; } },
 { '-', [](double x, double y) { return x - y; } },
 { '*', [](double x, double y) { return x * y; } },
 { '/', [](double x, double y) { return x / y; } }
 };
 double lhs = get_value(e[0]);
 for(auto i:range((e.size()-1)/2)*2+1){
 auto op = binary_ops.find(e[i].character());
 if (op == binary_ops.end()) {
 throw "undefined operator";
 }
 double rhs = get_value(e[i+1]);
 lhs = op->second(lhs, rhs);
 }
 value = lhs;
}

You may also want to find a suitable name for the expression \$ ((x-1)/2)*2+1 \$. Currently, it is not really readable, but it feels like an operation common enough in the wild to need a proper name.

Instead of having a special case for exponent, wouldn't it be better to have a visit_right_binary_operator_list as well? It would be a first step to some more generic handling of operators. Also, it would be more generic if operators could be std::string instances instead of char instances to handle multi-character operators.

To answer your question: while I don't have a full understanding of lexers and parsers, what your code does is clear to me. The only thing I had a problem with at first was the value()/get_value() whose naming can be a little bit ambiguous. But once you know visitors, regex and lambdas, you quickly have a good understanding of what is happening.

+1 Great idea with the operator map! I should probably make something like range(begin,end,increment_size) to improve readability. For consistency maybe a right operator list would make more sense. I just omitted it here to keep the example short and illustrate that they don't require left-recusion. But I'll probably go for the more consistent way. I used char here because I wanted to show this function in the example but maybe string would show off the posibilites more ;-) . The visitor pattern was the part I was most worried about, I think I will write a tutorial to clarify. Thanks!
@Lars I understand your choices. When I wrote my first expression parser (really ugly), I only used one-character operators and only implemented what I needed right now without thinking of the future. Using std::string should be cleaner now that we have the user-defined literal ""s anyway :)

Stack Exchange Network

A header-only linear-time C++11 PEG parser generator supporting left-recursion and grammar ambiguity

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

A header-only linear-time C++11 PEG parser generator supporting left-recursion and grammar ambiguity

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions