I'm trying to write a (scannerless) recursive descent parser with a "catch all" rule for the following "Mustache template" grammar (simplified here):
content : (variable_tag | section_tag | static)*
variable_tag : mustache_open id mustache_close
section_tag : mustache_open '#' id mustache_close
              content mustache_open '/' id mustache_close
mustache_open : '{{'
mustache_close : '}}'
id : [a-zA-Z$_][a-zA-Z0-9$_]*
static : .+ // "catch all"
The static production would have to stop before the next matched production, and I can't come up with a solution for this that doesn't break the grammar structure.
A valid input is:
You have just won {{value}} dollars!
{{#in_ca}}
Well, {{taxed_value}} dollars, after taxes.
{{/in_ca}}
The output for that would be an abstract syntax tree like:
                         +-------+
                         |Content|
                         +-------+
                             |
       +---------------+-----+--------+-------------+
       |               |              |             |
+--------------+ +-----------+ +-------------+ +----------+
|Static        | |VariableTag| |Static       | |SectionTag|
|"\nYou...won "| |value      | |" dollars!\n"| |in_ca     |
+--------------+ +-----------+ +-------------+ +----------+
                                                    |
                                                +-------+
                                                |Content|
                                                +-------+
                                                    |
                                      +-------------+------------------+
                                      |             |                  |
                                 +----------+ +-----------+ +---------------------+
                                 |Static    | |VariableTag| |Static               |
                                 |"\nWell, "| |taxed_value| |" dollars...taxes.\n"|
                                 +----------+ +-----------+ +---------------------+
Is there any reference to an implementation of a "catch all" rule for a recursive descent parser?
2 Answers
Your grammar will be correct if your catch-all rule only consumes a single character and if you use prioritized choice in the grammar so that static is only attempted as a last resort. Of course, these single-character tokens are terribly inefficient. There are two ways to fix this:
- let the catch-all rule be
static : [^\{]+ | .
- have your parser store a pointer to the previous token. If you are reading a static token and the previous token was static, then append the string rather than creating a new item in your AST. This can actually be a bit tricky to pull off, since in a recursive descent parser each rule would usually either fail or return an AST subtree. It might make sense to introduce a pseudo-token signifying "success, but no action needed", since in that fix-up case the text already resides in the AST and should not be added twice. (A rough sketch of both ideas follows below.)
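For concreteness, here is a minimal Python sketch of this (not from the answer itself; the Parser class and node shapes are invented for illustration). It uses prioritized choice with the greedy [^{]+ | . catch-all tried only after the tag rules fail, and performs the "append to the previous static" fix-up directly in the content loop rather than via a pseudo-token.

```python
import re

ID = re.compile(r'[a-zA-Z$_][a-zA-Z0-9$_]*')

class Parser:
    """Hypothetical scannerless recursive descent parser for the grammar above.
    AST nodes are plain tuples to keep the sketch short."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def parse_content(self, stop=None):
        items = []
        while self.pos < len(self.text):
            if stop and self.text.startswith(stop, self.pos):
                break                      # let the caller consume the closing tag
            node = self.parse_tag() or self.parse_static()
            # Merge runs of static text into a single AST item.
            if node[0] == 'static' and items and items[-1][0] == 'static':
                items[-1] = ('static', items[-1][1] + node[1])
            else:
                items.append(node)
        return ('content', items)

    def parse_tag(self):
        start = self.pos
        if not self.text.startswith('{{', self.pos):
            return None
        self.pos += 2
        section = self.text.startswith('#', self.pos)
        if section:
            self.pos += 1
        m = ID.match(self.text, self.pos)
        if not m or not self.text.startswith('}}', m.end()):
            self.pos = start               # backtrack; static will pick this up
            return None
        name = m.group()
        self.pos = m.end() + 2
        if not section:
            return ('variable', name)
        close = '{{/' + name + '}}'
        body = self.parse_content(stop=close)
        if not self.text.startswith(close, self.pos):
            self.pos = start               # unclosed section: backtrack to static
            return None
        self.pos += len(close)
        return ('section', name, body)

    def parse_static(self):
        # static : [^{]+ | .   -- tried last, so tags always win
        m = re.match(r'[^{]+|.', self.text[self.pos:], re.DOTALL)
        self.pos += len(m.group())
        return ('static', m.group())
```

On the example input from the question, Parser(text).parse_content() should produce a nested ('content', [...]) structure roughly matching the tree above; on input like abc{{123 the tag rule fails, the two braces come back as single-character static matches, and the merge folds everything into one static node.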
There is a problem with this: since at no point do you commit to one alternative, this parser involves a lot of backtracking. I would expect this to cause exponential complexity unless you use a Packrat strategy (or a parsing technique that is actually suited to ambiguous grammars). Here's an example input that triggers this behaviour:
{{#foo}} {{#a}} {{#a}} {{#a}} {{/foo} sic!
According to your grammar, this input would be completely static, but the parser would recurse four times into the section rule.
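The Packrat remark can be illustrated with a small memoisation wrapper over the Parser sketch above; this is an assumption-laden sketch, not part of the original answer, and all names are made up.

```python
from functools import wraps

def memo(rule):
    # Packrat-style memoisation: cache (result, end position) per
    # (rule name, start position, arguments) on the parser instance, so that
    # backtracking into an alternative already tried at this offset is
    # answered from the cache instead of being re-parsed.
    @wraps(rule)
    def wrapper(self, *args, **kwargs):
        cache = self.__dict__.setdefault('_memo', {})
        key = (rule.__name__, self.pos, args, tuple(sorted(kwargs.items())))
        if key in cache:
            result, end_pos = cache[key]
            self.pos = end_pos
            return result
        result = rule(self, *args, **kwargs)
        cache[key] = (result, self.pos)
        return result
    return wrapper

# Usage (hypothetical): decorate the rule methods of the Parser sketch, e.g.
#
#     @memo
#     def parse_tag(self): ...
```

With every rule memoised, the number of distinct (rule, position) entries is linear in the input length, which is what keeps a Packrat parser from going exponential on inputs like the one above.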
For this particular grammar, I think you can just match until you see {{, then push {{ back on the tokens to be read.
So I think the key here is just having the ability to push back onto the token list. Having a fixed buffer of tokens is generally good for error handling in LR parsers too, and I wouldn't be surprised if it is useful in LL parsers as well.
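As a hedged illustration of that approach (the Scanner class and token shapes below are invented, not taken from the answer): static text is scanned in one go up to the next {{, and push_back() gives the parser a small pushback buffer for tokens it read while tentatively trying a tag.

```python
import re

OPEN, CLOSE = '{{', '}}'
ID = re.compile(r'[a-zA-Z$_][a-zA-Z0-9$_]*')

class Scanner:
    """Hypothetical scanner with a pushback buffer."""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.pushed = []                   # tokens the parser pushed back

    def push_back(self, token):
        self.pushed.append(token)

    def scan_static(self):
        # Read raw text up to (but not including) the next '{{', or to EOF.
        cut = self.text.find(OPEN, self.pos)
        cut = len(self.text) if cut == -1 else cut
        chunk, self.pos = self.text[self.pos:cut], cut
        return ('STATIC', chunk)

    def next(self):
        # Token reader used inside tags; content-level text goes through
        # scan_static() instead.
        if self.pushed:
            return self.pushed.pop()
        if self.pos >= len(self.text):
            return ('EOF', '')
        for literal in (OPEN, CLOSE, '#', '/'):
            if self.text.startswith(literal, self.pos):
                self.pos += len(literal)
                return (literal, literal)
        m = ID.match(self.text, self.pos)
        if m:
            self.pos = m.end()
            return ('ID', m.group())
        ch = self.text[self.pos]           # unexpected character: let the
        self.pos += 1                      # parser fail the tag and recover
        return ('CHAR', ch)
```

As the comment below notes, input like abc{{123 still needs a fallback: when the tag rules fail after {{, the parser has to push the consumed tokens back (or fold their text into the preceding static node), which is essentially the "append to the previous static" fix-up from the first answer.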
- That doesn't work as the grammar is defined; consider abc{{123 - that should match static. (Telastyn, May 28, 2015 at 0:54)
- And that's true. But it does seem like the easiest way to solve this is some form of tokenization, at the expense of the pretty grammar. I think the tokenization can likely be done efficiently, and then matching something not too far off from the original grammar becomes much easier. How would you approach it? (J Trana, May 28, 2015 at 1:40)
- What should static match? Not .*, since .* would require arbitrary lookahead to know when to stop, and .+ suffers from the same problems.