I'm trying to write a (scannerless) recursive descent parser with a "catch all" rule for the following "Mustache template" grammar (simplified here):
content : (variable_tag | section_tag | static)*
variable_tag : mustache_open id mustache_close
section_tag : mustache_open '#' id mustache_close
              content mustache_open '/' id mustache_close
mustache_open : '{{'
mustache_close : '}}'
id : [a-zA-Z$_][a-zA-Z0-9$_]*
static : .+ // "catch all"
The static production would have to stop before the next matched production, and I can't come up with a solution for this that doesn't break the grammar structure.
A valid input is:
You have just won {{value}} dollars!
{{#in_ca}}
Well, {{taxed_value}} dollars, after taxes.
{{/in_ca}}
The output for that would be an abstract syntax tree like:
                         +-------+
                         |Content|
                         +-------+
                             |
       +---------------+-----+--------+-------------+
       |               |              |             |
+--------------+ +-----------+ +-------------+ +----------+
|Static        | |VariableTag| |Static       | |SectionTag|
|"\nYou...won "| |value      | |" dollars!\n"| |in_ca     |
+--------------+ +-----------+ +-------------+ +----------+
                                                    |
                                                +-------+
                                                |Content|
                                                +-------+
                                                    |
                                      +-------------+------------------+
                                      |             |                  |
                                 +----------+ +-----------+ +---------------------+
                                 |Static    | |VariableTag| |Static               |
                                 |"\nWell, "| |taxed_value| |" dollars...taxes.\n"|
                                 +----------+ +-----------+ +---------------------+
Is there any reference to an implementation of a "catch all" rule for a recursive descent parser?
2 Answers
Your grammar will be correct if your catch-all rule only consumes a single character and if you use prioritized choice in the grammar so that static is only attempted as a last resort. Of course, these single-character tokens are terribly inefficient. There are two ways to fix this:
- let the catch-all rule be
static : [^\{]+ | .
- have your parser store a pointer to the previous token. If you are reading a static token and the previous token was static, then append the string rather than creating a new item in your AST. This can actually be a bit tricky to pull off, since in a recursive descent parser each rule would usually either fail or return an AST subtree. It might make sense to introduce a pseudo-token signifying "success, but no action needed", since in that fix-up case the text already resides in the AST and should not be added twice. (A rough sketch of both ideas follows below.)
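For concreteness, here is a minimal Python sketch of this (not from the answer itself; the Parser class and node shapes are invented for illustration). It uses prioritized choice with the greedy [^{]+ | . catch-all tried only after the tag rules fail, and performs the "append to the previous static" fix-up directly in the content loop rather than via a pseudo-token.

```python
import re

ID = re.compile(r'[a-zA-Z$_][a-zA-Z0-9$_]*')

class Parser:
    """Hypothetical scannerless recursive descent parser for the grammar above.
    AST nodes are plain tuples to keep the sketch short."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def parse_content(self, stop=None):
        items = []
        while self.pos < len(self.text):
            if stop and self.text.startswith(stop, self.pos):
                break                      # let the caller consume the closing tag
            node = self.parse_tag() or self.parse_static()
            # Merge runs of static text into a single AST item.
            if node[0] == 'static' and items and items[-1][0] == 'static':
                items[-1] = ('static', items[-1][1] + node[1])
            else:
                items.append(node)
        return ('content', items)

    def parse_tag(self):
        start = self.pos
        if not self.text.startswith('{{', self.pos):
            return None
        self.pos += 2
        section = self.text.startswith('#', self.pos)
        if section:
            self.pos += 1
        m = ID.match(self.text, self.pos)
        if not m or not self.text.startswith('}}', m.end()):
            self.pos = start               # backtrack; static will pick this up
            return None
        name = m.group()
        self.pos = m.end() + 2
        if not section:
            return ('variable', name)
        close = '{{/' + name + '}}'
        body = self.parse_content(stop=close)
        if not self.text.startswith(close, self.pos):
            self.pos = start               # unclosed section: backtrack to static
            return None
        self.pos += len(close)
        return ('section', name, body)

    def parse_static(self):
        # static : [^{]+ | .   -- tried last, so tags always win
        m = re.match(r'[^{]+|.', self.text[self.pos:], re.DOTALL)
        self.pos += len(m.group())
        return ('static', m.group())
```

On the example input from the question, Parser(text).parse_content() should produce a nested ('content', [...]) structure roughly matching the tree above; on input like abc{{123 the tag rule fails, the two braces come back as single-character static matches, and the merge folds everything into one static node.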
There is a problem with this: since at no point do you commit to one alternative, this parser involves a lot of backtracking. I would expect this to cause exponential complexity unless you use a Packrat strategy (or a parsing technique that is actually suited to ambiguous grammars). Here's an example input that triggers this behaviour:
{{#foo}} {{#a}} {{#a}} {{#a}} {{/foo} sic!
According to your grammar, this input would be completely static, but the parser would recurse four times into the section rule.
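The Packrat remark can be illustrated with a small memoisation wrapper over the Parser sketch above; this is an assumption-laden sketch, not part of the original answer, and all names are made up.

```python
from functools import wraps

def memo(rule):
    # Packrat-style memoisation: cache (result, end position) per
    # (rule name, start position, arguments) on the parser instance, so that
    # backtracking into an alternative already tried at this offset is
    # answered from the cache instead of being re-parsed.
    @wraps(rule)
    def wrapper(self, *args, **kwargs):
        cache = self.__dict__.setdefault('_memo', {})
        key = (rule.__name__, self.pos, args, tuple(sorted(kwargs.items())))
        if key in cache:
            result, end_pos = cache[key]
            self.pos = end_pos
            return result
        result = rule(self, *args, **kwargs)
        cache[key] = (result, self.pos)
        return result
    return wrapper

# Usage (hypothetical): decorate the rule methods of the Parser sketch, e.g.
#
#     @memo
#     def parse_tag(self): ...
```

With every rule memoised, the number of distinct (rule, position) entries is linear in the input length, which is what keeps a Packrat parser from going exponential on inputs like the one above.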
For this particular grammar, I think you can just match until you see {{, then push {{ back on the tokens to be read.
So I think the key here is just having the ability to push back onto the token list. Having a fixed buffer of tokens is generally good for error handling in LR parsers too, and I wouldn't be surprised if it is useful in LL parsers as well.
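As a hedged illustration of that approach (the Scanner class and token shapes below are invented, not taken from the answer): static text is scanned in one go up to the next {{, and push_back() gives the parser a small pushback buffer for tokens it read while tentatively trying a tag.

```python
import re

OPEN, CLOSE = '{{', '}}'
ID = re.compile(r'[a-zA-Z$_][a-zA-Z0-9$_]*')

class Scanner:
    """Hypothetical scanner with a pushback buffer."""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.pushed = []                   # tokens the parser pushed back

    def push_back(self, token):
        self.pushed.append(token)

    def scan_static(self):
        # Read raw text up to (but not including) the next '{{', or to EOF.
        cut = self.text.find(OPEN, self.pos)
        cut = len(self.text) if cut == -1 else cut
        chunk, self.pos = self.text[self.pos:cut], cut
        return ('STATIC', chunk)

    def next(self):
        # Token reader used inside tags; content-level text goes through
        # scan_static() instead.
        if self.pushed:
            return self.pushed.pop()
        if self.pos >= len(self.text):
            return ('EOF', '')
        for literal in (OPEN, CLOSE, '#', '/'):
            if self.text.startswith(literal, self.pos):
                self.pos += len(literal)
                return (literal, literal)
        m = ID.match(self.text, self.pos)
        if m:
            self.pos = m.end()
            return ('ID', m.group())
        ch = self.text[self.pos]           # unexpected character: let the
        self.pos += 1                      # parser fail the tag and recover
        return ('CHAR', ch)
```

As the comment below notes, input like abc{{123 still needs a fallback: when the tag rules fail after {{, the parser has to push the consumed tokens back (or fold their text into the preceding static node), which is essentially the "append to the previous static" fix-up from the first answer.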
- That doesn't work as the grammar is defined; consider abc{{123 - that should match static. (Telastyn, May 28, 2015 at 0:54)
- And that's true. But it does seem like the easiest way to solve this is some form of tokenization, at the expense of the pretty grammar. I think the tokenization can likely be done efficiently, and then matching something not too far off from the original grammar becomes much easier. How would you approach it? (J Trana, May 28, 2015 at 1:40)
- What should static match? Not .*, since .* would require arbitrary lookahead to know when to stop, and .+ suffers from the same problems.