I am experimenting with LL(1) parsers that do not use a separate lexer.
I have the following grammar:
S = (('<' S '>' | 'a'+) ' '?)+
The notation uses ' quotes to indicate terminals and familiar regex-like syntax for union, grouping, repetition, optional parts, etc. It could also be rewritten more like BNF as follows:
S = F OWS Q
Q = F OWS Q
Q =
F = '<' S '>'
F = 'a' G
G = 'a' G
G =
OWS = ' '
OWS =
This grammar is not LL(1): because the spaces separating the sequences of 'a's are optional, a single token of lookahead cannot tell whether the next 'a' continues the current sequence or starts a new one.
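Concretely (if I have computed First and Follow correctly), the collision is in the cell for G on lookahead 'a': because OWS is nullable, the 'a' that starts the next F ends up in Follow(G).
First('a' G) = { 'a' }
Follow(G)    = Follow(F) = { 'a', ' ', '<', '>', EOF }
T[G]['a'] gets  G = 'a' G   because 'a' is in First('a' G)
and also        G =         because the empty production is nullable and 'a' is in Follow(G)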
The grammar is intended to parse strings like the following lines:
aaa aaa aaa
a a <a a <a>a aa>
a<a>a<aaaaa><a<<<a>>>>
That is, consecutive runs of 'a' separated by optional spaces and/or enclosed in matching angle brackets.
If I add a lexer that greedily tokenizes sequences of 'a' into a single TK_IDENTIFIER token, the grammar becomes LL(1):
S = (('<' S '>' | TK_IDENTIFIER) ' '?)+
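Just to be explicit about what I mean by the lexer, here is a minimal sketch (the names TOKEN_RE, tokenize and TK_IDENTIFIER are mine, not from any particular library):

import re

# Greedy lexer: r/a+/ collapses a run of 'a's into one TK_IDENTIFIER token,
# while '<', '>' and ' ' remain single-character tokens.
TOKEN_RE = re.compile(r"a+|<|>| ")

def tokenize(text):
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        lexeme = m.group()
        kind = "TK_IDENTIFIER" if lexeme[0] == "a" else lexeme
        tokens.append((kind, lexeme))
        pos = m.end()
    return tokens

# tokenize("a<aa>") == [('TK_IDENTIFIER', 'a'), ('<', '<'),
#                       ('TK_IDENTIFIER', 'aa'), ('>', '>')]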
However, I have found another way.
The rule for building the LL(1) parse table is as follows:
T[A][c] contains the production A = w if and only if c is in First(w), or w is nullable and c is in Follow(A).
Having two productions in the same cell of the table is illegal. However, if I say that a non-empty production is chosen in favor of an empty one, I can produce a table that works.
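In code, the modification is nothing more than a tie-break in the table-building loop. This is only a sketch of what I am doing; the helpers first_of, follow and nullable are assumed to exist (standard First/Follow computation), productions are tuples of symbols, and the empty production is ():

def build_table(grammar, first_of, follow, nullable):
    # grammar:     dict nonterminal -> list of productions (tuples of symbols)
    # first_of(w): terminals in First(w), epsilon excluded
    # follow[A]:   Follow set of nonterminal A
    # nullable(w): True if w derives the empty string
    table = {}
    for A, productions in grammar.items():
        for w in productions:
            lookaheads = set(first_of(w))
            if nullable(w):
                lookaheads |= follow[A]
            for c in lookaheads:
                old = table.get((A, c))
                if old is None:
                    table[(A, c)] = w
                elif old == () or w == ():
                    # First/Follow collision between an empty and a non-empty
                    # production: keep the non-empty one (the modified rule).
                    table[(A, c)] = old if w == () else w
                else:
                    raise ValueError(f"genuine LL(1) conflict in T[{A}][{c!r}]")
    return table

With the grammar above, I believe the only cell this tie-break touches is T[G]['a'], which ends up holding G = 'a' G.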
What is the effect of modifying this rule, in terms of the languages that can be parsed? In terms of the meaning of the grammar?
Is it correct to say that this modification is a method of resolving ambiguity? A specific case of ambiguity, maybe?
Is there an obvious defect in this approach? Are there any pathologically disgusting grammars you can think of that would produce unexpected results?
Does modifying this rule have the same effect as using a lexer defined purely by regular expressions (e.g. r/a+/ in this case)?
Consider an even simpler example:
S = ('a'+)+
-- or --
S = T U
U = T U
U =
T = 'a' V
V = 'a' V
V =
and with a tokenizer:
S = (TK_IDENTIFIER)+
-- or --
S = TK_IDENTIFIER U
U = TK_IDENTIFIER U
U =
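For this smaller pair the correspondence seems clearest. In the untokenized version the only collision I can find is again between the recursive production and the empty one (assuming I have the sets right):

First('a' V) = { 'a' }
Follow(V)    = Follow(T) = { 'a', EOF }
T[V]['a'] gets both  V = 'a' V  and  V =

Preferring the non-empty production keeps V = 'a' V in that cell, so V consumes 'a's greedily, which looks to me like exactly what the r/a+/ tokenizer does.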