1
$\begingroup$

I am experimenting with LL(1) parsers that do not use a separate lexer.

I have the following grammar:

S = (('<' S '>' | 'a'+) ' '?)+

The notation uses ' guotes to indicate terminals and familiar regex like syntax for union, grouping, repetition, optional parts etc. It could also be rewritten more like BNF as follows:

S = F OWS Q
Q = F OWS Q
Q = 
F = '<' S '>'
F = 'a' G
G = 'a'
G = 
OWS = ' '
OWS = 

This grammar is not LL(1) because, since the spaces separating the sequences of 'a's is optional, it would not be possible to distinguish between multiple concatenated sequences or even one large sequence.

The grammar is intended to parse languages like in the following lines:

aaa aaa aaa
a a <a a <a>a aa>
a<a>a<aaaaa><a<<<a>>>>

That is, consecutive occurrences of 'a' separated by spaces and or matching brackets.

If I were to add a lexer to greedily tokenize sequences of 'a' into the single token TK_IDENTIFIER the grammar becomes LL(1) again.

S = (('<' S '>' | TK_IDENTIFIER) ' '?)+

however I have found another way:

The rules for building the LL(1) parse table are as follows:

T[A][c] contains the rule A = w if and only if : (c is in First(w)) or (w is nullable and c is in Follow(A))

Having two productions in the same cell in the table is illegal. However, If I say that a non empty production is chosen in favor of an empty one I can produce a table that works.

What is the effect of modifying this rule in terms of the languages that can be parsed? In terms of the meaning of the grammar?

Is it correct to say that this modification is a method of resolving ambiguity? A specific case of ambiguity maybe?

Is there an obvious defect in this approach? any pathologically disgusting grammars you can think of that would produce unexpected results?

Does modifying this rule have the same effect as using a lexer that is purely defined by regular expressions? eg: r/a+/ in this case.


Consider an even simpler example:

S = ('a'+)+
-- or --
S = T U
U = T U
U = 
T = 'a' V
V = 'a' V
V = 

and with a tokenizer

S = (TK_IDENTIFIER)+
-- or --
S = TK_IDENTIFIER U
U = TK_IDENTIFIER U
U = 
asked Jun 30, 2024 at 13:36
$\endgroup$

0

Know someone who can answer? Share a link to this question via email, Twitter, or Facebook.

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.