Designing a DFA for a Lexer: Shared vs. Separate Character Nodes

Question 1

When building the DFA for the lexer of my programming language, should each character (e.g., n, i, f) appear as a single shared node across all token paths, or should I allow duplicate nodes for the same character if they appear in different paths (e.g., separate n nodes for int, return, and blank)?

What are the trade-offs between these two approaches in terms of implementation efficiency, clarity, and correctness?

enter image description here

I added a picture to visualize what I meant.

Question 2

A node in a DFA representation corresponds to a state. If you want to represent two distinct states, you need at least two distinct nodes.

So when your language has the words return and func, and you have so far read ret, you are in a different state than if you would have read f, even though in both states you have a transition for the letter u. The thing is, after you have processed that u you must still know whether you were in the process of completing the word return or func, otherwise you would also accept the words furn and retunc. So no, you cannot consider these two states to be equal.

Realise that a state does not relate directly to the next character that you may accept. It is more correct to think of it as representing what you already have read.

So in your DFA graph, you could combine the nodes i and i3 into one node, and then have two outgoing transitions from there to proceed to either if or int. You have already applied this principle for the starting letter e, but could have continued doing that for the subsequent l (merging l1 and l2) and s (merging s1 and s2), so that the split would only occur at that latter state into e (for else) and i (for elsif).

trincot 357k38 gold badges282 silver badges340 bronze badges · Accepted Answer · 2025-04-06 13:33:49Z

A node in a DFA representation corresponds to a state. If you want to represent two distinct states, you need at least two distinct nodes.

So when your language has the words return and func, and you have so far read ret, you are in a different state than if you would have read f, even though in both states you have a transition for the letter u. The thing is, after you have processed that u you must still know whether you were in the process of completing the word return or func, otherwise you would also accept the words furn and retunc. So no, you cannot consider these two states to be equal.

Realise that a state does not relate directly to the next character that you may accept. It is more correct to think of it as representing what you already have read.

So in your DFA graph, you could combine the nodes i and i3 into one node, and then have two outgoing transitions from there to proceed to either if or int. You have already applied this principle for the starting letter e, but could have continued doing that for the subsequent l (merging l1 and l2) and s (merging s1 and s2), so that the split would only occur at that latter state into e (for else) and i (for elsif).

CollectivesTM on Stack Overflow

Designing a DFA for a Lexer: Shared vs. Separate Character Nodes

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

CollectivesTM on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related