When building the DFA for the lexer of my programming language, should each character (e.g., n, i, f) appear as a single shared node across all token paths, or should I allow duplicate nodes for the same character if they appear in different paths (e.g., separate n nodes for int, return, and blank)?
What are the trade-offs between these two approaches in terms of implementation efficiency, clarity, and correctness?
I added a picture to visualize what I meant.
1 Answer 1
A node in a DFA representation corresponds to a state. If you want to represent two distinct states, you need at least two distinct nodes.
So when your language has the words return and func, and you have so far read ret, you are in a different state than if you would have read f, even though in both states you have a transition for the letter u. The thing is, after you have processed that u you must still know whether you were in the process of completing the word return or func, otherwise you would also accept the words furn and retunc. So no, you cannot consider these two states to be equal.
Realise that a state does not relate directly to the next character that you may accept. It is more correct to think of it as representing what you already have read.
So in your DFA graph, you could combine the nodes i and i3 into one node, and then have two outgoing transitions from there to proceed to either if or int. You have already applied this principle for the starting letter e, but could have continued doing that for the subsequent l (merging l1 and l2) and s (merging s1 and s2), so that the split would only occur at that latter state into e (for else) and i (for elsif).