I found myself wanting to brush up on some notions about parsers and grammars and, at the same time, to exercise my Haskell - I am a Haskell newbie; moreover, I haven't touched the language at all in a couple months.
You can find the complete result here, and, separated by hrs and without the module boilerplate, pasted below.
I'm not exactly happy with it.
It seems somewhat dirty and convoluted for something that is stated so simply in a few lines of English in the reference textbook - (meaning, the (Purple) Dragon Book).
I am also noticing a complete lack of typically functional constructs - function composition, curried functions, etc - which might be a code smell.
On the other hand, after spending a couple nights on it, I can't seem to find any more opportunity for improvement: I'm stuck.
Any suggestions both in the small and in the large scale are appreciated.
---------------------------------
-- Datatypes and utilities
---------------------------------
data Terminal = T String | Dollar | Epsilon deriving (Show, Eq, Ord)
data Nonterminal = NT String | Start String deriving (Show, Eq, Ord)
type Symbol = (Either Nonterminal Terminal)
data Production = Prod {left_hand :: Nonterminal, right_hand :: [Symbol]} deriving (Show, Eq)
data Grammar = Grammar [Production] deriving (Show)
allTerminalsInGrammar :: Grammar -> Set Terminal
allTerminalsInGrammar (Grammar []) = Data.Set.empty
allTerminalsInGrammar (Grammar (prod:prods)) =
(
Data.Set.union
(allTerminalsInGrammar (Grammar prods))
(Data.Set.fromList $ onlyTerminals (right_hand prod))
)
where onlyTerminals a = rights a
allNonterminalsInGrammar :: Grammar -> Set Nonterminal
allNonterminalsInGrammar (Grammar []) = Data.Set.empty
allNonterminalsInGrammar (Grammar (prod:prods)) =
(
Data.Set.union
(allNonterminalsInGrammar (Grammar prods))
(Data.Set.singleton (left_hand prod))
)
type FirstSet = (Set Terminal)
type FollowSet = (Set Terminal)
data FirstMap = FirstMap (Data.Map.Map Nonterminal (Set Terminal)) deriving (Show, Eq)
-- A FirstMap maps a symbol to its FirstSet
data FollowMap = FollowMap (Data.Map.Map Nonterminal (Set Terminal)) deriving (Show, Eq)
-- A FollowMap maps a symbol to its FollowSet
getFirstSetFor :: Nonterminal -> FirstMap -> FirstSet
getFirstSetFor nonterminal (FirstMap firstmap) =
f $ Data.Map.lookup nonterminal firstmap
where
f Nothing = Data.Set.empty
f (Just set) = set
getFollowSetFor :: Nonterminal -> FollowMap -> FollowSet
getFollowSetFor nonterminal (FollowMap firstmap) =
f $ Data.Map.lookup nonterminal firstmap
where
f Nothing = Data.Set.empty
f (Just set) = set
mergeSetIntoFirstMap :: FirstMap -> Nonterminal -> Set Terminal -> FirstMap
mergeSetIntoFirstMap (FirstMap map) nt t = FirstMap (insertWith (Data.Set.union) nt t map)
mergeSetIntoFollowMap :: FollowMap -> Nonterminal -> Set Terminal -> FollowMap
mergeSetIntoFollowMap (FollowMap map) nt t = FollowMap (insertWith (Data.Set.union) nt t map)
---------------------------------------------------------
fixpoint :: Eq a => (a -> a) -> a -> a
fixpoint f x = let x' = f x in if x == x' then x else fixpoint f x'
----------------------------------------------------
-- FIRST0 gives ∅ to each nonterminal;
--
-- Then repeat:
--
-- If X is a terminal then FIRST(X) = {X}
-- If X is a nonterminal
-- If X -> ε is a production then add ε to FIRST(X)
-- If X->Y1 Y2 ... Yk for some k >= 1
-- If ε is in FIRST−1(Yj) for ALL j= 1...k add ε
-- If ε is in FIRST−1(Yi) for ALL i= 1...l < k
-- then add FIRST(Yl+1)
--
-- Stop when nothing can be added (i.e.: find a fixpoint)
first :: Grammar -> FirstMap
first grammar = (fixpoint (firsti grammar) (first0 grammar))
where
--------------------------------------------------------
first0 :: Grammar -> FirstMap
first0 grammar = FirstMap $
(Data.Map.fromList $ (fmap defaults) $ Data.Set.toList (allNonterminalsInGrammar grammar))
where
defaults nt = (nt, (Data.Set.empty))
--------------------------------------------------------
firsti :: Grammar -> FirstMap -> FirstMap
firsti (Grammar []) fMinus1 = fMinus1
firsti (Grammar (prod:prods)) fMinus1 =
(
mergeSetIntoFirstMap -- Add to the FIRST_(i-1) set,
(firsti (Grammar prods) fMinus1) -- (computed from the remaining productions)
(left_hand prod) -- for the symbol on the LHS of the current production prod,
(terminalsForRHS prod) -- the terminals obtained by applying the rules on the current production prod
)
where
--
terminalsForRHS :: Production -> Set Terminal
terminalsForRHS (Prod _ []) = (Data.Set.singleton Epsilon)
-- If X -> ε is a production then add ε to FIRST(X)
terminalsForRHS (Prod _ [(Right Epsilon)]) = (Data.Set.singleton Epsilon)
terminalsForRHS (Prod x (yj:ys)) =
case yj of
-- If X is a terminal then FIRST(X) = {X}
Right terminal -> Data.Set.singleton terminal
-- Otherwise...
Left nonterminal -> if (Data.Set.member Epsilon first_iminus1)
then (Data.Set.union
(terminalsForYj)
(terminalsForRHS (Prod x ys)))
else (terminalsForYj)
where
first_iminus1 = getFirstSetFor nonterminal fMinus1
terminalsForYj = (Data.Set.delete Epsilon first_iminus1)
-- If ε is in FIRST−1(Yi) for ALL i= 1...l < k
-- then add FIRST(Yl+1)
-- --> guaranteed because we add FIRST(Yl+1) and we
-- continue adding terminalsRHS for the
-- remainder of the production only if ε is in FIRST−1
-- If ε is in FIRST−1(Yj) for ALL j= 1...k add ε
-- --> guaranteed because we continue adding terminalsRHS for the
-- remainder of the production only if ε is in FIRST−1
-- eventually adding terminalsForRHS (Prod _ []) = (Data.Set.singleton Epsilon)
-------------------------------------------------------------------------
-- We can compute FIRST for any string X1X2...Xn as follows.
-- Add to FIRST(X1X2... Xn) all non-ε symbols of FIRST(X1).
-- Also add the non-ε symbols of FIRST(X2), if ε is in FIRST(X1);
-- and so on.
-- Finally, add ε to FIRST(X1X2 . . Xn) if, for all i, ε is in FIRST(Xi)
-------------------------------------------------------------------------
firstForWord :: [Symbol] -> FirstMap -> FirstSet
firstForWord [(Left nt)] first = getFirstSetFor nt first -- If Epsilon in first don't remove
firstForWord [(Right t)] first = (Data.Set.singleton t)
firstForWord ((Left nt):ss) first =
if (Data.Set.member
Epsilon
(getFirstSetFor nt first)
)
then (Data.Set.union
firstMinusEpsilon
(firstForWord ss first)
)
else firstMinusEpsilon
where firstMinusEpsilon = (Data.Set.delete
Epsilon
(getFirstSetFor nt first)
)
----------------------------------------------------
-- FOLLOW0 gives $ to FOLLOW(S) and {} to everything else;
--
-- Then repeat:
--
-- If there is a production A -> w1Bw2 ,
-- then everything in FIRST(w2) except ε is in FOLLOW(B)
-- If where FIRST(w2) contains Epsilon,
-- then everything in FOLLOW (A) is also in FOLLOW (B).
-- If there is a production A -> w1B,
-- then everything in FOLLOW (A) is in FOLLOW (B).
--
-- Stop when nothing can be added (i.e. find a fixpoint)
-----------------------------------------------------
follow :: Grammar -> FollowMap
follow grammar = (fixpoint (followi grammar (first grammar)) (follow0 grammar))
where
--------------------------------------------------------
follow0 :: Grammar -> FollowMap
follow0 grammar = FollowMap $
(Data.Map.fromList $ (fmap defaults) $ Data.Set.toList (allNonterminalsInGrammar grammar))
where
defaults nt@(NT _) = (nt, (Data.Set.empty))
defaults nt@(Start _) = (nt, (Data.Set.singleton Dollar))
--------------------------------------------------------
followi :: Grammar -> FirstMap -> FollowMap -> FollowMap
followi (Grammar []) _ fMinus1 = fMinus1
followi (Grammar (prod:prods)) firstMapForG fMinus1 =
(mergeTerminalsFromProd -- Add the terminals obtained by applying the rules on the current production
prod
(followi --
(Grammar prods) --
firstMapForG -- to the FOLLOW sets obtained from the remaining productions
fMinus1 --
)
)
where
mergeTerminalsFromProd :: Production -> FollowMap -> FollowMap
mergeTerminalsFromProd (Prod l (s:ss)) = (mergeTerminalsFromProd' l [] s ss)
mergeTerminalsFromProd' :: Nonterminal -> [Symbol] -> Symbol -> [Symbol] -> FollowMap -> FollowMap
-- If there is a production A -> w1Bw2 ,
mergeTerminalsFromProd' a w1 nt@(Left b) w2@(w21:w2s) fMinus1 = (mergeSetIntoFollowMap
(mergeTerminalsFromProd' a (w1 ++ [nt]) w21 w2s fMinus1)
b
newTerminals
)
where
-- everything in FIRST(w2) except ε is in FOLLOW(B)
firstW2MinusEpsilon = (Data.Set.delete Epsilon (firstForWord w2 firstMapForG))
newTerminals = if (Data.Set.member Epsilon (firstForWord w2 firstMapForG))
-- If where FIRST(w2) contains Epsilon,
-- then everything in FOLLOW (A) is also in FOLLOW (B).
then (Data.Set.union (getFollowSetFor a fMinus1) firstW2MinusEpsilon)
else (firstForWord w2 firstMapForG)
-- If there is a production A -> w1B, then everything in FOLLOW (A) is in FOLLOW (B).
mergeTerminalsFromProd' a w1 (Left b) [] fMinus1 = (mergeSetIntoFollowMap
fMinus1
b
(getFollowSetFor a fMinus1)
)
-- We don't have to insert anything for terminals; i.e. there are no rules
-- for A -> w1Tw2 where T is terminal; if we are at the end of the production we're done
mergeTerminalsFromProd' a _ (Right _) [] fMinus1 = fMinus1
-- Otherwise just recurse
mergeTerminalsFromProd' a w1 (Right t) (w2:w2s) fMinus1 = (mergeTerminalsFromProd' a (w1 ++ [(Right t)]) w2 w2s fMinus1)
main = let
-- Example from dragon book
dragon_gram = Grammar [p1, p2, p3, p4, p5, p6, p7, p8] where
p1 = Prod {left_hand = (Start "E"), right_hand = [Left (NT "T"), Left (NT "E'")]}
p2 = Prod {left_hand = (NT "E'"), right_hand = [Right (T "+"), Left (NT "T"), Left (NT "E'")]}
p3 = Prod {left_hand = (NT "E'"), right_hand = [Right Epsilon]}
p4 = Prod {left_hand = (NT "T"), right_hand = [Left (NT "F"), Left (NT "T'")]}
p5 = Prod {left_hand = (NT "T'"), right_hand = [Right (T "*"), Left (NT "F"), Left (NT "T'")]}
p6 = Prod {left_hand = (NT "T'"), right_hand = [Right Epsilon]}
p7 = Prod {left_hand = (NT "F"), right_hand = [Right (T "("), Left (Start "E"), Right (T ")")]}
p8 = Prod {left_hand = (NT "F"), right_hand = [Right (T "id")]}
in do
putStrLn "EXAMPLE GRAMMAR: "
print (dragon_gram)
putStrLn "FIRST(G) = "
print (first dragon_gram)
putStrLn "FOLLOW(G) = "
print (follow dragon_gram)
1 Answer 1
I refactored for a while, largely replacing recursion with combinators. I think all these type aliases are unnecessary, I probably would have removed them if I had finished this. You didnt actually use the LHS in terminalsForRHS
, except to pass it into recursive calls, so I took it out.
module First where
import Data.Map
import Data.Maybe
import Data.Either
import Data.Set
import Control.Arrow
---------------------------------
-- Datatypes and utilities
---------------------------------
data Terminal = T String | Dollar | Epsilon deriving (Show, Eq, Ord)
data Nonterminal = NT String | Start String deriving (Show, Eq, Ord)
type Symbol = Either Nonterminal Terminal
type Production = (Nonterminal, Symbol)
type Grammar = [Production]
allTerminalsInGrammar :: Grammar -> Set Terminal
allTerminalsInGrammar = foldMap $ Data.Set.fromList . rights . snd
allNonterminalsInGrammar :: Grammar -> Set Nonterminal
allNonterminalsInGrammar = Data.Set.fromList . map fst
type FFSet = Set Terminal
type FFMap = Data.Map.Map Nonterminal FFSet
-- A First/FollowMap maps a symbol to its First/FollowSet
getSetFor :: Nonterminal -> FFMap -> FFSet
getSetFor nt = fold . Data.Map.lookup nt
mergeSetIntoFFMap :: Nonterminal -> Set Terminal -> FFMap -> FFMap
mergeSetIntoFFMap = insertWith Data.Set.union
---------------------------------------------------------
-----------------------
-- Main program
-----------------------
fixpoint :: Eq a => (a -> a) -> a -> a
fixpoint f x = let x' = f x in if x == x' then x else fixpoint f x'
----------------------------------------------------
-- FIRST0 gives ∅ to each nonterminal;
--
-- Then repeat:
--
-- If X is a terminal then FIRST(X) = {X}
-- If X is a nonterminal
-- If X -> ε is a production then add ε to FIRST(X)
-- If X->Y1 Y2 ... Yk for some k >= 1
-- If ε is in FIRST−1(Yj) for ALL j= 1...k add ε
-- If ε is in FIRST−1(Yi) for ALL i= 1...l < k
-- then add FIRST(Yl+1)
--
-- Stop when nothing can be added (i.e.: find a fixpoint)
----------------------------------------------------
first :: Grammar -> FirstMap
first grammar = fixpoint (`firsti` grammar) Data.Map.empty where
firsti :: FirstMap -> Grammar -> FirstMap
firsti fMinus1 = fromListWith Data.Set.union . (map . right)
-- The terminals obtained by applying the rules on the current production prod
(foldr terminalsForRHS $ Data.Set.singleton Epsilon) where
terminalsForRHS :: Symbol -> Set Terminal -> Set Terminal
-- If X is a terminal then FIRST(X) = {X}
terminalsForRHS (Right terminal) _ = Data.Set.singleton terminal
terminalsForRHS (Left nonterminal) foo = Data.Set.union
( Data.Set.delete Epsilon first_iminus1)
if Data.Set.member Epsilon first_iminus1
then foo
else Data.Set.empty
where
first_iminus1 = getSetFor nonterminal fMinus1
-- If ε is in FIRST−1(Yi) for ALL i= 1...l < k
-- then add FIRST(Yl+1)
-- --> guaranteed because we add FIRST(Yl+1) and we
-- continue adding terminalsRHS for the
-- remainder of the production only if ε is in FIRST−1
-- If ε is in FIRST−1(Yj) for ALL j= 1...k add ε
-- --> guaranteed because we continue adding terminalsRHS for the
-- remainder of the production only if ε is in FIRST−1
-- eventually adding terminalsForRHS (Prod _ []) = Data.Set.singleton Epsilon)
-------------------------------------------------------------------------
-- We can compute FIRST for any string X1X2...Xn as follows.
-- Add to FIRST(X1X2... Xn) all non-ε symbols of FIRST(X1).
-- Also add the non-ε symbols of FIRST(X2), if ε is in FIRST(X1);
-- and so on.
-- Finally, add ε to FIRST(X1X2 . . Xn) if, for all i, ε is in FIRST(Xi)
-------------------------------------------------------------------------
firstForWord :: [Symbol] -> FFMap -> FFSet
firstForWord [Left nt] first = getSetFor nt first -- If Epsilon in first don't remove
firstForWord [Right t] first = Data.Set.singleton t
firstForWord (Left nt:ss) first = Data.Set.union
( Data.Set.delete Epsilon $ getSetFor nt first)
if Data.Set.member Epsilon $ getSetFor nt first
then firstForWord ss first
else Data.Set.empty
Edit:
firstForWord :: FFMap -> [Symbol] -> FFSet
firstForWord first = unsnoc >>> \(Just (lefts, last)) ->
foldr foo (either (`getSetFor` first) Data.Set.singleton) lefts where
foo (Left nt) bar = Data.Set.union
( Data.Set.delete Epsilon $ getSetFor nt first)
if Data.Set.member Epsilon $ getSetFor nt first
then bar
else Data.Set.empty
And then, if you don't mind making it more defined, make this and the above similar snippet use the same first argument to foldr.
-
\$\begingroup\$ Thank you a lot, this is really a great help as it is. I had a few "d'oh" moments while reading it, I expect I'll have more when altering my code to reflect yours. Just one thing, though: why did you merge
FirstSet
andFollowSet
intoFFSet
? I originally kept them separate on purpose, to make sure not to use a FirstSet as a FollowSet or viceversa inside some nested call. Is that a bad Java-ism? Does Haskell provide a better way to avoid that? Thank you. \$\endgroup\$Tobia Tesan– Tobia Tesan2016年05月03日 17:02:43 +00:00Commented May 3, 2016 at 17:02 -
\$\begingroup\$ I find that these wrappers, that everyone invents anew, make it harder to see patterns that can be replaced by a combinator from some library that already did the heavy lifting. Mixing up which goes where shouldn't be much of a problem once the code has been reduced sufficiently. Repost if you find many reductions, I might see more on another pass. \$\endgroup\$Gurkenglas– Gurkenglas2016年05月03日 18:09:24 +00:00Commented May 3, 2016 at 18:09