5
\$\begingroup\$

I wrote a tokenizer/lexer (difference?) for Python in Haskell: this is my code on GitHub.

I already tested it out with some of CPython's standard library scripts, and it doesn't fail. However, it probably outputs non-conforming output. But I don't care about this right now.

What I'm looking for is a general (functional) style review. This is my first time using Parsec and Haskell for something relatively big, so I expect my functional style to be suboptimal.

Significant code snippets:

parseTokens :: LexemeParser [PositionedToken]
parseTokens = do
 P.skipMany ignorable
 tokens <- concat <$> P.many parseLogicalLine
 P.eof
 endIndents <- length . filter (>0) <$> getIndentLevels
 dedents <- mapM position $ replicate endIndents Dedent
 endMarker <- position EndMarker
 return (tokens ++ dedents ++ [endMarker])
parseIndentation :: LexemeParser [PositionedToken]
parseIndentation = mapM position =<< go
 where
 go = do
 curr <- length <$> P.many (P.char ' ')
 prev <- getIndentLevel
 case curr `compare` prev of
 EQ -> return []
 GT -> return [Indent] <* pushIndentLevel curr
 LT -> do
 levels <- getIndentLevels
 when (not $ elem curr levels) (fail $ "indentation must be: " ++ show levels)
 let (pop, levels') = span (> curr) levels
 putIndentLevels levels'
 return $ replicate (length pop) Dedent
parseLogicalLine :: LexemeParser [PositionedToken]
parseLogicalLine = do
 P.skipMany ignorable
 indentation <- parseIndentation
 (indentation ++) <$> begin
 where
 begin = do
 tokens <- P.many1 parsePositionedToken
 let continue = (tokens ++) <$> begin
 implicitJoin <- (> 0) <$> getOpenBraces
 if implicitJoin then do
 whitespace
 parseComment <|> (P.optional backslash *> eol)
 P.skipMany ignorable
 whitespace
 continue
 else do
 whitespace
 explicitJoin <- P.optionMaybe (backslash *> eol)
 case explicitJoin of
 Nothing -> ((\t -> tokens ++ [t]) <$> position NewLine) <* P.optional eol
 otherwise -> whitespace *> continue

(I was largely inspired by PureScript's lexer, though I probably ended up with something a lot worse.)

200_success
145k22 gold badges190 silver badges478 bronze badges
asked Aug 22, 2015 at 23:18
\$\endgroup\$

1 Answer 1

3
\$\begingroup\$

for comparison

Here are some other ways I've found to parse indentation sensitive language:

  1. Use Text.Parsec.Indent Here is a blog post containing an example: (link)

  2. Have a look at the language-python package. It's lexer is built using alex: (link)

  3. Purescript handles indentation via the mark, checkIndentation, indented and same functions: (link) Grep the rest of the source code for these functions to see how they are used. Its lexer defines a Indent token (link) but from what I can tell it is never returned, i.e. see (link). The indentation level is kept track of by the parser (link) via the mark function.

returning a list of tokens?

I'm not sure there is a lot of value in parsers which return [PositionedToken] except, perhaps, for testing purposes.

Most of the time you want to run a token parser incrementally, but parseTokens has to consume the entire file before it can return anything.

There should be a function which just returns the next token. This will come in handy in conjunction with the next comment I have...

handling dedents at eof

I would find another way to handle implicit dedents at the end of the file. I'm sure this can be handled at the grammar level. The code to add them is extremely messy - don't you agree?

Have a look at how the language-python package does it:

https://github.com/bjpop/language-python/blob/master/src/Language/Python/Version2/Parser/Lexer.x#L229-L242

The function lexToken returns just a single token. The lexer has state, so when it reaches EOF it checks the current indentation level. If the indentation level is> 1, it decrements it and returns a dedentToken. Otherwise it returns an EOF token.

answered Sep 11, 2015 at 3:32
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.