How to shorten this terrible HTTP header parser?

Question 1

I am trying to read a line ending in \r\n from a Handle. This is an HTTP header.

I’m fairly new to functional programming, so I tend to use a lot of case expressions and stuff. This makes the code very long and ugly.

handleRequest :: HostName -> Handle -> IO ()
handleRequest host handle = do
 requestLine <- readHeaderLine
 putStrLn $ requestLine ++ "\n-------------------"
 -- FIXME: This code is bad, and its author should feel bad.
 where
 readHeaderLine = do
 readHeaderLine' ""
 where
 readHeaderLine' s = do
 chr <- hGetChar handle
 case chr of
 '\r' -> do
 nextChr <- hGetChar handle
 case nextChr of
 '\n' -> return s
 _ -> readHeaderLine' $ s ++ [chr, nextChr]
 _ -> readHeaderLine' $ s ++ [chr]

How can I reduce the amount of case expressions in this code? I thought of using Parsec, but that seemed overkill to me for something this trivial, and I don’t know how well it works with Handles.

Question 2

The answer depends on your goal. If you want high performance or an elegant HTTP parser, you should choose different algorithm and libraries. If you just want to see how your current code can be cleaned up with only structural changes without changing the algoritm, it's a different story. Please specify.

Question 3

@nponeccop performance isn’t really an issue. The point is making the code more concise.

Question 4

Can't you just use hGetLine? There are many alternatives to your code. For example, you can just use hGetContent and parse the resulting list.

Question 5

@nponeccop hGetContent won’t work, since the length of the data isn’t known before the headers are parsed. hGetLine won’t work if the headers contain a newline.

Question 6

hGetContent doesn't require length to be known in advance.

Question 7

You can use hGetContents to read lazily from a handle, and the split package (which will hopefully be bundled into the next Haskell Platform release) provides some nice ways of dealing with lists.

import Data.List.Split (splitOn)
import System.IO (hGetContents)
... do
 ...
 contentLines <- splitOn "\r\n" <$> hGetContents handle

contentLines will contain a lazy list of the "entire contents" of that handle, split into chunks that were originally separated by \r\n.

Another approach is to use something like the conduit package. See, for example, Data.Conduit.Binary.lines. Keep an eye on conduit & friends; I get the feeling that within the next year or so the Haskell community will start to agree on the "best" implementation of this sort of abstraction, and then some good tutorials will inevitably follow.

Question 8

Here I assume you just want to see how your current code can be improved without changing the algoritm.

First of all, give HLint tool a chance to suggest you obvious improvements. In your case the only improvement was that do in do readHeaderLine' "" was redundant, so not much.

Second, in my opinion many small top-level definitions are better than few large ones. You can still control namespace pollution by not exporting definitions local to the module:

import System.IO
type HostName = String
handleRequest :: HostName -> Handle -> IO ()
handleRequest host handle = do
 requestLine <- readHeaderLine handle
 putStrLn $ requestLine ++ "\n-------------------"
-- FIXME: This code is bad, and its author should feel bad.
readHeaderLine handle = readHeaderLine' "" where
 readHeaderLine' s = do
 chr <- hGetChar handle
 case chr of
 '\r' -> do
 nextChr <- hGetChar handle
 case nextChr of
 '\n' -> return s
 _ -> readHeaderLine' $ s ++ [chr, nextChr]
 _ -> readHeaderLine' $ s ++ [chr]

Next, your nextChr <- hGetChar handle ; case nextChr of limes appear twice, so you can extract a function and reduce code duplication.

Thanks to purity, you can do pretty mechanically:

foo handle quux bar baz = do
 chr <- hGetChar handle
 if chr == quux then bar else baz

Now replace the two fragments with calls to foo. Let's start with inner one:

readHeaderLine handle = readHeaderLine' "" where
 readHeaderLine' s = do
 chr <- hGetChar handle
 case chr of
 '\r' -> do
 foo handle '\n' (return s) (readHeaderLine' $ s ++ [chr, nextChr])
 _ -> readHeaderLine' $ s ++ [chr]

Heh, it didn't work because nextChr is only known inside foo. No problem, pass it as parameter to baz branch and use lambda to catch it:

readHeaderLine handle = readHeaderLine' "" where
 readHeaderLine' s = do
 chr <- hGetChar handle
 case chr of
 '\r' -> do
 foo handle '\n' (return s) (\nextChr -> readHeaderLine' $ s ++ [chr, nextChr])
 _ -> readHeaderLine' $ s ++ [chr] 
foo handle quux bar baz = do
 chr <- hGetChar handle
 if chr == quux then bar else baz chr

So now you can do the outer one.

readHeaderLine handle = readHeaderLine' "" where
 readHeaderLine' s = do
 foo handle '\r' (foo handle '\n' (return s) (\nextChr -> readHeaderLine' $ s ++ [chr, nextChr])) (\chr -> readHeaderLine' $ s ++ [chr])

No luck again as first chr is nowhere to get from. Fortunately we know it's always '\r', so

readHeaderLine handle = readHeaderLine' "" where
 readHeaderLine' s = do
 foo handle '\r' (foo handle '\n' (return s) (\nextChr -> readHeaderLine' $ s ++ ['\r', nextChr])) (\chr -> readHeaderLine' $ s ++ [chr])

As line got too long, we can split it, removing another redundant do:

readHeaderLine handle = readHeaderLine' "" where
 readHeaderLine' s = foo handle '\r' haveCR noCR where
 haveCR = foo handle '\n' (return s) haveCRnoLF
 noCR chr = readHeaderLine' $ s ++ [chr]
 haveCRnoLF nextChr = readHeaderLine' $ s ++ ['\r', nextChr]

Now there are more repeated patterns to eliminate: foo handle and readHeaderLine' $ s ++. To remove foo handle we move foo back in and remove its handle parameter both from applications and definition as it's now accessible from closure:

readHeaderLine handle = readHeaderLine' "" where
 readHeaderLine' s = foo '\r' haveCR noCR where
 haveCR = foo '\n' (return s) haveCRnoLF
 noCR chr = readHeaderLine' $ s ++ [chr]
 haveCRnoLF nextChr = readHeaderLine' $ s ++ ['\r', nextChr]
 foo quux bar baz = do
 chr <- hGetChar handle
 if chr == quux then bar else baz chr

To eliminate readHeaderline' repeated patterns we extract them into recurse local function:

readHeaderLine handle = readHeaderLine' "" where
 readHeaderLine' s = foo '\r' haveCR noCR where
 haveCR = foo '\n' (return s) haveCRnoLF
 noCR chr = recurse [chr]
 haveCRnoLF nextChr = recurse ['\r', nextChr]
 recurse x = readHeaderLine' $ s ++ x
 foo quux bar baz = do
 chr <- hGetChar handle
 if chr == quux then bar else baz chr

This is how far you can get with mechanical code deduplication. Now it's time for heavier weapons. You can still:

separate recursive code from non-recursive code
separate monadic code from non-monadic code

The initial redHeaderLine' call can be implemented using recurse with an extra parameter:

readHeaderLine handle = recurse [] [] where
 recurse s x = readHeaderLine' $ s ++ x 
 readHeaderLine' s = foo '\r' haveCR noCR where
 haveCR = foo '\n' (return s) haveCRnoLF
 noCR chr = recurse s [chr]
 haveCRnoLF nextChr = recurse s ['\r', nextChr]
 foo quux bar baz = do
 chr <- hGetChar handle
 if chr == quux then bar else baz chr

Now we can inline readHeaderLine' as it is only applied once:

readHeaderLine handle = recurse [] [] where
 recurse s1 x = foo '\r' haveCR noCR where
 s = s1 ++ x
 haveCR = foo '\n' (return s) haveCRnoLF
 noCR chr = recurse s [chr]
 haveCRnoLF nextChr = recurse s ['\r', nextChr]
 foo quux bar baz = do
 chr <- hGetChar handle
 if chr == quux then bar else baz chr

And we can remove duplication of recurse s:

readHeaderLine handle = recurse [] [] where
 recurse s1 x = foo '\r' haveCR noCR where
 s = s1 ++ x
 rf = recurse s
 haveCR = foo '\n' (return s) haveCRnoLF
 noCR chr = rf [chr]
 haveCRnoLF nextChr = rf ['\r', nextChr]
 foo quux bar baz = do
 chr <- hGetChar handle
 if chr == quux then bar else baz chr

Now let's put return value of recurse into a local declaration g:

readHeaderLine handle = recurse [] [] where
 recurse s1 x = g where
 s = s1 ++ x
 rf = recurse s
 g = foo '\r' haveCR noCR
 haveCR = foo '\n' (return s) haveCRnoLF
 noCR chr = rf [chr]
 haveCRnoLF nextChr = rf ['\r', nextChr]
 foo quux bar baz = do
 chr <- hGetChar handle
 if chr == quux then bar else baz chr

Our goal is to divorse g from recurse. You can do it by adding parameters to both recurse and g and localizing identifiers used only in recurse and used only in g:

readHeaderLine handle = recurse g [] [] where recurse g s1 x = g rf s where s = s1 ++ x rf = recurse g s

g rf s = foo '\r' haveCR noCR where
 haveCR = foo '\n' (return s) haveCRnoLF
 noCR chr = rf [chr]
 haveCRnoLF nextChr = rf ['\r', nextChr]
 foo quux bar baz = do
 chr <- hGetChar handle
 if chr == quux then bar else baz chr

Now recurse is completely self-contained:

readHeaderLine handle = recurse g [] [] where g rf s = foo '\r' haveCR noCR where haveCR = foo '\n' (return s) haveCRnoLF noCR chr = rf [chr] haveCRnoLF nextChr = rf ['\r', nextChr] foo quux bar baz = do chr <- hGetChar handle if chr == quux then bar else baz chr

recurse g s1 x = g rf s where s = s1 ++ x rf = recurse g s

But g is still recursive: it has a nasty rf parameter which is an indirect recursive application. We need to move rf into recurse too. So here comes a trick: convert a function call into a constructor.

g can have only 3 return values: return s, rf [chr] and rf ['\r', nextChr]. We can represent them with a data type and return it instead of calling return or rf:

data Outcomes a b c = RF1 a | RF2 b | Return c
readHeaderLine handle = recurse g [] [] where
 g rf s = foo '\r' haveCR noCR where
 haveCR = foo '\n' (return $ Return s) haveCRnoLF
 noCR chr = return $ RF1 [chr]
 haveCRnoLF nextChr = return $ RF2 ['\r', nextChr]
 foo quux bar baz = do
 chr <- hGetChar handle
 if chr == quux then bar else baz chr
recurse g s1 x = analyzeOutcomes $ g rf s where
 s = s1 ++ x
 rf = recurse g s
 analyzeOutcomes outcomeM = do
 outcome <- outcomeM
 case outcome of
 RF1 a -> rf a
 RF2 a -> rf a
 Return a -> return a

Now rf parameter is unused, so we can clean the definitions of g and recurse:

data Outcomes a b c = RF1 a | RF2 b | Return c

readHeaderLine handle = recurse g [] [] where
 g s = foo '\r' haveCR noCR where
 haveCR = foo '\n' (return $ Return s) haveCRnoLF
 noCR chr = return $ RF1 [chr]
 haveCRnoLF nextChr = return $ RF2 ['\r', nextChr]
 foo quux bar baz = do
 chr <- hGetChar handle
 if chr == quux then bar else baz chr
recurse g s1 x = analyzeOutcomes $ g s where
 s = s1 ++ x
 rf = recurse g s
 analyzeOutcomes outcomeM = do
 outcome <- outcomeM
 case outcome of
 RF1 a -> rf a
 RF2 a -> rf a
 Return a -> return a

Now two more improvements: a) RF1 and RF2 outcomes can be joined into one outcome as they are handled uniformly and have the same types; b) the only reason we pass s is to return it in Return outcome, so we can eliminate s argument of g too.

data Outcomes a = RF a | Return
readHeaderLine handle = recurse g [] [] where
 g = foo '\r' haveCR noCR where
 haveCR = foo '\n' (return Return) haveCRnoLF
 noCR chr = return $ RF [chr]
 haveCRnoLF nextChr = return $ RF ['\r', nextChr]
 foo quux bar baz = do
 chr <- hGetChar handle
 if chr == quux then bar else baz chr
recurse g s1 x = analyzeOutcomes g where
 s = s1 ++ x
 rf = recurse g s
 analyzeOutcomes outcomeM = do
 outcome <- outcomeM
 case outcome of
 RF a -> rf a
 Return -> return s

Now rf and analyzeOutcomes are used only once and Outcomes type became the same as Maybe. So:

readHeaderLine handle = recurse g [] [] where
 g = foo '\r' haveCR noCR where
 haveCR = foo '\n' (return Nothing) haveCRnoLF
 noCR chr = return $ Just [chr]
 haveCRnoLF nextChr = return $ Just ['\r', nextChr]
 foo quux bar baz = do
 chr <- hGetChar handle
 if chr == quux then bar else baz chr
recurse outcomeM s1 x = do
 outcome <- outcomeM
 let s = s1 ++ x in case outcome of
 Just a -> recurse outcomeM s a
 Nothing -> return s

Now s1 and x are only used in recurse to construct s. We can then construct s outside of recurse and pass it. Also, g now can be inlined.

readHeaderLine handle = recurse (foo '\r' haveCR noCR) [] where
 haveCR = foo '\n' (return Nothing) haveCRnoLF
 noCR chr = return $ Just [chr]
 haveCRnoLF nextChr = return $ Just ['\r', nextChr]
 foo quux bar baz = do
 chr <- hGetChar handle
 if chr == quux then bar else baz chr
recurse outcomeM s = do
 outcome <- outcomeM
 case outcome of
 Just a -> recurse outcomeM (s ++ a)
 Nothing -> return s

Now note that outcomeM is just a constant and it is not changed across recursive calls. So we can proceed further with our splitting of recursive and non-recursive code:

recurse outcomeM s = f s where
 f s = do
 outcome <- outcomeM
 case outcome of
 Just a -> f (s ++ a)
 Nothing -> return s

And duplicate return $ Just can be moved inside foo:

readHeaderLine handle = recurse (foo '\r' haveCR noCR) [] where
 haveCR = foo '\n' (return Nothing) haveCRnoLF
 noCR chr = [chr]
 haveCRnoLF nextChr = ['\r', nextChr]
 foo quux bar baz = do
 chr <- hGetChar handle
 if chr == quux then bar else return (Just $ baz chr)

After renaming of nonsense identifiers in definition of foo we get:

readHeaderLine handle = recurse (match '\r' haveCR noCR) [] where
 haveCR = match '\n' (return Nothing) haveCRnoLF
 noCR chr = [chr]
 haveCRnoLF nextChr = ['\r', nextChr]
 match expectedChar onSuccess onFailure = do
 actualChar <- hGetChar handle
 if actualChar == expectedChar
 then onSuccess 
 else return (Just $ onFailure actualChar)
recurse outcomeM s = f s where
 f s = do
 outcome <- outcomeM
 case outcome of
 Just a -> f (s ++ a)
 Nothing -> return s

Question 9

Perhaps replace the expression recurse (match '\r' haveCR noCR) [] with unfoldM (match '\r' haveCR noCR), where unfoldM is in the monad-loops package.

Question 10

You're trying to read from stdin until "\r\n" is encountered? You could just recursively call hGetChar until you find the right sequence (the buffer is maintained in reverse order and reversed at the end for efficiency reasons and to allow nice pattern-matching):

readHeaderLine :: Handle -> IO String
readHeaderLine h = reverse <$> go ""
 where go :: String -> IO String
 go ('\n':'\r':s) = return s
 go xs = do
 ch <- hGetChar h
 go (ch:xs)

Question 11

I changed go s@('\n':'\r':_) = return s into go s@('\n':'\r':s') = return s' so that it wouldn’t include the \r\n in the result.

Question 12

@RadekSlupik: Ah, good point - I overlooked that the EOL marker is not included in the output. I'll adjust my answer accordingly (you don't need the s@ part anymore in that case).

Question 13

Note as Frerich mentioned that one serious antipattern in that code is the use of ++ to build up a list element by element. That's bad because each time you add an element, ++ traverses the whole list so far, which means building an n-element list takes O(n**2) operations. It's better to use the ByteString snoc function, or use a difference list, or build up the return string in reverse order and then reverse it.

As for recognizing the \r\n, it's pretty easy to code a straightforward state machine, using a datatype to remember what's been seen so far. In the code below I build up the reversed output and then reverse it, and introduce a newtype for reversed lists to keep track of when a list hasn't been reversed yet. "snoc" inserts a new element at the front of a reversed list, which means it will be at the end of the unreversed list. "snoc" is "cons" spelled backwards and "cons" is the traditional name (from Lisp) for putting a new element at the front of a linked list.

import System.IO
-- Seen is the datatype for the current state
-- we don't call it State because that's a commonly used
-- library module
data Seen = SeenNothing | SeenCR
type HostName = String
handleRequest :: HostName -> Handle -> IO ()
handleRequest host handle = do
 requestLine <- readHeaderLine handle
 putStrLn $ requestLine ++ "\n-------------------"
-- below is a datatype we'll use for reversed lists
newtype Reversed a = Reversed a
snoc :: Reversed [a] -> a -> Reversed [a]
snoc (Reversed xs) x = Reversed (x:xs)
-- turn a reversed list back to a normal one
unreverse :: Reversed [a] -> [a]
unreverse (Reversed xs) = reverse xs
readHeaderLine h = do
 r <- f h SeenNothing (Reversed "")
 return $ unreverse r
-- this helper function does the work of building the reversed string
f :: Handle -> Seen -> Reversed String -> IO (Reversed String)
f h state result = do
 c <- hGetChar h
 let r = snoc result c
 go :: Seen -> IO (Reversed String)
 go state = f h state r
 case (state,c) of
 (SeenCR,'\n') -> return r
 (_, '\r') -> go SeenCR
 otherwise -> go SeenNothing

Dan Burton Dan BurtonDan Burton 7051 gold badge5 silver badges9 bronze badges · Accepted Answer · 2012-09-20 01:27:02Z

You can use hGetContents to read lazily from a handle, and the split package (which will hopefully be bundled into the next Haskell Platform release) provides some nice ways of dealing with lists.

import Data.List.Split (splitOn)
import System.IO (hGetContents)
... do
 ...
 contentLines <- splitOn "\r\n" <$> hGetContents handle

contentLines will contain a lazy list of the "entire contents" of that handle, split into chunks that were originally separated by \r\n.

Another approach is to use something like the conduit package. See, for example, Data.Conduit.Binary.lines. Keep an eye on conduit & friends; I get the feeling that within the next year or so the Haskell community will start to agree on the "best" implementation of this sort of abstraction, and then some good tutorials will inevitably follow.

Stack Exchange Network

How to shorten this terrible HTTP header parser?

4 Answers 4

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

How to shorten this terrible HTTP header parser?

4 Answers 4

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions