3
\$\begingroup\$

I've written a parser with the help of some experienced Haskellers. The parser is not bad, but the same kind of parser written in PHP7 with regular expressions is much faster (takes 25% of the time that takes the Haskell one).

This is a subset of the file I want to parse: lpaste link

The main objective of the parser is to get three kinds of errors:

1) Error with backtrace (multiple lines, the first of them is like 3))
2) Single error with one more line
3) Single line error

Here is my current code:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import Data.ByteString.Char8 as B8 hiding (lines, filter, unlines, head, readFile, take, length,
 putStrLn, tail, map, concat, or, writeFile, intersperse,
 groupBy, hGetContents)
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TIO
import Data.Attoparsec.Text.Lazy hiding (take)
import Control.Applicative
import Control.Monad (replicateM, mapM, forM, forM_)
import Data.Either (either)
import Data.List (intersperse, groupBy)
import System.Environment
import qualified System.IO as SIO
data TimeStamp = MkTimeStamp T.Text
 deriving Show
data LogFileInfo = BackTraceLineInfo T.Text
 | BackTraceInfo TimeStamp T.Text T.Text [LogFileInfo]
 | Error TimeStamp T.Text
 | LargeError TimeStamp T.Text T.Text
 deriving Show
data LineType = SingleLineError TimeStamp T.Text
 | DirectoryInfo T.Text
 | ErrorInfo T.Text
 | LineBackTraceInfo T.Text
 | BackTraceString T.Text
 | BackTraceLine T.Text
 deriving Show
parseTimeStamp :: Parser TimeStamp
parseTimeStamp = do
 year <- many digit
 char '-'
 month <- many digit
 char '-'
 day <- many digit
 char ' '
 hour <- many digit
 char ':'
 minute <- many digit
 char ':'
 second <- many digit
 char ' '
 (return . MkTimeStamp) $ T.pack $ year ++ "-" ++ month ++ "-" ++ day ++ " " ++ hour ++ ":" ++ minute ++ ":" ++ second
parseError :: Parser LineType
parseError = do
 string $ "ERROR - "
 timeStamp <- parseTimeStamp
 errorInfo <- parseAnyLine
 return $ SingleLineError timeStamp errorInfo
parseDirectoryInfo :: Parser LineType
parseDirectoryInfo = do
 char '/'
 directoryInfo <- parseAnyLine
 (return . DirectoryInfo) $ T.append "/" directoryInfo
parseErrorInfo :: Parser LineType
parseErrorInfo = do
 errorInfo <- parseAnyLine
 (return . ErrorInfo) errorInfo
parseBackTraceString :: Parser LineType
parseBackTraceString = do
 let backTraceStr = " Backtrace: "
 string backTraceStr
 return $ BackTraceString $ T.fromStrict backTraceStr
parseBacktraceLine :: Parser LineType
parseBacktraceLine = do
 char '#'
 number <- many1 digit
 backTraceInfo <- parseAnyLine
 let numberPart = '#' : number
 return $ LineBackTraceInfo $ T.append (T.pack numberPart) backTraceInfo
parseAnyLine :: Parser T.Text
parseAnyLine = do
 lineStr <- many anyChar
 return $ T.pack lineStr
-- Skips n lines for allowing other parsers to succeed
skipNLines n = replicateM n $ manyTill anyChar endOfLine
-- performParser :: Parser a -> T.Text -> BackTraceInfo
performParser = parseOnly
getEitherRight :: Either a b -> b
getEitherRight (Right b) = b
-- try no sirve con attoparsec
parseLogLine :: Parser LineType
parseLogLine = parseError
 <|> parseDirectoryInfo
 <|> parseBacktraceLine
 <|> parseBackTraceString
 <|> parseErrorInfo
main = do
 (fileName : _) <- getArgs
 h <- SIO.openFile fileName SIO.ReadMode
 SIO.hSetEncoding h SIO.latin1
 fileContents <- SIO.hGetContents h
 let titleLength = length fileName
 titleWithoutExtension = take (titleLength - 4) fileName
 allNonEmptyLines = map (T.pack) $ tail $ filter (/= "") $ lines fileContents -- [T.Text]
 stringList = fmap (\x -> case eitherResult (parse parseLogLine x) of
 Left e -> return $ show e
 Right a -> return $ show a) allNonEmptyLines
 -- TODO: list processing
 h <- SIO.openFile "errorSummary.txt" SIO.WriteMode
 forM_ stringList (\x -> hPutStrLn h $ B8.pack $ unlines x)
 SIO.hClose h

Is there a way to improve the performance of my code?? (preferably with attoparsec, but regular expressions are allowed).

asked Nov 10, 2015 at 14:17
\$\endgroup\$
0

1 Answer 1

1
\$\begingroup\$

Here are some points:

  • You are reading file as String and then turn it into Text. Reading file directly into Text using Data.Text.IO module is much faster. If your logs contain only ASCII characters consider using ByteString instead of Text.
  • You are actually parsing file twice: first when filtering out empty lines and second in attoparsec parser itself. Handling empty lines withing attoparsec should give you some improvement.
  • Attoparsec documentation advises using Text-oriented parsers whenever possible. For you this means replacing

    year <- many digit
    char '-'
    

    with takeTill (=='-'), for example. This would also eliminate packing and concatenation of Strings.

Finally, you can play around with GHC's profiler to see if it can give you useful insights on where performance bottlenecks are.

answered Nov 17, 2015 at 9:10
\$\endgroup\$
2
  • \$\begingroup\$ I will wait to implement your changes and see the results and then I may accept your answer. Sorry but I'm very busy in my job. \$\endgroup\$ Commented Nov 26, 2015 at 17:46
  • \$\begingroup\$ The main problem was the function parseAnyLine, which I've changed to takeText, my old code takes 155 secs and the new one 34 secs, so I think is a incredible speed up :D, thanks bro. \$\endgroup\$ Commented Dec 12, 2015 at 14:47

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.