Haskell grep simplification

Question 1

I am new to Haskell, and as a first project I chose to write a simple grep. Is there a simpler or shorter way to write it? For example, is there any way to avoid the explicit recursion?

import Control.Monad (when)
import Data.List (isInfixOf)
import System.Directory (doesFileExist)
import System.Environment (getArgs)

parseLines :: String -> [String] -> Int -> IO ()
parseLines _ [] _ = return ()
parseLines pattern (x:xs) line = do
  when (isInfixOf pattern x) $ putStrLn $ show line ++ ": " ++ x
  parseLines pattern xs (line + 1)

processFile :: String -> String -> IO ()
processFile _ [] = return ()
processFile pattern file = do
  exists <- doesFileExist file
  if not exists
    then putStrLn $ file ++ ": file does not exist"
    else do
      putStrLn file
      content <- readFile file
      parseLines pattern (lines content) 0

processFiles :: String -> [String] -> IO ()
processFiles _ [] = return ()
processFiles pattern (x:xs) = do
  processFile pattern x
  processFiles pattern xs

main = do
  args <- getArgs
  processFiles (head args) (tail args)

Comment 1

Since you're actually not doing any regexps, I believe this can hardly be called a "grep". This is just some searching utility.

Comment 2

Regexps are the next step.

Answer 1

I see these areas where your code could be improved:

1. processFiles can be expressed very simply using mapM_ from Control.Monad:

processFiles :: String -> [String] -> IO ()
processFiles pattern = mapM_ (processFile pattern)

2. All your functions are in the IO monad. This goes a bit against Haskell's philosophy of keeping side effects to a minimum.
3. parseLines requires the whole file to be read into memory. This could be solved by using lazy IO, but I'd strongly discourage you from doing so.

One way to address points 2 and 3 is to use conduits. This may seem like a somewhat complex subject, but the idea is actually very intuitive. A conduit is something that reads input and produces output, using some particular monad. This lets you break your program into very small, reusable components, each doing a single task, which makes the program easier to debug, test and maintain.

For example, your code could be refactored as follows. (First some required imports.)

import Control.Monad
import Control.Monad.IO.Class
import Data.ByteString (unpack)
import Data.Conduit
import qualified Data.Conduit.Binary as C
import qualified Data.Conduit.List as C
import Data.List (isInfixOf)
import System.Environment (getArgs)
import System.Directory (doesFileExist)
import System.IO

sourceFileLines :: (MonadResource m) => FilePath -> Source m String
sourceFileLines file = bracketP (openFile file ReadMode) hClose loop
  where
    loop h = do
      eof <- liftIO (hIsEOF h)
      unless eof (liftIO (hGetLine h) >>= yield >> loop h)

This function takes a file name and creates a Source - a conduit that takes no input, but produces output. It reads a file line by line and sends each line down the pipeline using yield. Using bracketP we ensure that the file will get closed no matter what happens to the pipeline.
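For comparison, the same read-one-line-at-a-time loop can be sketched with base alone, where withFile plays the role that bracketP plays above: it closes the handle when the action finishes, including when it exits with an exception. This is only an illustration and not part of the conduit version; forEachLine and grepFile are hypothetical names, and the numbering is assumed to start at 1.

```haskell
import Control.Monad (unless, when)
import Data.List (isInfixOf)
import System.IO

-- Hypothetical helper: runs `act` on every line of the file together with
-- its 1-based line number, holding only one line in memory at a time.
forEachLine :: FilePath -> (Int -> String -> IO ()) -> IO ()
forEachLine file act = withFile file ReadMode (loop 1)
  where
    loop n h = do
      eof <- hIsEOF h
      unless eof $ do
        line <- hGetLine h
        act n line
        loop (n + 1) h

-- Hypothetical usage: print every line that contains the search string.
grepFile :: String -> FilePath -> IO ()
grepFile pat file = forEachLine file $ \n line ->
  when (pat `isInfixOf` line) $ putStrLn (show n ++ ": " ++ line)
```

Unlike the conduit version, this couples reading, matching and printing through a callback, so the pieces are harder to reuse and test separately.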

numLines :: (Monad m) => Conduit a m (Int, a)
numLines = C.scanl step 1
  where
    step x n = (n + 1, (n, x))

This component, built using scanl, is very simple. It passes each input through to the output, paired with a running count. Notice that this conduit doesn't need any IO; it works with any monad.
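To see what the step function does without any conduit machinery, here is a rough list analogue using mapAccumL from Data.List. The name numberFrom is made up for this sketch; the accumulator is the next number to hand out, just like the state threaded through the conduit.

```haskell
import Data.List (mapAccumL)

-- Hypothetical list analogue of numLines: pairs each element with a
-- running counter that starts at `start`.
numberFrom :: Int -> [a] -> [(Int, a)]
numberFrom start = snd . mapAccumL step start
  where
    -- mapAccumL passes the accumulator first and the element second,
    -- the reverse of the argument order of `step` in numLines.
    -- The result is (new accumulator, emitted value).
    step n x = (n + 1, (n, x))
```

For instance, numberFrom 1 "abc" evaluates to [(1,'a'),(2,'b'),(3,'c')].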

Now it's easy to filter a stream of numbered lines with a pattern:

parseLines :: (Monad m) => String -> Conduit String m (Int, String)
parseLines pattern = numLines =$= C.filter f
  where
    f (_, x) = isInfixOf pattern x

This function fuses two conduits together. The first one numbers lines, the second filters them according to the pattern.
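The order of the two stages matters: numbering has to happen before filtering, otherwise the numbers would count matches rather than positions in the file. A small list-based sketch of the difference (both function names are invented for this illustration):

```haskell
import Data.List (isInfixOf)

-- Number first, then filter: the numbers are positions in the original input.
numberThenFilter :: String -> [String] -> [(Int, String)]
numberThenFilter pat = filter (isInfixOf pat . snd) . zip [1 ..]

-- Filter first, then number: the numbers only count the matching lines.
filterThenNumber :: String -> [String] -> [(Int, String)]
filterThenNumber pat = zip [1 ..] . filter (isInfixOf pat)
```

With the pattern "ab" and the input ["xab", "no", "abc"], numberThenFilter gives [(1,"xab"),(3,"abc")], while filterThenNumber gives [(1,"xab"),(2,"abc")]. The first is the behaviour a grep-like tool needs, and it corresponds to the order used in parseLines.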

printMatch :: (MonadIO m) => Sink (Int, String) m ()
printMatch = C.mapM_ (\(n, x) -> liftIO $ putStrLn $ show n ++ ": " ++ x)

In printMatch we separate the logic that prints out the output. For each pair it receives it prints the line number and its content.

Combining and running these conduits is then easy:

runResourceT $ sourceFileLines file $= parseLines pattern $$ printMatch

(runResourceT is needed because of bracketP.) So the rest of the program would look like

processFile :: String -> String -> IO ()
processFile _ [] = return ()
processFile pattern file = do
  exists <- doesFileExist file
  if not exists
    then putStrLn $ file ++ ": file does not exist"
    else do
      putStrLn file
      runResourceT $ sourceFileLines file $= parseLines pattern $$ printMatch

processFiles :: String -> [String] -> IO ()
processFiles pattern = mapM_ (processFile pattern)

main = do
  args <- getArgs
  processFiles (head args) (tail args)

Answer 2

Here's the updated code. Notes follow below.

import Control.Monad
import Data.List
import System.Directory
import System.Environment

search :: String -> String -> [(Int, String)]
search searchString content = do
  (lineNumber, lineText) <- zip [0..] $ lines content
  if isInfixOf searchString lineText
    then return (lineNumber, lineText)
    else mzero

processFile :: String -> String -> IO ()
processFile searchString file = do
  exists <- doesFileExist file
  if not exists
    then error $ file ++ ": file does not exist"
    else do
      putStrLn file
      content <- readFile file
      forM_ (search searchString content) $ \(lineNumber, lineText) -> do
        putStrLn $ show lineNumber ++ ": " ++ lineText

main :: IO ()
main = do
  args <- getArgs
  case args of
    searchString : fileNames | not $ null fileNames -> do
      forM_ fileNames $ processFile searchString
    _ -> error "Not enough arguments"

Notes:

In Haskell it's idiomatic to isolate pure code from IO interactions as much as possible. So first we isolate the search function, gathering most of the non-IO logic in it. Its implementation uses the Monad and MonadPlus instances for lists, so don't be surprised by do-notation in a non-IO context. Alternatively, list comprehension syntax could be used, but I'm just not a fan of it. This could also be solved using map and filter and whatnot.
forM_ lets us loop in a monad without explicit recursion.
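For readers who prefer the alternatives mentioned in the first note, here is a sketch of search written as a list comprehension and as a filter pipeline. The names searchLC and searchFilter are made up for this comparison; both keep the 0-based numbering of the code above.

```haskell
import Data.List (isInfixOf)

-- List comprehension: the generator numbers the lines, the guard keeps matches.
searchLC :: String -> String -> [(Int, String)]
searchLC searchString content =
  [ (lineNumber, lineText)
  | (lineNumber, lineText) <- zip [0 ..] (lines content)
  , searchString `isInfixOf` lineText
  ]

-- Point-free pipeline: split into lines, number them, keep the matches.
searchFilter :: String -> String -> [(Int, String)]
searchFilter searchString =
  filter (isInfixOf searchString . snd) . zip [0 ..] . lines
```

Both return [(0,"foo"),(2,"food")] for the search string "foo" and the content "foo\nbar\nfood\n". Which form reads best is largely a matter of taste.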

Accepted answer (the conduit-based one above) by Petr · 2013-09-25 20:06:35Z
