Walking a directory in Haskell

Question 1

I wrote a function that recursively walks a given directory.

module WalkDir (walkDir) where
import System.Directory (doesDirectoryExist, getDirectoryContents)
import System.FilePath ((</>))
walkDir :: FilePath -> IO [FilePath]
walkDir r = contents >>= fmap concat . traverse helper
 where contents = fmap (r </>) . filter ((&&) . (/=) "." <*> (/=) "..") <$> getDirectoryContents r
 helper x = do e <- doesDirectoryExist x
 if e then walkDir x else return [x]

However, I have several concerns with this function. For starters, it is slow and it does not print out results until they have all been collected (not as lazy as it should be). My best guess is that this is because of the constant concatenations.

Additionally, the use of do-notation in helper seems clunky. This is where I'd love it if if were just a function because I could just use >>= with no do required. Alternatively if there were a GHC extension equivalent to LambdaCase for if statements that would also work.

Question 2

Regarding using if with a bind directly, you could just do \e -> if e then walkDir x else return [x]. That's what the do notation desugars to anyway.

Question 3

Yes I realize that is how do desugars. While I generally try to avoid do notation if I can, I get to avoid lambdas even more

Question 4

Why would you avoid either? Also, if you really want an if function in Haskell, you can do it, since Haskell is lazy. ifFunction x y z = if x then y else z.

Question 5

I don't "avoid" them I guess. I find binds more elegant looking than do notation and pointfree style better looking than lambdas, generally. That said, if the code can be expressed more simply or elegantly with do notation and lambdas, I'll use them.

Question 6

As you noticed, this function is slow because it collects all the results before it starts printing them. To circumvent that problem, you need to interleave the collection of information and its printing.

A good way of doing that whilst keeping a compositional approach to solving the problem is to introduce a datatype reifying the structure of walkDir's call graph. Instead of sequencing all IO actions and getting a list of FilePaths back, you'd build a tree describing the computation (RTree for Rose Tree and T for Transformer as it takes an m):

data RTreeT m a = Node a [m (RTreeT m a)]

You can now write walkDir' describing your strategy to explore the directories on your filesystem: return the files present in the current directory immediately and then explore the subdirectories one after the other.

walkDir' :: FilePath -> IO (RTreeT IO [FilePath])
walkDir' r = do
 contents <- fmap (r </>) . exceptLocal <$> getDirectoryContents r
 (files, dirs) <- filesAndDirs contents
 return $ Node files $ fmap walkDir' dirs

where filesAndDirs partitions a list of FilePath depending on whether they are files or directories (using tagDirectories to perform that test).

 tagDirectories :: [FilePath] -> IO [(FilePath, Bool)]
 tagDirectories = mapM (\ x -> (x,) <$> doesDirectoryExist x)
 filesAndDirs :: [FilePath] -> IO ([FilePath], [FilePath])
 filesAndDirs c = bimap (fmap fst) . partition (not . snd) <$> tagDirectories c
 where bimap f (a, b) = (f a, f b)

and exceptLocal is the filter you had in your original code snippet:

 exceptLocal :: [FilePath] -> [FilePath]
 exceptLocal = filter ((&&) . (/=) "." <*> (/=) "..")

You now have an RTreeT IO [FilePath] and you can described a strategy to print it which will interleave printing some of the content and running some of the remaining IO actions:

printRTreeT :: Show a => RTreeT IO a -> IO ()
printRTreeT (Node a mts) = print a >> mapM_ (printRTreeT =<<) mts

Of course, this is a rather crude printing function (e.g. you will notice quite a few empty lists if you have empty subdirectories) but it gives you an idea of how to proceed from there on.

If this is still slow, you may want to play the same sort of trick on filesAndDirs: rather than sequencing all tests in one go, you could want to have a structure allowing you to only deal with one FilePath at a time.

Question 7

Why not make the declaration data RTree a = Node a [Rtree a] and change and simply just add the line dirs' <- traverse walkDir dirs wo walkDir?

Question 8

Because then you'd be scheduling all the IO actions at once thus having a slow program again.

Question 9

With LambdaCase extension it is possible to write helper without do (but it does not seem much more readable):

helper x = doesDirectoryExist x >>= \case
 True -> walkDir x
 False -> return [x]

There is listDirectory function in recent directory package which can spare you a check for . and ...

listDirectory dir returns a list of all entries in dir without the special entries (. and ..).

It is possible to create cyclic directory structure with symbolic links, so it may be reasonable not to traverse them. E.g. you can use getSymbolicLinkStatus from unix package to traverse only real directories:

helper x = getSymbolicLinkStatus x >>= \case
 st | isDirectory st -> walkDir x
 _ -> return [x]

As lazy IO is considered deprecated, it is better to use iteratees/conduits/pipes to work with IO effectively and in compositional style. Here is an example using pipes:

{-# LANGUAGE LambdaCase #-}
import Pipes
import Pipes.Prelude (stdoutLn)
import System.Directory (listDirectory)
import System.FilePath ((</>))
import System.Posix.Files (getSymbolicLinkStatus, isDirectory)
walkDir :: FilePath -> Producer FilePath IO ()
walkDir path
 = lift (getSymbolicLinkStatus path)
 >>= \case
 st | not $ isDirectory st -> yield path
 _ -> lift (listDirectory path) >>= mapM_ (walkDir . (path </>))

Check it with: runEffect $ walkDir "/" >-> stdoutLn

gallais gallais 7814 silver badges12 bronze badges · Accepted Answer · 2016-02-15 14:27:54Z

As you noticed, this function is slow because it collects all the results before it starts printing them. To circumvent that problem, you need to interleave the collection of information and its printing.

A good way of doing that whilst keeping a compositional approach to solving the problem is to introduce a datatype reifying the structure of walkDir's call graph. Instead of sequencing all IO actions and getting a list of FilePaths back, you'd build a tree describing the computation (RTree for Rose Tree and T for Transformer as it takes an m):

data RTreeT m a = Node a [m (RTreeT m a)]

You can now write walkDir' describing your strategy to explore the directories on your filesystem: return the files present in the current directory immediately and then explore the subdirectories one after the other.

walkDir' :: FilePath -> IO (RTreeT IO [FilePath])
walkDir' r = do
 contents <- fmap (r </>) . exceptLocal <$> getDirectoryContents r
 (files, dirs) <- filesAndDirs contents
 return $ Node files $ fmap walkDir' dirs

where filesAndDirs partitions a list of FilePath depending on whether they are files or directories (using tagDirectories to perform that test).

 tagDirectories :: [FilePath] -> IO [(FilePath, Bool)]
 tagDirectories = mapM (\ x -> (x,) <$> doesDirectoryExist x)
 filesAndDirs :: [FilePath] -> IO ([FilePath], [FilePath])
 filesAndDirs c = bimap (fmap fst) . partition (not . snd) <$> tagDirectories c
 where bimap f (a, b) = (f a, f b)

and exceptLocal is the filter you had in your original code snippet:

 exceptLocal :: [FilePath] -> [FilePath]
 exceptLocal = filter ((&&) . (/=) "." <*> (/=) "..")

You now have an RTreeT IO [FilePath] and you can described a strategy to print it which will interleave printing some of the content and running some of the remaining IO actions:

printRTreeT :: Show a => RTreeT IO a -> IO ()
printRTreeT (Node a mts) = print a >> mapM_ (printRTreeT =<<) mts

Of course, this is a rather crude printing function (e.g. you will notice quite a few empty lists if you have empty subdirectories) but it gives you an idea of how to proceed from there on.

If this is still slow, you may want to play the same sort of trick on filesAndDirs: rather than sequencing all tests in one go, you could want to have a structure allowing you to only deal with one FilePath at a time.

Why not make the declaration data RTree a = Node a [Rtree a] and change and simply just add the line dirs' <- traverse walkDir dirs wo walkDir?
Because then you'd be scheduling all the IO actions at once thus having a slow program again.

Stack Exchange Network

Walking a directory in Haskell

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Walking a directory in Haskell

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions