I wrote a function that recursively walks a given directory.
module WalkDir (walkDir) where
import System.Directory (doesDirectoryExist, getDirectoryContents)
import System.FilePath ((</>))
walkDir :: FilePath -> IO [FilePath]
walkDir r = contents >>= fmap concat . traverse helper
where contents = fmap (r </>) . filter ((&&) . (/=) "." <*> (/=) "..") <$> getDirectoryContents r
helper x = do e <- doesDirectoryExist x
if e then walkDir x else return [x]
However, I have several concerns with this function. For starters, it is slow and it does not print out results until they have all been collected (not as lazy as it should be). My best guess is that this is because of the constant concatenations.
Additionally, the use of do-notation
in helper seems clunky. This is where I'd love it if if
were just a function because I could just use >>=
with no do
required. Alternatively if there were a GHC extension equivalent to LambdaCase for if statements that would also work.
2 Answers 2
As you noticed, this function is slow because it collects all the results before it starts printing them. To circumvent that problem, you need to interleave the collection of information and its printing.
A good way of doing that whilst keeping a compositional approach to solving the problem is to introduce a datatype reifying the structure of walkDir
's call graph. Instead of sequencing all IO
actions and getting a list of FilePath
s back, you'd build a tree describing the computation (RTree
for Rose Tree and T
for Transformer as it takes an m
):
data RTreeT m a = Node a [m (RTreeT m a)]
You can now write walkDir'
describing your strategy to explore the directories on your filesystem: return the files present in the current directory immediately and then explore the subdirectories one after the other.
walkDir' :: FilePath -> IO (RTreeT IO [FilePath])
walkDir' r = do
contents <- fmap (r </>) . exceptLocal <$> getDirectoryContents r
(files, dirs) <- filesAndDirs contents
return $ Node files $ fmap walkDir' dirs
where filesAndDirs
partitions a list of FilePath
depending on whether they are files or directories (using tagDirectories
to perform that test).
tagDirectories :: [FilePath] -> IO [(FilePath, Bool)]
tagDirectories = mapM (\ x -> (x,) <$> doesDirectoryExist x)
filesAndDirs :: [FilePath] -> IO ([FilePath], [FilePath])
filesAndDirs c = bimap (fmap fst) . partition (not . snd) <$> tagDirectories c
where bimap f (a, b) = (f a, f b)
and exceptLocal
is the filter you had in your original code snippet:
exceptLocal :: [FilePath] -> [FilePath]
exceptLocal = filter ((&&) . (/=) "." <*> (/=) "..")
You now have an RTreeT IO [FilePath]
and you can described a strategy to print it which will interleave printing some of the content and running some of the remaining IO
actions:
printRTreeT :: Show a => RTreeT IO a -> IO ()
printRTreeT (Node a mts) = print a >> mapM_ (printRTreeT =<<) mts
Of course, this is a rather crude printing function (e.g. you will notice quite a few empty lists if you have empty subdirectories) but it gives you an idea of how to proceed from there on.
If this is still slow, you may want to play the same sort of trick on filesAndDirs
: rather than sequencing all tests in one go, you could want to have a structure allowing you to only deal with one FilePath
at a time.
-
\$\begingroup\$ Why not make the declaration
data RTree a = Node a [Rtree a]
and change and simply just add the linedirs' <- traverse walkDir dirs
wowalkDir
? \$\endgroup\$chad– chad2016年02月17日 01:20:35 +00:00Commented Feb 17, 2016 at 1:20 -
\$\begingroup\$ Because then you'd be scheduling all the
IO
actions at once thus having a slow program again. \$\endgroup\$gallais– gallais2016年02月17日 05:57:43 +00:00Commented Feb 17, 2016 at 5:57
With LambdaCase extension it is possible to write helper
without do
(but it does not seem much more readable):
helper x = doesDirectoryExist x >>= \case
True -> walkDir x
False -> return [x]
There is listDirectory
function in recent directory
package which can spare you a check for .
and ..
.
listDirectory dir returns a list of all entries in dir without the special entries (. and ..).
It is possible to create cyclic directory structure with symbolic links, so it may be reasonable not to traverse them. E.g. you can use getSymbolicLinkStatus
from unix
package to traverse only real directories:
helper x = getSymbolicLinkStatus x >>= \case
st | isDirectory st -> walkDir x
_ -> return [x]
As lazy IO is considered deprecated, it is better to use iteratees/conduits/pipes to work with IO effectively and in compositional style. Here is an example using pipes:
{-# LANGUAGE LambdaCase #-}
import Pipes
import Pipes.Prelude (stdoutLn)
import System.Directory (listDirectory)
import System.FilePath ((</>))
import System.Posix.Files (getSymbolicLinkStatus, isDirectory)
walkDir :: FilePath -> Producer FilePath IO ()
walkDir path
= lift (getSymbolicLinkStatus path)
>>= \case
st | not $ isDirectory st -> yield path
_ -> lift (listDirectory path) >>= mapM_ (walkDir . (path </>))
Check it with: runEffect $ walkDir "/" >-> stdoutLn
if
with a bind directly, you could just do\e -> if e then walkDir x else return [x]
. That's what thedo
notation desugars to anyway. \$\endgroup\$ifFunction x y z = if x then y else z
. \$\endgroup\$