Loading Files from a Directory in Haskell

Question 1

I'm a very new Haskell programmer, and I've written the following code for loading TSV files. I think it can be improved, but I'm not sure how.

I feel like the giant "do" block in loadData is inelegant, and that there's likely a better way to approach this.
I think I'm trying to avoid the IO monad because I'm not sure how to work with it, but that there's probably a better way to handle mapping parseTSV over the contents of the files.

One note: I'm not super concerned about performance - this will run once at the beginning of my program. I want to load in all the files entirely to build a composite data structure from their contents.

module LanguageMachine (loadData) where
import System.Directory (listDirectory)
import Data.List ((\\), elemIndex)
import Data.Maybe (mapMaybe)
parseTsv :: String -> [(String, Int)]
parseTsv contents = mapMaybe parseLine (lines contents)
parseLine :: String -> Maybe (String, Int)
parseLine line =
 case elemIndex '\t' line of
 Just i ->
 let
 (word, count) = splitAt i line
 in
 Just (word, read count :: Int)
 Nothing -> Nothing
loadData :: FilePath -> [FilePath] -> IO [(String, [(String, Int)])]
loadData path exclude = do
 files <- listDirectory path
 let filtered = files \\ exclude
 let prefixed = map ((path ++ "/") ++) filtered
 contents <- traverse readFile prefixed
 let results = map parseTsv contents
 return $ zip filtered results

The files look like this, which are two-value TSV lines of a word and then the number of occurrences of that word:

ARIA 4308
ORE 4208

Thank you!

Question 2

The loadData function looks just fine to me. I don't see a way it could be any more elegant given what it's doing. I do actually like the way you have separated reading the file and parsing it. That's the way to go.

Your concern about "avoiding the IO monad" shouldn't be a concern. On the contrary, that's actually a "best practice", known as "pushing I/O to the boundary". It means that the majority of your program should be pure, and only a thin shell should deal with external world. This way you can easily test and debug the majority of your program.

The only thing I might do is make the last three lines one expression:

traverse readFile prefixed
 <&> map parseTsv
 <&> zip filtered

I find this a bit more readable, but that's just a matter of personal taste.

But the parseLine function could sure use some improvement.

First, the control flow. Look what it's doing: compute a Maybe value, then if it's Just, compute another Maybe value from its contents. That's exactly what "bind" does! This means it can be nicely represented as a do block.

Second, read is a partial function. This means it can fail at runtime if there is a non-number in the file. I understand this probably Should Never HappenTM, but you'd be surprised how many things that "should never happen" do, in fact, happen all the time. :-)

But since you're already in the Maybe context, it would be easy to use readMaybe instead, and skip the malformed numbers.

Applying both points:

parseLine line = do
 i <- elemIndex '\t' line
 let (word, strCount) = splitAt i line
 count <- readMaybe strCount
 return (word, count)

Question 3

Thank you! The <&> operator is very handy. I'm still getting comfortable thinking in monads, so your feedback is very helpful.

Question 4

Glad I could help!

Fyodor Soikin Fyodor Soikin 7393 silver badges7 bronze badges · Accepted Answer · 2020-03-29 15:32:46Z

The loadData function looks just fine to me. I don't see a way it could be any more elegant given what it's doing. I do actually like the way you have separated reading the file and parsing it. That's the way to go.

Your concern about "avoiding the IO monad" shouldn't be a concern. On the contrary, that's actually a "best practice", known as "pushing I/O to the boundary". It means that the majority of your program should be pure, and only a thin shell should deal with external world. This way you can easily test and debug the majority of your program.

The only thing I might do is make the last three lines one expression:

traverse readFile prefixed
 <&> map parseTsv
 <&> zip filtered

I find this a bit more readable, but that's just a matter of personal taste.

But the parseLine function could sure use some improvement.

First, the control flow. Look what it's doing: compute a Maybe value, then if it's Just, compute another Maybe value from its contents. That's exactly what "bind" does! This means it can be nicely represented as a do block.

Second, read is a partial function. This means it can fail at runtime if there is a non-number in the file. I understand this probably Should Never HappenTM, but you'd be surprised how many things that "should never happen" do, in fact, happen all the time. :-)

But since you're already in the Maybe context, it would be easy to use readMaybe instead, and skip the malformed numbers.

Applying both points:

parseLine line = do
 i <- elemIndex '\t' line
 let (word, strCount) = splitAt i line
 count <- readMaybe strCount
 return (word, count)

Thank you! The <&> operator is very handy. I'm still getting comfortable thinking in monads, so your feedback is very helpful.

Stack Exchange Network

Loading Files from a Directory in Haskell

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Loading Files from a Directory in Haskell

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions