4
\$\begingroup\$

I'm a very new Haskell programmer, and I've written the following code for loading TSV files. I think it can be improved, but I'm not sure how.

  1. I feel like the giant "do" block in loadData is inelegant, and that there's likely a better way to approach this.
  2. I think I'm trying to avoid the IO monad because I'm not sure how to work with it, but that there's probably a better way to handle mapping parseTSV over the contents of the files.

One note: I'm not super concerned about performance - this will run once at the beginning of my program. I want to load in all the files entirely to build a composite data structure from their contents.

module LanguageMachine (loadData) where
import System.Directory (listDirectory)
import Data.List ((\\), elemIndex)
import Data.Maybe (mapMaybe)
parseTsv :: String -> [(String, Int)]
parseTsv contents = mapMaybe parseLine (lines contents)
parseLine :: String -> Maybe (String, Int)
parseLine line =
 case elemIndex '\t' line of
 Just i ->
 let
 (word, count) = splitAt i line
 in
 Just (word, read count :: Int)
 Nothing -> Nothing
loadData :: FilePath -> [FilePath] -> IO [(String, [(String, Int)])]
loadData path exclude = do
 files <- listDirectory path
 let filtered = files \\ exclude
 let prefixed = map ((path ++ "/") ++) filtered
 contents <- traverse readFile prefixed
 let results = map parseTsv contents
 return $ zip filtered results

The files look like this, which are two-value TSV lines of a word and then the number of occurrences of that word:

ARIA 4308
ORE 4208

Thank you!

asked Mar 29, 2020 at 12:06
\$\endgroup\$

1 Answer 1

4
\$\begingroup\$

The loadData function looks just fine to me. I don't see a way it could be any more elegant given what it's doing. I do actually like the way you have separated reading the file and parsing it. That's the way to go.

Your concern about "avoiding the IO monad" shouldn't be a concern. On the contrary, that's actually a "best practice", known as "pushing I/O to the boundary". It means that the majority of your program should be pure, and only a thin shell should deal with external world. This way you can easily test and debug the majority of your program.

The only thing I might do is make the last three lines one expression:

traverse readFile prefixed
 <&> map parseTsv
 <&> zip filtered

I find this a bit more readable, but that's just a matter of personal taste.


But the parseLine function could sure use some improvement.

First, the control flow. Look what it's doing: compute a Maybe value, then if it's Just, compute another Maybe value from its contents. That's exactly what "bind" does! This means it can be nicely represented as a do block.

Second, read is a partial function. This means it can fail at runtime if there is a non-number in the file. I understand this probably Should Never HappenTM, but you'd be surprised how many things that "should never happen" do, in fact, happen all the time. :-)

But since you're already in the Maybe context, it would be easy to use readMaybe instead, and skip the malformed numbers.

Applying both points:

parseLine line = do
 i <- elemIndex '\t' line
 let (word, strCount) = splitAt i line
 count <- readMaybe strCount
 return (word, count)
answered Mar 29, 2020 at 15:32
\$\endgroup\$
2
  • \$\begingroup\$ Thank you! The <&> operator is very handy. I'm still getting comfortable thinking in monads, so your feedback is very helpful. \$\endgroup\$ Commented Mar 29, 2020 at 20:44
  • \$\begingroup\$ Glad I could help! \$\endgroup\$ Commented Mar 30, 2020 at 7:42

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.