I'm a very new Haskell programmer, and I've written the following code for loading TSV files. I think it can be improved, but I'm not sure how.
- I feel like the giant "do" block in loadData is inelegant, and that there's likely a better way to approach this.
- I think I'm trying to avoid the IO monad because I'm not sure how to work with it, but that there's probably a better way to handle mapping
parseTSV
over the contents of the files.
One note: I'm not super concerned about performance - this will run once at the beginning of my program. I want to load in all the files entirely to build a composite data structure from their contents.
module LanguageMachine (loadData) where
import System.Directory (listDirectory)
import Data.List ((\\), elemIndex)
import Data.Maybe (mapMaybe)
parseTsv :: String -> [(String, Int)]
parseTsv contents = mapMaybe parseLine (lines contents)
parseLine :: String -> Maybe (String, Int)
parseLine line =
case elemIndex '\t' line of
Just i ->
let
(word, count) = splitAt i line
in
Just (word, read count :: Int)
Nothing -> Nothing
loadData :: FilePath -> [FilePath] -> IO [(String, [(String, Int)])]
loadData path exclude = do
files <- listDirectory path
let filtered = files \\ exclude
let prefixed = map ((path ++ "/") ++) filtered
contents <- traverse readFile prefixed
let results = map parseTsv contents
return $ zip filtered results
The files look like this, which are two-value TSV lines of a word and then the number of occurrences of that word:
ARIA 4308
ORE 4208
Thank you!
1 Answer 1
The loadData
function looks just fine to me. I don't see a way it could be any more elegant given what it's doing. I do actually like the way you have separated reading the file and parsing it. That's the way to go.
Your concern about "avoiding the IO monad" shouldn't be a concern. On the contrary, that's actually a "best practice", known as "pushing I/O to the boundary". It means that the majority of your program should be pure, and only a thin shell should deal with external world. This way you can easily test and debug the majority of your program.
The only thing I might do is make the last three lines one expression:
traverse readFile prefixed
<&> map parseTsv
<&> zip filtered
I find this a bit more readable, but that's just a matter of personal taste.
But the parseLine
function could sure use some improvement.
First, the control flow. Look what it's doing: compute a Maybe
value, then if it's Just
, compute another Maybe
value from its contents. That's exactly what "bind" does! This means it can be nicely represented as a do
block.
Second, read
is a partial function. This means it can fail at runtime if there is a non-number in the file. I understand this probably Should Never HappenTM, but you'd be surprised how many things that "should never happen" do, in fact, happen all the time. :-)
But since you're already in the Maybe
context, it would be easy to use readMaybe
instead, and skip the malformed numbers.
Applying both points:
parseLine line = do
i <- elemIndex '\t' line
let (word, strCount) = splitAt i line
count <- readMaybe strCount
return (word, count)
-
\$\begingroup\$ Thank you! The <&> operator is very handy. I'm still getting comfortable thinking in monads, so your feedback is very helpful. \$\endgroup\$Erty Seidohl– Erty Seidohl2020年03月29日 20:44:26 +00:00Commented Mar 29, 2020 at 20:44
-
\$\begingroup\$ Glad I could help! \$\endgroup\$Fyodor Soikin– Fyodor Soikin2020年03月30日 07:42:52 +00:00Commented Mar 30, 2020 at 7:42