I am leaning Haskell and decided to do the following lab assignment from CS240h: Functional Systems in Haskell. I welcome any advice on more idiomatic coding style and performance tips.
Notes:
- I chose
Data.HashTable
on the assumption that it would be faster thanData.Map
as the number of entries increase. Edit: this seems to be an erroneous assumption as timing on about 5M of input hasData.Map
at 3.5s andData.HashTable
at 5.3s. - Some of the
_ <- mapM
does not feel right. - I also wondered if using some sort of
ByteString
would be better, but I was also confused on how to make it work with UTF-8 and coming up with a suitable hash function, so I dropped that idea.
Here is the code I ended with:
module Lab1 where
import Prelude hiding (lookup, readFile, getContents)
import System.Environment (getArgs)
import System.IO.UTF8 (readFile, getContents)
import Data.Char (isAlpha, toLower, isPunctuation)
import Data.HashTable
import Data.Maybe (fromMaybe)
import Data.Function (on)
import Data.List (sortBy, dropWhileEnd)
type Frequencies = HashTable String Int
-- Get relevant words according to the instructions in the assignment.
-- Convert to lowercase, keep the single quotes ("Don't" -> "don't")
getWords :: String -> [String]
getWords str = let
isGoodChar c = isAlpha(c) || c `elem` "`'"
chunk s = let (word, notGood) = span isGoodChar s
(_, rest) = span (not . isGoodChar) notGood
in word : (if rest == "" then [] else chunk rest)
trim = (dropWhile isPunctuation) . (dropWhileEnd isPunctuation)
lower = map toLower str
in [trimmed | word <- chunk lower, let trimmed = trim word, trimmed /= ""]
emptyFrequencies :: IO Frequencies
emptyFrequencies = fromList hashString []
increaseCount :: Frequencies -> String -> IO Frequencies
increaseCount frequencies word = do
maybeCount <- lookup frequencies word
let count = fromMaybe 0 maybeCount
_ <- update frequencies word (count + 1)
return frequencies
countWords :: Frequencies -> String -> IO Frequencies
countWords frequencies str = do
_ <- mapM (increaseCount frequencies) (getWords str)
return frequencies
-- read list files from arguments or use stdin
processInputs :: IO Frequencies
processInputs = do
frequencies <- emptyFrequencies
args <- getArgs
contents <- if args == [] then sequence [getContents]
else mapM readFile args
_ <- mapM (countWords frequencies) contents
return frequencies
printFrequencies :: IO ()
printFrequencies = do
frequencies <- processInputs
list <- toList frequencies
let sorted = sortBy (compare `on` negate . snd) list
maxLengthWord = maximum $ map (length . fst) sorted
maxLengthBar = 80 - maxLengthWord - 1
maxCount = (snd . head) sorted
baseUnit = (fromIntegral maxLengthBar) / (fromIntegral maxCount) :: Double
makeBar count = replicate (round $ (fromIntegral count) * baseUnit) '#'
pad w = take (maxLengthWord + 1) $ (w ++ cycle " ")
bars = [(pad w) ++ bar | (w, count) <- sorted, let bar = makeBar count, bar /= ""]
_ <- mapM putStrLn bars
return ()
I spent a bit more time and find out I could use foldlM
and mapM_
in a couple places.
Also I ran the code through hlint and it pointed out that break isGoodChar
could replace span (not . isGoodChar)
and that I had some unnecessary parentheses. So here is the updated version:
module Lab1 where
import Prelude hiding (lookup, readFile, getContents)
import System.Environment (getArgs)
import System.IO.UTF8 (readFile, getContents)
import Data.Char (isAlpha, toLower, isPunctuation)
import Data.HashTable
import Data.Maybe (fromMaybe)
import Data.Function (on)
import Data.List (sortBy, dropWhileEnd)
import Data.Foldable (foldlM)
type Frequencies = HashTable String Int
-- Get relevant words according to the instructions in the assignment.
-- Convert to lowercase, keep the single quotes ("Don't" -> "don't")
getWords :: String -> [String]
getWords str = let
isGoodChar c = isAlpha c || c `elem` "`'"
chunk s = let (word, notGood) = span isGoodChar s
(_, rest) = break isGoodChar notGood
in word : (if null rest then [] else chunk rest)
trim = dropWhile isPunctuation . dropWhileEnd isPunctuation
lower = map toLower str
in [trimmed | word <- chunk lower, let trimmed = trim word, trimmed /= ""]
emptyFrequencies :: IO Frequencies
emptyFrequencies = fromList hashString []
increaseCount :: Frequencies -> String -> IO Frequencies
increaseCount frequencies word = do
maybeCount <- lookup frequencies word
let count = fromMaybe 0 maybeCount
_ <- update frequencies word (count + 1)
return frequencies
countWords :: Frequencies -> String -> IO Frequencies
countWords frequencies str =
foldlM increaseCount frequencies (getWords str)
-- read list files from arguments or use stdin
processInputs :: IO Frequencies
processInputs = do
frequencies <- emptyFrequencies
args <- getArgs
contents <- if null args
then sequence [getContents]
else mapM readFile args
foldlM countWords frequencies contents
printFrequencies :: IO ()
printFrequencies = do
frequencies <- processInputs
list <- toList frequencies
let sorted = sortBy (compare `on` negate . snd) list
maxLengthWord = maximum $ map (length . fst) sorted
maxLengthBar = 80 - maxLengthWord - 1
maxCount = (snd . head) sorted
baseUnit = fromIntegral maxLengthBar / fromIntegral maxCount :: Double
makeBar count = replicate (round $ fromIntegral count * baseUnit) '#'
pad w = take (maxLengthWord + 1) (w ++ cycle " ")
bars = [pad w ++ bar | (w, count) <- sorted, let bar = makeBar count, bar /= ""]
mapM_ putStrLn bars
1 Answer 1
Inside do notation, you don't need to use _ <-
to discard the value of a computation. So
_ <- update frequencies word (count + 1)
can be written as update frequencies word (count + 1)
Haskell programmers tend to prefer immutable collections like Data.Map
or Data.IntMap
over hash tables. Data.HashTable
uses mutable references and requires you to work inside the IO monad. Generally, it is good practice to relegate the IO monad to the outer layers of your program and keep the core pure.
To work efficiently with Unicode data, consider using the text package.
-
\$\begingroup\$ I added the
_ <-
because eclipsefp reports a warning if I don't. So I thought that was a best practice. Yes,Data.Map
turned out to be more efficient and also let me parallelize the code. I will try out the text package to see if it speeds up the code. \$\endgroup\$huynhjl– huynhjl2013年02月21日 04:14:42 +00:00Commented Feb 21, 2013 at 4:14 -
\$\begingroup\$ Using the
text
package worked great! It brought down the time from 3.5s to 2.1s. \$\endgroup\$huynhjl– huynhjl2013年02月22日 06:42:35 +00:00Commented Feb 22, 2013 at 6:42 -
1\$\begingroup\$ Handling strings as list of characters (like the String type does) is conceptually elegant but can be inefficient for intensive string processing. In those cases it's better to use
text
. \$\endgroup\$Daniel Díaz Carrete– Daniel Díaz Carrete2013年02月22日 07:34:17 +00:00Commented Feb 22, 2013 at 7:34 -
\$\begingroup\$ Yes, you will get a warning if you use
mapM
and discard its results without being explicit (i.e. writing_ <- mapM ...
). However, instead of usingmapM
, you can usemapM_
which is the same asmapM
except that it ignores the results from the actions. This is a common pattern, and there is alsoforM_
,sequence_
and so on. \$\endgroup\$beta– beta2013年04月05日 16:42:47 +00:00Commented Apr 5, 2013 at 16:42