6
\$\begingroup\$

I am leaning Haskell and decided to do the following lab assignment from CS240h: Functional Systems in Haskell. I welcome any advice on more idiomatic coding style and performance tips.

Notes:

  • I chose Data.HashTable on the assumption that it would be faster than Data.Map as the number of entries increase. Edit: this seems to be an erroneous assumption as timing on about 5M of input has Data.Map at 3.5s and Data.HashTable at 5.3s.
  • Some of the _ <- mapM does not feel right.
  • I also wondered if using some sort of ByteString would be better, but I was also confused on how to make it work with UTF-8 and coming up with a suitable hash function, so I dropped that idea.

Here is the code I ended with:

module Lab1 where 
import Prelude hiding (lookup, readFile, getContents)
import System.Environment (getArgs)
import System.IO.UTF8 (readFile, getContents)
import Data.Char (isAlpha, toLower, isPunctuation)
import Data.HashTable
import Data.Maybe (fromMaybe)
import Data.Function (on)
import Data.List (sortBy, dropWhileEnd)
type Frequencies = HashTable String Int
-- Get relevant words according to the instructions in the assignment.
-- Convert to lowercase, keep the single quotes ("Don't" -> "don't")
getWords :: String -> [String]
getWords str = let
 isGoodChar c = isAlpha(c) || c `elem` "`'"
 chunk s = let (word, notGood) = span isGoodChar s
 (_, rest) = span (not . isGoodChar) notGood 
 in word : (if rest == "" then [] else chunk rest)
 trim = (dropWhile isPunctuation) . (dropWhileEnd isPunctuation)
 lower = map toLower str
 in [trimmed | word <- chunk lower, let trimmed = trim word, trimmed /= ""]
emptyFrequencies :: IO Frequencies
emptyFrequencies = fromList hashString []
increaseCount :: Frequencies -> String -> IO Frequencies
increaseCount frequencies word = do
 maybeCount <- lookup frequencies word
 let count = fromMaybe 0 maybeCount
 _ <- update frequencies word (count + 1)
 return frequencies
countWords :: Frequencies -> String -> IO Frequencies
countWords frequencies str = do
 _ <- mapM (increaseCount frequencies) (getWords str) 
 return frequencies
-- read list files from arguments or use stdin
processInputs :: IO Frequencies
processInputs = do
 frequencies <- emptyFrequencies
 args <- getArgs
 contents <- if args == [] then sequence [getContents]
 else mapM readFile args
 _ <- mapM (countWords frequencies) contents
 return frequencies
printFrequencies :: IO ()
printFrequencies = do
 frequencies <- processInputs
 list <- toList frequencies
 let sorted = sortBy (compare `on` negate . snd) list
 maxLengthWord = maximum $ map (length . fst) sorted
 maxLengthBar = 80 - maxLengthWord - 1
 maxCount = (snd . head) sorted
 baseUnit = (fromIntegral maxLengthBar) / (fromIntegral maxCount) :: Double
 makeBar count = replicate (round $ (fromIntegral count) * baseUnit) '#'
 pad w = take (maxLengthWord + 1) $ (w ++ cycle " ")
 bars = [(pad w) ++ bar | (w, count) <- sorted, let bar = makeBar count, bar /= ""]
 _ <- mapM putStrLn bars 
 return ()

I spent a bit more time and find out I could use foldlM and mapM_ in a couple places.

Also I ran the code through hlint and it pointed out that break isGoodChar could replace span (not . isGoodChar) and that I had some unnecessary parentheses. So here is the updated version:

module Lab1 where 
import Prelude hiding (lookup, readFile, getContents)
import System.Environment (getArgs)
import System.IO.UTF8 (readFile, getContents)
import Data.Char (isAlpha, toLower, isPunctuation)
import Data.HashTable
import Data.Maybe (fromMaybe)
import Data.Function (on)
import Data.List (sortBy, dropWhileEnd)
import Data.Foldable (foldlM)
type Frequencies = HashTable String Int
-- Get relevant words according to the instructions in the assignment.
-- Convert to lowercase, keep the single quotes ("Don't" -> "don't")
getWords :: String -> [String]
getWords str = let
 isGoodChar c = isAlpha c || c `elem` "`'"
 chunk s = let (word, notGood) = span isGoodChar s
 (_, rest) = break isGoodChar notGood 
 in word : (if null rest then [] else chunk rest)
 trim = dropWhile isPunctuation . dropWhileEnd isPunctuation
 lower = map toLower str
 in [trimmed | word <- chunk lower, let trimmed = trim word, trimmed /= ""]
emptyFrequencies :: IO Frequencies
emptyFrequencies = fromList hashString []
increaseCount :: Frequencies -> String -> IO Frequencies
increaseCount frequencies word = do
 maybeCount <- lookup frequencies word
 let count = fromMaybe 0 maybeCount
 _ <- update frequencies word (count + 1)
 return frequencies
countWords :: Frequencies -> String -> IO Frequencies
countWords frequencies str =
 foldlM increaseCount frequencies (getWords str)
-- read list files from arguments or use stdin
processInputs :: IO Frequencies
processInputs = do
 frequencies <- emptyFrequencies
 args <- getArgs
 contents <- if null args
 then sequence [getContents]
 else mapM readFile args
 foldlM countWords frequencies contents
printFrequencies :: IO ()
printFrequencies = do
 frequencies <- processInputs
 list <- toList frequencies
 let sorted = sortBy (compare `on` negate . snd) list
 maxLengthWord = maximum $ map (length . fst) sorted
 maxLengthBar = 80 - maxLengthWord - 1
 maxCount = (snd . head) sorted
 baseUnit = fromIntegral maxLengthBar / fromIntegral maxCount :: Double
 makeBar count = replicate (round $ fromIntegral count * baseUnit) '#'
 pad w = take (maxLengthWord + 1) (w ++ cycle " ")
 bars = [pad w ++ bar | (w, count) <- sorted, let bar = makeBar count, bar /= ""]
 mapM_ putStrLn bars
asked Feb 14, 2013 at 15:21
\$\endgroup\$

1 Answer 1

2
\$\begingroup\$

Inside do notation, you don't need to use _ <- to discard the value of a computation. So _ <- update frequencies word (count + 1) can be written as update frequencies word (count + 1)

Haskell programmers tend to prefer immutable collections like Data.Map or Data.IntMap over hash tables. Data.HashTable uses mutable references and requires you to work inside the IO monad. Generally, it is good practice to relegate the IO monad to the outer layers of your program and keep the core pure.

To work efficiently with Unicode data, consider using the text package.

answered Feb 20, 2013 at 21:32
\$\endgroup\$
4
  • \$\begingroup\$ I added the _ <- because eclipsefp reports a warning if I don't. So I thought that was a best practice. Yes, Data.Map turned out to be more efficient and also let me parallelize the code. I will try out the text package to see if it speeds up the code. \$\endgroup\$ Commented Feb 21, 2013 at 4:14
  • \$\begingroup\$ Using the text package worked great! It brought down the time from 3.5s to 2.1s. \$\endgroup\$ Commented Feb 22, 2013 at 6:42
  • 1
    \$\begingroup\$ Handling strings as list of characters (like the String type does) is conceptually elegant but can be inefficient for intensive string processing. In those cases it's better to use text. \$\endgroup\$ Commented Feb 22, 2013 at 7:34
  • \$\begingroup\$ Yes, you will get a warning if you use mapM and discard its results without being explicit (i.e. writing _ <- mapM ...). However, instead of using mapM, you can use mapM_ which is the same as mapM except that it ignores the results from the actions. This is a common pattern, and there is also forM_, sequence_ and so on. \$\endgroup\$ Commented Apr 5, 2013 at 16:42

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.