Counting word frequencies in Haskell

Question 1

I am leaning Haskell and decided to do the following lab assignment from CS240h: Functional Systems in Haskell. I welcome any advice on more idiomatic coding style and performance tips.

Notes:

I chose Data.HashTable on the assumption that it would be faster than Data.Map as the number of entries increase. Edit: this seems to be an erroneous assumption as timing on about 5M of input has Data.Map at 3.5s and Data.HashTable at 5.3s.
Some of the _ <- mapM does not feel right.
I also wondered if using some sort of ByteString would be better, but I was also confused on how to make it work with UTF-8 and coming up with a suitable hash function, so I dropped that idea.

Here is the code I ended with:

module Lab1 where 
import Prelude hiding (lookup, readFile, getContents)
import System.Environment (getArgs)
import System.IO.UTF8 (readFile, getContents)
import Data.Char (isAlpha, toLower, isPunctuation)
import Data.HashTable
import Data.Maybe (fromMaybe)
import Data.Function (on)
import Data.List (sortBy, dropWhileEnd)
type Frequencies = HashTable String Int
-- Get relevant words according to the instructions in the assignment.
-- Convert to lowercase, keep the single quotes ("Don't" -> "don't")
getWords :: String -> [String]
getWords str = let
 isGoodChar c = isAlpha(c) || c `elem` "`'"
 chunk s = let (word, notGood) = span isGoodChar s
 (_, rest) = span (not . isGoodChar) notGood 
 in word : (if rest == "" then [] else chunk rest)
 trim = (dropWhile isPunctuation) . (dropWhileEnd isPunctuation)
 lower = map toLower str
 in [trimmed | word <- chunk lower, let trimmed = trim word, trimmed /= ""]
emptyFrequencies :: IO Frequencies
emptyFrequencies = fromList hashString []
increaseCount :: Frequencies -> String -> IO Frequencies
increaseCount frequencies word = do
 maybeCount <- lookup frequencies word
 let count = fromMaybe 0 maybeCount
 _ <- update frequencies word (count + 1)
 return frequencies
countWords :: Frequencies -> String -> IO Frequencies
countWords frequencies str = do
 _ <- mapM (increaseCount frequencies) (getWords str) 
 return frequencies
-- read list files from arguments or use stdin
processInputs :: IO Frequencies
processInputs = do
 frequencies <- emptyFrequencies
 args <- getArgs
 contents <- if args == [] then sequence [getContents]
 else mapM readFile args
 _ <- mapM (countWords frequencies) contents
 return frequencies
printFrequencies :: IO ()
printFrequencies = do
 frequencies <- processInputs
 list <- toList frequencies
 let sorted = sortBy (compare `on` negate . snd) list
 maxLengthWord = maximum $ map (length . fst) sorted
 maxLengthBar = 80 - maxLengthWord - 1
 maxCount = (snd . head) sorted
 baseUnit = (fromIntegral maxLengthBar) / (fromIntegral maxCount) :: Double
 makeBar count = replicate (round $ (fromIntegral count) * baseUnit) '#'
 pad w = take (maxLengthWord + 1) $ (w ++ cycle " ")
 bars = [(pad w) ++ bar | (w, count) <- sorted, let bar = makeBar count, bar /= ""]
 _ <- mapM putStrLn bars 
 return ()

I spent a bit more time and find out I could use foldlM and mapM_ in a couple places.

Also I ran the code through hlint and it pointed out that break isGoodChar could replace span (not . isGoodChar) and that I had some unnecessary parentheses. So here is the updated version:

module Lab1 where 
import Prelude hiding (lookup, readFile, getContents)
import System.Environment (getArgs)
import System.IO.UTF8 (readFile, getContents)
import Data.Char (isAlpha, toLower, isPunctuation)
import Data.HashTable
import Data.Maybe (fromMaybe)
import Data.Function (on)
import Data.List (sortBy, dropWhileEnd)
import Data.Foldable (foldlM)
type Frequencies = HashTable String Int
-- Get relevant words according to the instructions in the assignment.
-- Convert to lowercase, keep the single quotes ("Don't" -> "don't")
getWords :: String -> [String]
getWords str = let
 isGoodChar c = isAlpha c || c `elem` "`'"
 chunk s = let (word, notGood) = span isGoodChar s
 (_, rest) = break isGoodChar notGood 
 in word : (if null rest then [] else chunk rest)
 trim = dropWhile isPunctuation . dropWhileEnd isPunctuation
 lower = map toLower str
 in [trimmed | word <- chunk lower, let trimmed = trim word, trimmed /= ""]
emptyFrequencies :: IO Frequencies
emptyFrequencies = fromList hashString []
increaseCount :: Frequencies -> String -> IO Frequencies
increaseCount frequencies word = do
 maybeCount <- lookup frequencies word
 let count = fromMaybe 0 maybeCount
 _ <- update frequencies word (count + 1)
 return frequencies
countWords :: Frequencies -> String -> IO Frequencies
countWords frequencies str =
 foldlM increaseCount frequencies (getWords str)
-- read list files from arguments or use stdin
processInputs :: IO Frequencies
processInputs = do
 frequencies <- emptyFrequencies
 args <- getArgs
 contents <- if null args
 then sequence [getContents]
 else mapM readFile args
 foldlM countWords frequencies contents
printFrequencies :: IO ()
printFrequencies = do
 frequencies <- processInputs
 list <- toList frequencies
 let sorted = sortBy (compare `on` negate . snd) list
 maxLengthWord = maximum $ map (length . fst) sorted
 maxLengthBar = 80 - maxLengthWord - 1
 maxCount = (snd . head) sorted
 baseUnit = fromIntegral maxLengthBar / fromIntegral maxCount :: Double
 makeBar count = replicate (round $ fromIntegral count * baseUnit) '#'
 pad w = take (maxLengthWord + 1) (w ++ cycle " ")
 bars = [pad w ++ bar | (w, count) <- sorted, let bar = makeBar count, bar /= ""]
 mapM_ putStrLn bars

Question 2

Inside do notation, you don't need to use _ <- to discard the value of a computation. So _ <- update frequencies word (count + 1) can be written as update frequencies word (count + 1)

Haskell programmers tend to prefer immutable collections like Data.Map or Data.IntMap over hash tables. Data.HashTable uses mutable references and requires you to work inside the IO monad. Generally, it is good practice to relegate the IO monad to the outer layers of your program and keep the core pure.

To work efficiently with Unicode data, consider using the text package.

Question 3

I added the _ <- because eclipsefp reports a warning if I don't. So I thought that was a best practice. Yes, Data.Map turned out to be more efficient and also let me parallelize the code. I will try out the text package to see if it speeds up the code.

Question 4

Using the text package worked great! It brought down the time from 3.5s to 2.1s.

Question 5

Handling strings as list of characters (like the String type does) is conceptually elegant but can be inefficient for intensive string processing. In those cases it's better to use text.

Question 6

Yes, you will get a warning if you use mapM and discard its results without being explicit (i.e. writing _ <- mapM ...). However, instead of using mapM, you can use mapM_ which is the same as mapM except that it ignores the results from the actions. This is a common pattern, and there is also forM_, sequence_ and so on.

Daniel Díaz Carrete Daniel Díaz Carrete 4282 silver badges10 bronze badges · Accepted Answer · 2013-02-20 21:32:27Z

2

\$\begingroup\$

Inside do notation, you don't need to use _ <- to discard the value of a computation. So _ <- update frequencies word (count + 1) can be written as update frequencies word (count + 1)

Haskell programmers tend to prefer immutable collections like Data.Map or Data.IntMap over hash tables. Data.HashTable uses mutable references and requires you to work inside the IO monad. Generally, it is good practice to relegate the IO monad to the outer layers of your program and keep the core pure.

To work efficiently with Unicode data, consider using the text package.

Share

answered Feb 20, 2013 at 21:32

Daniel Díaz Carrete's user avatar

Daniel Díaz Carrete Daniel Díaz Carrete

4282 silver badges10 bronze badges

\$\endgroup\$

4

\$\begingroup\$ I added the _ <- because eclipsefp reports a warning if I don't. So I thought that was a best practice. Yes, Data.Map turned out to be more efficient and also let me parallelize the code. I will try out the text package to see if it speeds up the code. \$\endgroup\$

huynhjl
– huynhjl

2013年02月21日 04:14:42 +00:00
Commented Feb 21, 2013 at 4:14
\$\begingroup\$ Using the text package worked great! It brought down the time from 3.5s to 2.1s. \$\endgroup\$

huynhjl
– huynhjl

2013年02月22日 06:42:35 +00:00
Commented Feb 22, 2013 at 6:42
1

\$\begingroup\$ Handling strings as list of characters (like the String type does) is conceptually elegant but can be inefficient for intensive string processing. In those cases it's better to use text. \$\endgroup\$

Daniel Díaz Carrete
– Daniel Díaz Carrete

2013年02月22日 07:34:17 +00:00
Commented Feb 22, 2013 at 7:34
\$\begingroup\$ Yes, you will get a warning if you use mapM and discard its results without being explicit (i.e. writing _ <- mapM ...). However, instead of using mapM, you can use mapM_ which is the same as mapM except that it ignores the results from the actions. This is a common pattern, and there is also forM_, sequence_ and so on. \$\endgroup\$

beta
– beta

2013年04月05日 16:42:47 +00:00
Commented Apr 5, 2013 at 16:42

Add a comment |

Stack Exchange Network

Counting word frequencies in Haskell

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Counting word frequencies in Haskell

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions