
I have this simple code in Python:

input = open("baseforms.txt", "r", encoding='utf8')
S = {}
for i in input:
    words = i.split()
    S.update({j: words[0] for j in words})
print(S.get("sometext", "not found"))
print(len(S))

It needs 300 MB to run; "baseforms.txt" itself is 123 MB.
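For reference, a memory figure for the dict-building loop can be reproduced with Python's standard-library tracemalloc module. This is a toy sketch with inline sample data standing in for baseforms.txt (the sample lines are illustrative, not from the real file):

```python
import tracemalloc

tracemalloc.start()

# Toy stand-in for baseforms.txt: each line is "base form1 form2 ..."
lines = ["run running ran runs", "go going went goes"]

S = {}
for line in lines:
    words = line.split()
    S.update({j: words[0] for j in words})

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(len(S))   # 8 -- every word, including each base, becomes a key
print(peak > 0) # True -- peak traced allocation in bytes
```

On the real 123 MB file, `tracemalloc.get_traced_memory()` reports only Python-level allocations, so the figure it gives can differ from resident memory as seen by the OS.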

I've written the same code in Haskell:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Map as M
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Text.Lazy.Encoding (decodeUtf8)
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as I
import Control.Monad (liftM)

main = do
    text <- B.readFile "baseforms.txt"
    let m = (M.fromList . concatMap (parseLine . decodeUtf8)) (B.lines text)
    print (M.lookup "sometext" m)
    print (M.size m)
  where
    parseLine line = let base:forms = T.words line in [(f, base) | f <- forms]

It requires 544 MB and is slower than the Python version. Why? Is it possible to optimise the Haskell version?

The data file can be downloaded here.

I've tried the non-lazy versions of Data.Text and Data.Text.IO; memory usage was around 650 MB.

I've also tried hashtables, but memory usage grew to 870 MB.

asked Mar 25, 2015 at 15:01

1 Answer


I can't seem to download your data set. Could you re-upload a smaller dataset (say, ~5 MB, along with your testing statistics for it) to Pastebin or somewhere?

The only obvious performance difference from your Python version is that Data.Map is based on a binary tree, whereas Python's dict is a hash table. This makes fromList an O(n log n) operation in Haskell (unless you luck into your list of (form, base) pairs being all distinct and sorted in ascending order), while constructing the Python dict is O(n). That should make a significant difference in time, but not necessarily in space, so at this point I'm out of ideas. Make sure you're compiling with -O2, I guess.

(There also seems to be a functional difference between your two versions, if my understanding of Python hasn't rusted away after years of neglect: {j:words[0] for j in words} will also pair words[0] with itself, so maybe you want parseLine to be defined as let forms@(base:_) = ....)
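To make that functional difference concrete, here is a toy example (invented sample words, not the poster's data) contrasting the dict comprehension with the behaviour of the Haskell parseLine, which drops the base word:

```python
line = "run running ran"
words = line.split()

# Python version: every word, including the base word itself, becomes a key
d_python = {j: words[0] for j in words}

# Equivalent of the Haskell parseLine: only the forms after the base are keys
base, *forms = words
d_haskell = {f: base for f in forms}

print(d_python)   # {'run': 'run', 'running': 'run', 'ran': 'run'}
print(d_haskell)  # {'running': 'run', 'ran': 'run'}
```

So the two programs build maps of different sizes: the Python dict contains one extra self-mapping per input line, which also explains a discrepancy in the printed map sizes.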

On issues of style, see the rewrite below. Note particularly the reordering of definitions to eliminate the where-clause. In your version it was easy to miss because it sat inline with the do-block; you can do as I did, or try decreasing the indent.

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Map as Map
import qualified Data.ByteString.Lazy as ByteString
import qualified Data.Text.Lazy as Text
import Data.Text.Lazy.Encoding (decodeUtf8)

main :: IO ()
main = do
    text <- fmap decodeUtf8 $ ByteString.readFile "baseforms.txt"
    let parseLine line = let (base:forms) = Text.words line
                         in  zip forms (repeat base)
        m = Map.fromList . concatMap parseLine . Text.lines $ text
    print $ Map.lookup "sometext" m
    print $ Map.size m
answered Mar 25, 2015 at 21:12
  • "Beginning with GHC 6.12, text I/O is performed using the system or handle's current locale and line ending conventions." My locale encoding is not UTF-8, but the file is in UTF-8. – Commented Mar 26, 2015 at 8:15
  • Huh, so it does. I'm getting to my dotage, I guess. – Commented Mar 26, 2015 at 21:04
  • Wow, memory usage decreased to 467 MB total memory in use (26 MB lost due to fragmentation) with the new version :) – Commented Mar 27, 2015 at 8:18
  • Progress! That is probably due to fusion: Text.lines isn't subject to fusion, so it helps to decode first, since decoding is subject to fusion. Try switching to Data.Map.Strict next; I suspect the Map is accumulating a whole pile of thunks to small Text values. You shouldn't need to change anything but the import line. – Commented Mar 28, 2015 at 8:51
  • Switching to Data.Map.Strict had no effect, but switching to strict Text and ByteString decreased memory usage to 374 MB total memory in use (0 MB lost due to fragmentation). Thank you for your help! – Commented Mar 31, 2015 at 15:02
