3
\$\begingroup\$

I am working on writing code for find difference between two text files that should be ideally same.

the format of file is docid_{num}\t{num}\t{num0},{f0} {num1},{f1} ..... {numN},{fN}\n

eg: docid_0\t300\t5,2 4,3 9,2\n

adding link to relative big file that can be used to test: https://gist.github.com/Lightyagami1/a539a4e311fef9104fb467021af8246c (small 5k lines file)

https://drive.google.com/file/d/1Z3MrokMtWyNH9BrnaAiljFgE25BnwFQ8/view?usp=sharing (10K lines file)

to perform this operation efficiently I have sorted both files based on numerical value within docid_{num} (num value) . and then wish to use an approach similar to 2 pointers.

that is assume N = docid_{n} (from file1) M = docid_{m} (from file2)

here I intend to use N and M as indexes. (again mentioning both files are sorted)

if N > M : docid_{N} is not present in file2 else if N < M : docid_{M} is not present in file1 else : both file contain doc_id with same values.

now the haskell code that I have written doesn't seems to be as great as similar golang code. golang code takes roughly 2 seconds while this take 35 seconds. Any tips to improve it are welcomed.

I understand both are not exactly same, but I have tried to make main algorithm same.

result of profiling code, compiled with -O2 optimization flag.

 diff7 +RTS -sstderr -p -RTS sa3_10000 sa3_1_10000
 total time = 34.05 secs (34052 ticks @ 1000 us, 1 processor)
 total alloc = 97,676,092,088 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
readAsInt Main diff7.hs:44:1-39 77.1 76.3
readBothTogether.wrds1 Main diff7.hs:146:5-43 4.8 5.9
readBothTogether.wrds2 Main diff7.hs:147:5-43 4.8 5.9
splitInner.res Main diff7.hs:37:5-45 4.4 6.4
compare'.freqs1 Main diff7.hs:173:5-57 1.9 1.5
compare'.freqs2 Main diff7.hs:177:5-57 1.8 1.5
makePairs Main diff7.hs:41:1-77 0.9 1.5

haskell code

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE Strict #-}
import qualified Data.IntMap.Strict as IM (IntMap, fromList, difference, keys, intersection, toList, lookup, findWithDefault, empty, size)
import System.Environment
import qualified Data.Text as L
import Data.Text.IO as LTIO
import Data.Int
splitter :: Char -> Bool
splitter ' ' = True
splitter '\t' = True
splitter _ = False
splitInner :: [L.Text] -> [(Int, Int)]
splitInner inp = res1
 where
 res = L.splitOn (L.singleton ',') <$> inp
 res1 = makePairs res
makePairs :: [[L.Text]] -> [(Int, Int)]
makePairs = map (\x -> (readAsInt . head $ x, readAsInt . (head . tail) $ x))
readAsInt :: L.Text -> Int
readAsInt x = read $! L.unpack x :: Int
 {-
 Comparing result of two files need to take care of:
 + docuemtns present in result of only 1 file
 + common documents (present in both file's result)
 - missing token in one of file's result.
 - common token, but frequency different
 - happy scenario, frequency match too.
 -}
data DiffStruct =
 MkDiffStruct
 { documentsPresentInBoth :: Int
 , documentsPresentOnlyInFirst :: Int
 , documentsPresentOnlyInSecond :: Int
 , documentsTokenCountDifferent :: Int
 , documentsTokenFrequencyDifferent :: Int64
 , documentsTokenFrequencySame :: Int64
 }
 deriving (Show)
correctingFactor = 14 -- 14 is constant difference due to algo difference
readBothTogether :: L.Text -> L.Text -> DiffStruct
readBothTogether t1 t2 = MkDiffStruct a b c d e f
 where
 wrds1 = L.split splitter <$> L.lines t1
 wrds2 = L.split splitter <$> L.lines t2
 (a,b,c,d,e,f) = compare' wrds1 wrds2
add' :: (Int,Int, Int, Int, Int64, Int64) -> (Int, Int, Int, Int, Int64, Int64) -> (Int, Int, Int, Int, Int64, Int64)
add' (a1, a2, a3, a4, a5, a6) (b1, b2, b3, b4, b5, b6) = (a1+b1, a2+b2, a3+b3, a4+b4, a5+b5, a6+b6)
{-
 add' will contain
 - document present only in first
 - document present only in second
 * same document
 - token present only in first
 - token present only in second
 - token present in both but different frequency
 - token present in both and same frequency
-}
compare' :: [[L.Text]] -> [[L.Text]] -> (Int, Int, Int, Int, Int64, Int64)
compare' _ [] = (0,0,0,0,0,0)
compare' [] _ = (0,0,0,0,0,0)
compare' inp1@(x:xs) inp2@(y:ys)
 | head1 > head2 = add' (1,0,0,0,0,0) $ compare' xs inp2
 | head1 < head2 = add' (0,1,0,0,0,0) $ compare' inp1 ys
 | otherwise = add' (0, 0, tokensPresentOnlyInFirst, tokensPresentOnlyInSecond, sameVal, diffVal) $ compare' xs ys
 where
 head1 = head x
 seconds1 = readAsInt . head . tail $ x
 freqs1 = IM.fromList . splitInner . drop 2 . init $ x
 head2 = head y
 seconds2 = readAsInt . head . tail $ y
 freqs2 = IM.fromList . splitInner . drop 2 . init $ y
 tokensPresentOnlyInFirst = IM.size $ IM.difference freqs1 freqs2
 tokensPresentOnlyInSecond = IM.size $ IM.difference freqs2 freqs1
 commonKeys = IM.intersection freqs1 freqs2
 (sameVal, diffVal) = compareCommonKeysInTwoMaps (IM.keys commonKeys) freqs1 freqs2
compareCommonKeysInTwoMaps :: [Int] -> IM.IntMap Int -> IM.IntMap Int -> (Int64, Int64)
compareCommonKeysInTwoMaps [] _ _ = (0, 0)
compareCommonKeysInTwoMaps (x:xs) m1 m2
 | val1 == val2 = add2 (1, 0) $ compareCommonKeysInTwoMaps xs m1 m2
 | otherwise = add2 (0, 1) $ compareCommonKeysInTwoMaps xs m1 m2
 where
 val1 = IM.findWithDefault (-1) x m1
 val2 = IM.findWithDefault (-1) x m2
add2 :: (Int64, Int64) -> (Int64, Int64) -> (Int64, Int64)
add2 (a1, a2) (b1, b2) = (a1+b1, a2+b2)
main :: IO ()
main = do
 args <- getArgs
 let fp1 = head args
 fp2 = args !! 1
 print fp1
 print fp2
 inp1 <- LTIO.readFile fp1
 inp2 <- LTIO.readFile fp2
 print $ readBothTogether inp1 inp2

Adding golang code I'm comparing with

package main
import (
 "bufio"
 "fmt"
 "io"
 "log"
 "os"
 "strconv"
 "strings"
)
func main() {
 f1n := os.Args[1]
 f2n := os.Args[2]
 fmt.Println("first file: ", f1n)
 fmt.Println("second file: ", f2n)
 f1, err := os.Open(f1n)
 if err != nil {
 log.Fatalf("failed to open file1")
 }
 f2, err := os.Open(f2n)
 if err != nil {
 log.Fatalf("failed to open file2")
 }
 defer f1.Close()
 defer f2.Close()
 var line1 string
 var line2 string
 scanner1 := bufio.NewReader(f1)
 scanner2 := bufio.NewReader(f2)
 
 docPresentOnlyInFirst := 0
 docPresentOnlyInSecond := 0
 tokenPresentOnlyInFirst := 0
 tokenPresentOnlyInSecond := 0
 tokenPresentInBothSameFreq := 0
 tokenPresentInBothDiffFreq := 0
 i, j, ind := 0, 0, 0
 inc1, inc2 := true, true
 for {
 if inc1 {
 line1, err = scanner1.ReadString('\n')
 if line1 == "" || (err != nil && err != io.EOF) {
 break
 }
 // As the line contains newline "\n" character at the end, we could remove it.
 line1 = line1[:len(line1)-1]
 }
 if inc2 {
 line2, err = scanner2.ReadString('\n')
 if line2 == "" || (err != nil && err != io.EOF) {
 break
 }
 // As the line contains newline "\n" character at the end, we could remove it.
 line2 = line2[:len(line2)-1]
 }
 Doc1, f1 := lineParser(line1)
 Doc2, f2 := lineParser(line2)
 if Doc1 > Doc2 {
 docPresentOnlyInFirst++
 j++
 inc1 = false
 } else if Doc1 < Doc2 {
 docPresentOnlyInSecond++
 i++
 inc2 = false
 } else {
 a, b, c, d := compareFreq(f1, f2)
 tokenPresentOnlyInFirst += a
 tokenPresentOnlyInSecond += b
 tokenPresentInBothSameFreq += c
 tokenPresentInBothDiffFreq += d
 i++
 j++
 inc1, inc2 = true, true
 }
 if ind%50000 == 0 {
 fmt.Println("currently processing ", i, Doc1, j, Doc2, ind)
 }
 ind++
 }
 fmt.Println("total documents processed ", i, j, ind)
 fmt.Println("docPresentOnlyInFirst: ", docPresentOnlyInFirst)
 fmt.Println("docPresentOnlyInSecond: ", docPresentOnlyInSecond)
 fmt.Println("tokenPresentOnlyInFirst: ", tokenPresentOnlyInFirst)
 fmt.Println("tokenPresentOnlyInSecond: ", tokenPresentOnlyInSecond)
 fmt.Println("tokenPresentInBothSameFreq: ", tokenPresentInBothSameFreq)
 fmt.Println("tokenPresentInBothDiffFreq: ", tokenPresentInBothDiffFreq)
}
func compareFreq(f1, f2 map[int]int) (int, int, int, int) {
 a, c, d := onlyFirst(f1, f2)
 b, _, _ := onlyFirst(f2, f1)
 return a, b, c, d
}
func onlyFirst(f1, f2 map[int]int) (int, int, int) {
 a, d, c := 0, 0, 0
 for k1, v1 := range f1 {
 if v2, ok := f2[k1]; !ok {
 a++
 } else {
 if v1 == v2 {
 c++
 } else {
 d++
 }
 }
 }
 return a, c, d
}
func SplitOnNonLetters(s string) []string {
 return strings.Fields(s)
}
func lineParser(line string) (int, map[int]int) {
 parts := SplitOnNonLetters(line)
 if len(parts) <= 0 {
 tmp := make(map[int]int)
 return 0, tmp
 }
 docId, err := strconv.Atoi(parts[0][6:])
 if err != nil {
 log.Fatalf("failed to parse dociId %v", docId)
 }
 // unigramCnt, _ := strconv.Atoi(parts[1])
 val := parts[2:]
 count := parseCommaSep(val)
 return docId, count
}
func parseCommaSep(inp []string) map[int]int {
 tmp := make(map[int]int)
 for _, pair := range inp {
 keyVal := strings.Split(pair, ",")
 key, err := strconv.Atoi(keyVal[0])
 if err != nil {
 log.Fatalf("failed to parse key %v", key)
 }
 val, err := strconv.Atoi(keyVal[1])
 if err != nil {
 log.Fatalf("failed to parse value %v", val)
 }
 tmp[key] = val
 }
 return tmp
}

Edit1: Using below code haskell code could reach till ~500% run time of go code, though there are significant divergence between both now such as

  • never converting to int value
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE Strict #-}
{-# LANGUAGE RecordWildCards #-}
import qualified Data.Map.Strict as M (Map, fromList, difference, keys, intersection, toList, lookup, findWithDefault, empty, size)
import System.Environment
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Int
import qualified Data.List as L
splitter :: Char -> Bool
splitter ' ' = True
splitter '\t' = True
splitter _ = False
{-# INLINE splitter #-}
splitInner :: [B.ByteString] -> [(B.ByteString, B.ByteString)]
splitInner inp = res1
 where
 res = B.split ',' <$> inp
 res1 = makePairs res
{-# INLINE splitInner #-}
makePairs :: [[B.ByteString]] -> [(B.ByteString, B.ByteString)]
makePairs = map (\x -> (head x, (head . tail) x))
{-# INLINE makePairs #-}
{-
 add' will contain
 - document present only in first
 - document present only in second
 * same document
 - token present only in first
 - token present only in second
 - token present in both but different frequency
 - token present in both and same frequency
-}
data DiffStruct =
 MkDiffStruct
 { documentsPresentOnlyInFirst :: Int
 , documentsPresentOnlyInSecond :: Int
 , tokensPresentOnlyInFirst :: Int
 , tokensPresentOnlyInSecond :: Int
 , tokenFrequencyDifferent :: Int
 , tokenFrequencySame :: Int
 }
 deriving (Show)
readBothTogether :: B.ByteString -> B.ByteString -> DiffStruct
readBothTogether t1 t2 = MkDiffStruct a b c d e f
 where
 wrds1 = B.splitWith splitter <$> B.lines t1
 wrds2 = B.splitWith splitter <$> B.lines t2
 (a,b,c,d,e,f) = compare' wrds1 wrds2 (0,0,0,0,0,0)
{-# INLINE readBothTogether #-}
data CState = CState
 { dpoif ::Int
 , dpois :: Int
 , tpoif :: Int
 , tpois ::Int
 , tfd :: Int
 , tfs :: Int
 }
add' :: (Int,Int, Int, Int, Int, Int) -> (Int, Int, Int, Int, Int, Int) -> (Int, Int, Int, Int, Int, Int)
add' (a1, a2, a3, a4, a5, a6) (b1, b2, b3, b4, b5, b6) = (a1+b1, a2+b2, a3+b3, a4+b4, a5+b5, a6+b6)
{-# INLINE add' #-}
compare' :: [[B.ByteString]] -> [[B.ByteString]] -> (Int, Int, Int, Int, Int, Int) -> (Int, Int, Int, Int, Int, Int)
compare' _ [] acc = acc
compare' [] _ acc = acc
compare' inp1@(x:xs) inp2@(y:ys) acc
 | head1 > head2 = compare' xs inp2 (add' (1,0,0,0,0,0) acc)
 | head1 < head2 = compare' inp1 ys (add' (0,1,0,0,0,0) acc)
 | otherwise = compare' xs ys (add' (0, 0, tokensPresentOnlyInFirst, tokensPresentOnlyInSecond, sameVal, diffVal) acc)
 where
 head1 = head x
 freqs1 = M.fromList . splitInner . drop 2 $ x
 head2 = head y
 freqs2 = M.fromList . splitInner . drop 2 $ y
 (tokensPresentOnlyInFirst, sameVal, diffVal) = compareTwoMaps freqs1 freqs2
 (tokensPresentOnlyInSecond, _, _) = compareTwoMaps freqs2 freqs1
{-# INLINE compare' #-}
data Diff = Diff
 { tokenOnlyFirst :: Int
 , tokenFreqSame :: Int
 , tokenFreqDiff :: Int
 }
compareTwoMaps :: M.Map B.ByteString B.ByteString -> M.Map B.ByteString B.ByteString -> (Int, Int, Int)
compareTwoMaps m1 m2 = (tokenOnlyFirst, tokenFreqSame, tokenFreqDiff)
 where
 Diff { .. } = L.foldl' go (Diff 0 0 0) keysOfFirst
 keysOfFirst = M.keys m1
 go Diff { .. } c = Diff (tokenOnlyFirst + onlyf) (tokenFreqSame + same) (tokenFreqDiff + diff)
 where
 {-# INLINE val1 #-}
 val1 = M.findWithDefault "-1" c m1
 {-# INLINE val2 #-}
 val2 = M.findWithDefault "-1" c m2
 {-# INLINE onlyf #-}
 onlyf | val2 == "-1" = 1
 | otherwise = 0
 {-# INLINE same #-}
 same | val1 == val2 = 1
 | otherwise = 0
 {-# INLINE diff #-}
 diff | val1 == val2 = 0
 | otherwise = 1
{-# INLINE compareTwoMaps #-}
main :: IO ()
main = do
 args <- getArgs
 let fp1 = head args
 fp2 = args !! 1
 inp1 <- B.readFile fp1
 inp2 <- B.readFile fp2
 print $ readBothTogether inp1 inp2
```
asked Jul 1, 2021 at 19:33
\$\endgroup\$
3
  • 2
    \$\begingroup\$ Do I understand correctly that you have written the same program in 2 languages and want to improve the execution time of the haskell implementation? \$\endgroup\$ Commented Jul 2, 2021 at 11:53
  • \$\begingroup\$ yes, @pacmaninbw that is the intent. \$\endgroup\$ Commented Jul 3, 2021 at 15:32
  • \$\begingroup\$ Same question was asked at haskell's reddit group. the user Noughtmare came up with very efficient re-implementation: reddit.com/r/haskell/comments/odfpoe/… \$\endgroup\$ Commented Jul 11, 2021 at 16:46

2 Answers 2

4
\$\begingroup\$

On my machine, your original Haskell version run on two copies of the "sa3_10000" test file takes about 20 seconds, while your Go version takes about 2 seconds. Note that profiled GHC binaries run significantly slower, even if you don't generate a profile, so you'll want to make sure you're timing an unprofiled binary when making these comparisons.

Anyway, the profiling you did shows that the most time was being spent in readAsInt, so it might be worth targeting that.

For historical reasons, read works on Strings, and the process of converting a Text to a String and processing the resulting linked list of characters is really slow. Data.Text.Read provides a decimal function that can do a better job:

readAsInt :: L.Text -> Int
readAsInt x = let Right (n, "") = L.decimal x in n

That change reduces the runtime from 20 seconds to 5 seconds, so about 2.5x the Go version. Re-profiling reveals no obvious additional bottlenecks.

I'm not sure I'd spend any more time trying to improve the performance of the Haskell version. Any additional improvements are likely to be modest, unless you make a truly extraordinary effort.

answered Jul 7, 2021 at 21:25
\$\endgroup\$
3
  • 1
    \$\begingroup\$ Thank you a lot. I'm using Data.ByteString.Char8.readInt instead, my program is extraordinary faster. \$\endgroup\$ Commented Jul 11, 2021 at 6:29
  • \$\begingroup\$ @HaruoWakakusa, do you mind comparing your implementation with the lazy version of Noughtmare's implementation (reddit.com/r/haskell/comments/odfpoe/…) \$\endgroup\$ Commented Jul 11, 2021 at 16:49
  • \$\begingroup\$ I deleted my post because it is not beautiful. I think you should make your answer and close this question. \$\endgroup\$ Commented Jul 12, 2021 at 13:49
1
\$\begingroup\$

Adding final version, as suggested by @haruo

Below code snippet was originally authored by user Noughtmare in response https://www.reddit.com/r/haskell/comments/odfpoe/looking_for_ways_to_improve_performance_of/?utm_source=share&utm_medium=web2x&context=3 at haskell's reddit group with minor tweaking at my end: https://www.reddit.com/r/haskell/comments/odfpoe/looking_for_ways_to_improve_performance_of/h40dd38?utm_source=share&utm_medium=web2x&context=3

with this run time is reduced to ~50% of initial go code present in question. the most interesting changes are: (quoting from @Noughmare's comment)

  • Better parser that does everything in one pass
  • Convert bytestrings to int early so that later comparisons are fast
{-# LANGUAGE DeriveFunctor, BangPatterns #-}
import qualified Data.ByteString.Lazy as B
import Data.ByteString.Lazy (ByteString)
import qualified Data.IntMap.Strict as IntMap
import Data.IntMap.Strict (IntMap)
import qualified Data.IntMap.Merge.Strict as Merge
import Data.Word
import Data.Foldable
import System.Environment
import Data.Int
-- a simple non-backtracking parser, equivalent to a state monad
-- this will throw errors on failure
newtype Parser a = Parser { parse :: ByteString -> (a, ByteString) }
 deriving Functor
instance Applicative Parser where
 pure x = Parser (\s -> (x, s))
 Parser p <*> Parser q = Parser $ \s ->
 let
 (f, s') = p s
 (x, s'') = q s'
 x' = f x
 in (x', s'')
instance Monad Parser where
 Parser p >>= f = Parser $ \s ->
 let
 (x, s') = p s
 in parse (f x) s'
pInt :: Parser Int
pInt = Parser (go 0) where
 go !n !s
 | h < 10 = go (10 * n + fromIntegral h) (B.tail s)
 | otherwise = (n, s)
 where
 h = B.head s - 48
pDrop :: Int64 -> Parser ()
pDrop n = Parser (\s -> ((), B.drop n s))
pTail :: Parser ()
pTail = Parser (\s -> ((), B.tail s))
pNext :: Parser Word8
pNext = Parser (\s -> (B.head s, B.tail s))
pLine :: Parser (Int, Int, IntMap Int)
pLine = (,,) <$ pDrop 6 <*> pInt <* pTail <*> pInt <* pTail <*> go IntMap.empty
 where
 go !xs = do
 (k, v) <- (,) <$> pInt <* pTail <*> pInt
 h <- pNext
 case h of
 10 -> pure (IntMap.insert k v xs)
 _ -> go (IntMap.insert k v xs)
pLines :: Parser [(Int, Int, IntMap Int)]
pLines = (:) <$> pLine <*> ifNotEof pLines where
 ifNotEof p = Parser (\s -> if B.null s then ([], s) else parse p s)
data CState = CState
 { dpoif :: {-# UNPACK #-} !Int
 , dpois :: {-# UNPACK #-} !Int
 , tstate :: !TState
 } deriving Show
instance Semigroup CState where
 CState x1 x2 x3 <> CState y1 y2 y3 = 
 CState (x1 + y1) (x2 + y2) (x3 <> y3)
instance Monoid CState where
 mempty = CState 0 0 mempty
data TState = TState
 { tpoif :: {-# UNPACK #-} !Int
 , tpois :: {-# UNPACK #-} !Int
 , tfd :: {-# UNPACK #-} !Int
 , tfs :: {-# UNPACK #-} !Int
 } deriving Show
instance Semigroup TState where
 TState x1 x2 x3 x4 <> TState y1 y2 y3 y4 = 
 TState (x1 + y1) (x2 + y2) (x3 + y3) (x4 + y4)
instance Monoid TState where
 mempty = TState 0 0 0 0
compare' :: [(Int, Int, IntMap Int)] -> [(Int, Int, IntMap Int)] -> CState -> CState
compare' xs [] !s = s <> CState (length xs) 0 mempty
compare' [] ys !s = s <> CState 0 (length ys) mempty
compare' xs@((x1,x2,x3):xs') ys@((y1,y2,y3):ys') !s = case compare x1 y1 of
 LT -> compare' xs' ys (s <> CState 1 0 mempty)
 GT -> compare' xs ys' (s <> CState 0 1 mempty)
 EQ -> compare' xs' ys' (s <> CState 0 0 (compare'' x3 y3))
compare'' :: IntMap Int -> IntMap Int -> TState
compare'' xs ys = fold $ Merge.merge
 (Merge.mapMissing (\_ _ -> TState 1 0 0 0))
 (Merge.mapMissing (\_ _ -> TState 0 1 0 0))
 (Merge.zipWithMatched (\_ x y -> if x == y then TState 0 0 1 0 else TState 0 0 0 1))
 xs
 ys
main :: IO ()
main = do
 fp1:fp2:_ <- getArgs
 xs <- B.readFile fp1
 ys <- B.readFile fp2
 let
 pxs = fst $ parse pLines xs
 pys = fst $ parse pLines ys
 print $ compare' pxs pys mempty
```
answered Jul 20, 2021 at 12:32
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.