10
\$\begingroup\$

I started learning Haskell to see if I can use it at my job. A lot of my work is processing text files for data extraction and analysis.

For my first test, I append a counter to the end of each line of a .csv text file (for now I am not concerned with handling the CSV format itself).

My current code in Haskell is:

import qualified Data.ByteString.Lazy.Char8 as L

addRecordId :: String -> String -> Int -> String
addRecordId "" _ _ = ""
addRecordId rec sep cnt = rec ++ sep ++ show cnt

addIncrementalId :: String -> [String] -> [String]
addIncrementalId _ [] = []
addIncrementalId sep ls = addId ls 1
  where
    addId [] _ = []
    addId (l:ls) cnt = addRecordId l sep cnt : addId ls (cnt + 1)

identifyFile :: FilePath -> String -> IO [String]
identifyFile path sep = do
  inpStr <- L.readFile path
  return (addIncrementalId sep (lines (L.unpack inpStr)))

printLnIdentifiedFile :: IO [String] -> IO ()
printLnIdentifiedFile ls = do
  lines <- ls
  putStr (unlines lines)

main = printLnIdentifiedFile (identifyFile "myfile.csv" ";")

This code processes a file of 1GB (4,845,000 records) in 90 seconds.

This C code below does the same job in 10 seconds:

#include <stdio.h>
#include <stdlib.h>

int main() {
    FILE *f = fopen("myfile.csv", "r");
    ssize_t bytes_read;  /* getline returns ssize_t, -1 at end of file or on error */
    size_t current_buffer_size = 400;
    char *buffer = calloc(current_buffer_size, 1);
    long cnt = 1;
    while ((bytes_read = getline(&buffer, &current_buffer_size, f)) > 0) {
        if (feof(f)) break;
        buffer[ bytes_read - 2 ] = 0;  /* drops the last two bytes; assumes CRLF line endings */
        printf("%s;%ld\n", buffer, cnt++);
    }
    free(buffer);
    fclose(f);
    return 0;
}

And the Java code below does the job in 30 seconds:

package test.perf.numadr;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class NumAdr {
    static public void main(String[] args) {
        BufferedReader br = null;
        try {
            String sCurrentLine;
            br = new BufferedReader(new FileReader("myfile.csv"));
            int cnt = 1;
            String lineWithId;
            while ((sCurrentLine = br.readLine()) != null) {
                lineWithId = sCurrentLine + ";" + cnt;
                cnt++;
                System.out.println(lineWithId);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (br != null) br.close();
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }
    }
}

For each test, I print the result to stdout and redirect to a result file.

My Haskell code is 3 times slower than the Java code and 9 times slower than the C code. As I'm a beginner in Haskell, I suspect my Haskell code is far from optimal.

How can I improve my program?

Jamal
asked May 17, 2014 at 20:18
\$\endgroup\$
1
  • \$\begingroup\$ @ChriX (if you're still around), if your job includes a lot of text processing and data extraction, you should really learn the tools for the job: sed, awk and perl, in order of increasing complexity. For instance—if I understand correctly that all this code does is append line numbers after a semicolon, you can do the same with sed = filename | sed -n 'h;n;G;s/\n/;/;p' > outputfile. sed could be called the assembly language of text processing. \$\endgroup\$ Commented Oct 25, 2015 at 14:23

2 Answers

12
\$\begingroup\$

I think your program needs to be more Haskell-style (and shorter). Here is my rewrite:

import qualified Data.ByteString.Lazy.Char8 as L

processContents :: L.ByteString -> L.ByteString
processContents contents = L.unlines out
  where out = zipWith f [1..] (L.lines contents)
        sep = L.pack ";"
        f n l = l `L.append` sep `L.append` L.pack (show n)

main = do
  contents <- L.readFile "myfile.csv"
  L.putStr (processContents contents)
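
To check what the rewrite does without a 1GB file, the same function can be run on a small in-memory input. This is only a sketch: the two-line sample input is made up, and the `Int` annotation on the counter is an addition that is not in the code above.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L

-- Same logic as the rewrite above: number each line, starting at 1.
processContents :: L.ByteString -> L.ByteString
processContents contents = L.unlines out
  where
    out   = zipWith f [1 :: Int ..] (L.lines contents)
    sep   = L.pack ";"
    f n l = l `L.append` sep `L.append` L.pack (show n)

-- Made-up sample input standing in for myfile.csv.
main :: IO ()
main = L.putStr (processContents (L.pack "a;b\nc;d\n"))
```

Because `[1..]` is an infinite list, `zipWith` simply stops when the lines run out, so no explicit counter has to be threaded through the recursion.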

Need better speed? It is trivial to convert this code into a parallel one.

import qualified Data.ByteString.Lazy.Char8 as L
import Control.Parallel.Strategies
import GHC.Conc(numCapabilities)

processContents :: L.ByteString -> L.ByteString
processContents contents = L.unlines (out `using` parListChunk chunks rdeepseq)
  where out = zipWith f [1..] (L.lines contents)
        sep = L.pack ";"
        f n l = l `L.append` sep `L.append` L.pack (show n)
        chunks = 1 + (length out `div` numCapabilities)

main = do
  contents <- L.readFile "myfile.csv"
  L.putStr (processContents contents)

To compile it use the following command:

ghc -O2 -threaded -with-rtsopts=-N program.hs
answered May 17, 2014 at 20:58
\$\endgroup\$
14
  • 1
    \$\begingroup\$ Your code takes the same time (12 seconds). But wow! I think I need to study your code to understand everything :) \$\endgroup\$ Commented May 17, 2014 at 21:05
  • \$\begingroup\$ I use a few functions from the standard library. Nothing fancy. Btw, now it is trivial to further increase performance by using a parallel map and therefore utilizing all cores. :-) \$\endgroup\$ Commented May 17, 2014 at 21:10
  • \$\begingroup\$ I started learning Haskell a week ago by reading Real World Haskell, and I don't yet know the meaning of the backquote. What is its purpose? \$\endgroup\$ Commented May 17, 2014 at 21:18
  • 1
    \$\begingroup\$ It is just syntactic sugar for functions of two arguments. a `f` b is equivalent to f a b. \$\endgroup\$ Commented May 17, 2014 at 21:20
  • 1
    \$\begingroup\$ FYI, you might try using zipWith f [1..] (L.lines contents) instead of map (\ (a, b) -> foo) (zip [1..] (L.lines contents)). \$\endgroup\$ Commented May 19, 2014 at 6:37
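
The backquote notation asked about in the comments can be checked with a small sketch. The helper names `joinInfix` and `joinPrefix` are hypothetical and exist only for this illustration; both are meant to show that the infix and prefix spellings denote the same function application.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L

-- Hypothetical helpers: the same append written infix (backquotes) and prefix.
joinInfix, joinPrefix :: L.ByteString -> L.ByteString -> L.ByteString
joinInfix  a b = a `L.append` b
joinPrefix a b = L.append a b

main :: IO ()
main = do
  -- An ordinary two-argument function used both ways.
  print (7 `div` 2 == div 7 2)
  -- The ByteString append from the answer, used both ways.
  print (joinInfix (L.pack "a;b") (L.pack ";1") == joinPrefix (L.pack "a;b") (L.pack ";1"))
```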
5
\$\begingroup\$

An obvious bottleneck is the conversion to String and back. Try changing the type signatures to

addRecordId :: L.ByteString -> String -> Int -> L.ByteString
addIncrementalId :: String -> [L.ByteString] -> [L.ByteString]
identifyFile :: FilePath -> L.ByteString -> IO [L.ByteString]

Data.ByteString.Lazy.Char8 has its own lines, unlines and putStr, which you can use in place of the Prelude versions, as well as cons or append for constructing your annotated line.
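
A minimal sketch of how the original functions might look with these signatures, assuming the separator stays a `String`. Note that `identifyFile` here takes a `String` separator rather than the `L.ByteString` shown above, a small deviation made so the value can be passed straight to `addIncrementalId`; the use of `zipWith` in place of the hand-written recursion is also a choice made for this sketch.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L

-- Append the separator and the counter; empty records stay empty,
-- as in the original String version.
addRecordId :: L.ByteString -> String -> Int -> L.ByteString
addRecordId record sep cnt
  | L.null record = L.empty
  | otherwise     = record `L.append` L.pack (sep ++ show cnt)

-- Number the records from 1; empty records still consume a number.
addIncrementalId :: String -> [L.ByteString] -> [L.ByteString]
addIncrementalId sep ls = zipWith (\l cnt -> addRecordId l sep cnt) ls [1 ..]

-- Assumption: String separator here, unlike the signature suggested above.
identifyFile :: FilePath -> String -> IO [L.ByteString]
identifyFile path sep = do
  inpStr <- L.readFile path
  return (addIncrementalId sep (L.lines inpStr))

main :: IO ()
main = identifyFile "myfile.csv" ";" >>= L.putStr . L.unlines
```

The file contents never pass through `String` in this version; only the short separator and the rendered counter do.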

answered May 17, 2014 at 20:31
\$\endgroup\$
