Processing text files for data extraction and analysis

Question 1

I started learning Haskell to see if I can use it at my job. A lot of my work is processing text files for data extraction and analysis.

For my first test, I append a counter to the end of each line of a .csv text file (for now I am ignoring proper CSV format handling).

My current code in Haskell is:

import qualified Data.ByteString.Lazy.Char8 as L

addRecordId :: String -> String -> Int -> String
addRecordId "" _ _ = ""
addRecordId rec sep cnt = rec ++ sep ++ show cnt

addIncrementalId :: String -> [String] -> [String]
addIncrementalId _ [] = []
addIncrementalId sep ls = addId ls 1
  where
    addId [] _ = []
    addId (l:ls) cnt = addRecordId l sep cnt : addId ls (cnt + 1)

identifyFile :: FilePath -> String -> IO [String]
identifyFile path sep = do
  inpStr <- L.readFile path
  return (addIncrementalId sep (lines (L.unpack inpStr)))

printLnIdentifiedFile :: IO [String] -> IO ()
printLnIdentifiedFile ls = do
  lines <- ls
  putStr (unlines lines)

main = printLnIdentifiedFile (identifyFile "myfile.csv" ";")

This code processes a file of 1GB (4,845,000 records) in 90 seconds.

The C code below does the same job in 10 seconds:

#include <stdio.h>
#include <stdlib.h>

int main() {
    FILE *f = fopen("myfile.csv", "r");
    size_t bytes_read;
    size_t current_buffer_size = 400;
    char *buffer = calloc(current_buffer_size, 1);
    long cnt = 1;
    while ((bytes_read = getline(&buffer, &current_buffer_size, f)) > 0) {
        if (feof(f)) break;
        buffer[ bytes_read - 2 ] = 0;
        printf("%s;%ld\n", buffer, cnt++);
    }
    fclose(f);
    return 0;
}

And the Java code below does the job in 30 seconds:

package test.perf.numadr;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class NumAdr {
    static public void main(String[] args) {
        BufferedReader br = null;
        try {
            String sCurrentLine;
            br = new BufferedReader(new FileReader("myfile.csv"));
            int cnt = 1;
            String lineWithId;
            while ((sCurrentLine = br.readLine()) != null) {
                lineWithId = sCurrentLine + ";" + cnt;
                cnt++;
                System.out.println(lineWithId);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (br != null) br.close();
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }
    }
}

For each test, I print the result to stdout and redirect to a result file.

My Haskell code is 3 times slower than the Java code and 9 times slower than the C code. As I'm a beginner in Haskell, I suspect my Haskell code is far from optimal.

How can I improve my program?

Question 2

@ChriX (if you're still around), if your job includes a lot of text processing and data extraction, you should really learn the tools for the job: sed, awk and perl, in order of increasing complexity. For instance, if I understand correctly that all this code does is append line numbers after a semicolon, you can do the same with:

sed = filename | sed -n 'h;n;G;s/\n/;/;p' > outputfile

sed could be called the assembly language of text processing.

Question 3

I think your program needs to be more Haskell-style (and shorter). Here is my rewrite:

import qualified Data.ByteString.Lazy.Char8 as L

processContents :: L.ByteString -> L.ByteString
processContents contents = L.unlines out
  where
    out = zipWith f [1..] (L.lines contents)
    sep = L.pack ";"
    f n l = l `L.append` sep `L.append` L.pack (show n)

main = do
  contents <- L.readFile "myfile.csv"
  L.putStr (processContents contents)
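
To see what processContents produces without needing the 1 GB file, here is a minimal sketch that runs the same function on a small in-memory input. The two-line sample input is made up for illustration, and the counter is pinned to Int here, which is an assumption the rewrite above does not make.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L

-- Same function as in the rewrite; only the counter type is made explicit.
processContents :: L.ByteString -> L.ByteString
processContents contents = L.unlines out
  where
    out = zipWith f [1 :: Int ..] (L.lines contents)
    sep = L.pack ";"
    f n l = l `L.append` sep `L.append` L.pack (show n)

main :: IO ()
main = L.putStr (processContents (L.pack "a;b\nc;d\n"))
-- prints:
-- a;b;1
-- c;d;2
```

Because the input and output are both lazy ByteStrings, the same function can stream a large file chunk by chunk instead of loading it all at once.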

Need better speed? It is trivial to convert this code into a parallel version.

import qualified Data.ByteString.Lazy.Char8 as L
import Control.Parallel.Strategies
import GHC.Conc (numCapabilities)

processContents :: L.ByteString -> L.ByteString
processContents contents = L.unlines (out `using` parListChunk chunks rdeepseq)
  where
    out = zipWith f [1..] (L.lines contents)
    sep = L.pack ";"
    f n l = l `L.append` sep `L.append` L.pack (show n)
    chunks = 1 + (length out `div` numCapabilities)

main = do
  contents <- L.readFile "myfile.csv"
  L.putStr (processContents contents)

To compile it use the following command:

ghc -O2 -threaded -with-rtsopts=-N program.hs

Question 4

Your code takes the same time (12 seconds). But wow! I think I need to study your code to understand everything :)

Question 5

I only use a few functions from the standard library. Nothing fancy. By the way, it is now trivial to further increase performance by using a parallel map and thereby utilizing all cores. :-)

Question 6

I started learning Haskell a week ago by reading Real World Haskell, and I don't yet know the meaning of the backquote. What is its purpose?

Question 7

It is just syntactic sugar for functions of two arguments. a `f` b is equivalent to f a b.
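
As a small illustration of the backquote syntax, the sketch below writes the same expressions in prefix and infix form. The names joinPrefix and joinInfix are made up for this example.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L

-- Prefix form: the function comes first, then both arguments.
joinPrefix :: L.ByteString
joinPrefix = L.append (L.pack "abc") (L.pack ";1")

-- Infix form: backquotes let the function sit between its arguments.
joinInfix :: L.ByteString
joinInfix = L.pack "abc" `L.append` L.pack ";1"

main :: IO ()
main = do
  print (joinPrefix == joinInfix)            -- prints True
  print (div 7 2 :: Int, 7 `div` 2 :: Int)   -- prints (3,3)
```

The infix form tends to read better when calls are chained, as in l `L.append` sep `L.append` L.pack (show n).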

Question 8

FYI, you might try using zipWith f [1..] (L.lines contents) instead of map (\ (a, b) -> foo) (zip [1..] (L.lines contents)).
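
The equivalence behind this suggestion can be sketched on plain Strings. The function names numberedZip and numberedZipWith and the sample lines are hypothetical; both functions append ";" and a 1-based counter to each line.

```haskell
-- Pair each line with its index first, then map over the pairs.
numberedZip :: [String] -> [String]
numberedZip ls = map (\(n, l) -> l ++ ";" ++ show n) (zip [1 :: Int ..] ls)

-- zipWith fuses the zip and the map into one step, with no tuple pattern.
numberedZipWith :: [String] -> [String]
numberedZipWith = zipWith (\n l -> l ++ ";" ++ show n) [1 :: Int ..]

main :: IO ()
main = mapM_ putStrLn (numberedZipWith ["x;y", "z;w"])
-- prints:
-- x;y;1
-- z;w;2
```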

Question 9

An obvious bottleneck is the conversion to String and back. Try changing the type signatures to

addRecordId :: L.ByteString -> String -> Int -> L.ByteString
addIncrementalId :: String -> [L.ByteString] -> [L.ByteString]
identifyFile :: FilePath -> L.ByteString -> IO [L.ByteString]

Data.ByteString.Lazy.Char8 has its own lines, unlines and putStr, which you can use, as well as cons or append for constructing your annotated line.
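
One way this suggestion might look in practice is sketched below. It keeps the function names from the question but never converts to String. As an assumption of this sketch, the separator is typed as L.ByteString in all three functions (the signatures suggested above mix String and L.ByteString), and the explicit recursion is replaced with zipWith.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L

-- Append the separator and the counter to a record; empty records stay empty,
-- as in the original String version.
addRecordId :: L.ByteString -> L.ByteString -> Int -> L.ByteString
addRecordId line sep cnt
  | L.null line = L.empty
  | otherwise = L.concat [line, sep, L.pack (show cnt)]

-- Number the records starting from 1.
addIncrementalId :: L.ByteString -> [L.ByteString] -> [L.ByteString]
addIncrementalId sep ls = zipWith (\l n -> addRecordId l sep n) ls [1 ..]

-- Read the file lazily and annotate each line.
identifyFile :: FilePath -> L.ByteString -> IO [L.ByteString]
identifyFile path sep = addIncrementalId sep . L.lines <$> L.readFile path

main :: IO ()
main = identifyFile "myfile.csv" (L.pack ";") >>= L.putStr . L.unlines
```

Only the counter still goes through String (via show), which is a short conversion per line rather than one over the whole file.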

The rewrite using zipWith and Data.ByteString.Lazy.Char8 shown earlier in this thread is the accepted answer, posted by Piotr Miś on 2014-05-17 20:58:39Z.

