I started learning Haskell to see if I can use it at my job. A lot of my work is processing text files for data extraction and analysis.
For my first test, I added a counter at the end of each line of a .csv text file (for now I don't care about handling the CSV format properly).
My current code in Haskell is:
import qualified Data.ByteString.Lazy.Char8 as L

addRecordId :: String -> String -> Int -> String
addRecordId "" _ _ = ""
addRecordId rec sep cnt = rec ++ sep ++ show cnt

addIncrementalId :: String -> [String] -> [String]
addIncrementalId _ [] = []
addIncrementalId sep ls = addId ls 1
  where
    addId [] _ = []
    addId (l:ls) cnt = addRecordId l sep cnt : addId ls (cnt + 1)

identifyFile :: FilePath -> String -> IO [String]
identifyFile path sep = do
  inpStr <- L.readFile path
  return (addIncrementalId sep (lines (L.unpack inpStr)))

printLnIdentifiedFile :: IO [String] -> IO ()
printLnIdentifiedFile ls = do
  lines <- ls
  putStr (unlines lines)

main = printLnIdentifiedFile (identifyFile "myfile.csv" ";")
This code processes a file of 1GB (4,845,000 records) in 90 seconds.
This C code below does the same job in 10 seconds:
#include <stdio.h>
#include <stdlib.h>

int main() {
    FILE *f = fopen("myfile.csv", "r");
    size_t bytes_read;
    size_t current_buffer_size = 400;
    char *buffer = calloc(current_buffer_size, 1);
    long cnt = 1;
    while ((bytes_read = getline(&buffer, &current_buffer_size, f)) > 0) {
        if (feof(f)) break;
        buffer[ bytes_read - 2 ] = 0;
        printf("%s;%ld\n", buffer, cnt++);
    }
    fclose(f);
    return 0;
}
And the Java code below does the job in 30 seconds:
package test.perf.numadr;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class NumAdr {
    static public void main(String[] args) {
        BufferedReader br = null;
        try {
            String sCurrentLine;
            br = new BufferedReader(new FileReader("myfile.csv"));
            int cnt = 1;
            String lineWithId;
            while ((sCurrentLine = br.readLine()) != null) {
                lineWithId = sCurrentLine + ";" + cnt;
                cnt++;
                System.out.println(lineWithId);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (br != null) br.close();
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }
    }
}
For each test, I print the result to stdout and redirect it to a result file.
My Haskell code is 3 times slower than the Java code and 9 times slower than the C code. As I'm a beginner in Haskell, I assume my Haskell code is far from optimal.
How can I improve my program?
2 Answers
I think your program needs to be more Haskell-style (and shorter). Here is my rewrite:
import qualified Data.ByteString.Lazy.Char8 as L

processContents :: L.ByteString -> L.ByteString
processContents contents = L.unlines out
  where out = zipWith f [1..] (L.lines contents)
        sep = L.pack ";"
        f n l = l `L.append` sep `L.append` L.pack (show n)

main = do
  contents <- L.readFile "myfile.csv"
  L.putStr (processContents contents)
Need better speed? It is trivial to convert this code into a parallel version.
import qualified Data.ByteString.Lazy.Char8 as L
import Control.Parallel.Strategies
import GHC.Conc(numCapabilities)

processContents :: L.ByteString -> L.ByteString
processContents contents = L.unlines (out `using` parListChunk chunks rdeepseq)
  where out = zipWith f [1..] (L.lines contents)
        sep = L.pack ";"
        f n l = l `L.append` sep `L.append` L.pack (show n)
        chunks = 1 + (length out `div` numCapabilities)

main = do
  contents <- L.readFile "myfile.csv"
  L.putStr (processContents contents)
To compile it use the following command:
ghc -O2 -threaded -with-rtsopts=-N program.hs
- Your code takes the same time (12 seconds). But wow! I think I must take a look at your code to understand everything :) – ChriX, May 17, 2014 at 21:05
- I use a few functions from the standard library. Nothing fancy. Btw, now it is trivial to further increase performance by using a parallel map and therefore utilizing all cores. :-) – uraf, May 17, 2014 at 21:10
- I started learning Haskell a week ago by reading Real World Haskell, and currently I don't know the meaning of the backquote. What is its purpose? – ChriX, May 17, 2014 at 21:18
- It is just syntactic sugar for functions of two arguments. a `f` b is equivalent to f a b. – uraf, May 17, 2014 at 21:20
- FYI, you might try using zipWith f [1..] (L.lines contents) instead of map (\ (a, b) -> foo) (zip [1..] (L.lines contents)). – Louis Wasserman, May 19, 2014 at 6:37
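The two points raised in these comments (backquote infix notation, and zipWith versus map over zip) can be checked with a small sketch. The names withId, withIdPrefix, numbered, numberedZip and sample are made up for illustration and do not appear in the answer; only L.append, L.pack, L.unlines, L.putStr, zip, zipWith and map come from the libraries.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L

-- Hypothetical helper: append ";<n>" to a line, using backquote infix notation.
withId :: Int -> L.ByteString -> L.ByteString
withId n l = l `L.append` L.pack (';' : show n)

-- The same function written with ordinary prefix application.
withIdPrefix :: Int -> L.ByteString -> L.ByteString
withIdPrefix n l = L.append l (L.pack (';' : show n))

-- Numbering the lines with zipWith ...
numbered :: [L.ByteString] -> [L.ByteString]
numbered = zipWith withId [1..]

-- ... and the longer map-over-zip spelling of the same thing.
numberedZip :: [L.ByteString] -> [L.ByteString]
numberedZip ls = map (\(n, l) -> withId n l) (zip [1..] ls)

main :: IO ()
main = do
  let sample = map L.pack ["a;b", "c;d"]   -- made-up input lines
  L.putStr (L.unlines (numbered sample))
  -- Both spellings should produce the same list.
  print (numbered sample == numberedZip sample)
```

On the two sample lines this should print a;b;1 and c;d;2, followed by True.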
An obvious bottleneck is the conversion to String and back. Try changing the type signatures to
addRecordId :: L.ByteString -> String -> Int -> L.ByteString
addIncrementalId :: String -> [L.ByteString] -> [L.ByteString]
identifyFile :: FilePath -> L.ByteString -> IO [L.ByteString]
ByteString.Lazy.Char8 has its own lines, unlines and putStr, which you can use, as well as cons or append for constructing your annotated line.
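A minimal sketch of what that change could look like, keeping the structure of the question's code. It is one possible reading of the suggestion, not the answerer's own code: it assumes the separator is passed as a lazy ByteString everywhere, it keeps the rule that an empty record stays empty, and it renames the rec argument to record.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L

-- Append the separator and the counter to one record.
-- Empty records stay empty, as in the String version from the question.
addRecordId :: L.ByteString -> L.ByteString -> Int -> L.ByteString
addRecordId record sep cnt
  | L.null record = L.empty
  | otherwise     = L.concat [record, sep, L.pack (show cnt)]

-- Number every line starting from 1; no String conversion involved.
addIncrementalId :: L.ByteString -> [L.ByteString] -> [L.ByteString]
addIncrementalId sep ls =
  zipWith (\record cnt -> addRecordId record sep cnt) ls [1..]

identifyFile :: FilePath -> L.ByteString -> IO [L.ByteString]
identifyFile path sep = do
  contents <- L.readFile path
  return (addIncrementalId sep (L.lines contents))

main :: IO ()
main = do
  ls <- identifyFile "myfile.csv" (L.pack ";")
  L.putStr (L.unlines ls)
```

As in the original, the counter still advances over empty lines, so the numbering matches the String version line for line.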
sed, awk and perl, in order of increasing complexity. For instance, if I understand correctly that all this code does is append line numbers after a semicolon, you can do the same with sed = filename | sed -n 'h;n;G;s/\n/;/;p' > outputfile. sed could be called the assembly language of text processing.