Speed up CSV reading code (vector of doubles)

Question 1

I am trying to read a single-column CSV of doubles, with a string header, into Java. The file is 11 megabytes and takes over 15 minutes to read, which is clearly unacceptable. In R, the same CSV loads in about 3 seconds.

The CSV file may contain strings, so I am parsing it with that in mind.

The CSV reading method needs to return Vector<Double> because other parts of the application rely on this output.

The issue is not the static isNumber method: each call takes about 200 nanoseconds, so it contributes approximately 0.2 seconds to the 15 minutes of parsing time.

Double.valueOf() only takes about 500 nanoseconds per call, so it is not that either.

csvData.add() only takes about 80 nanoseconds per call, so it is not that.

private static Vector<Double> readTXTFileSingle(String csvFileName) throws IOException {
    String line = null;
    BufferedReader stream = null;
    Vector<Double> csvData = new Vector<Double>();
    try {
        stream = new BufferedReader(new FileReader(csvFileName));
        while ((line = stream.readLine()) != null) {
            String[] splitted = line.split(",");
            if (!NumberUtils.isNumber(splitted[0])) {
                continue;
            }
            Double dataLine = Double.valueOf(splitted[0]);
            csvData.add(dataLine);
        }
    } finally {
        if (stream != null)
            stream.close();
    }
    return csvData;
}

Question 2

How is NumberUtils.isNumber implemented?

Question 3

@SimonAndréForsberg This is responsible for 200 milliseconds of the lag (200 nanoseconds per call, 1,000,000 calls), so it is not the problem.

Question 4

I'd still like to see it, if you're willing to provide it.

Question 5

Wait a moment, you profiled it but do not know where it is slow?

Question 6

Run a profiler on it... I mean, yes, we do reviews for performance improvements, but we do not do guesswork. Please run it through a profiler, see what takes that long, and then add that information to the question.
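Short of a full profiler run, a coarse per-phase timer can already show which step dominates. The following is only a sketch under assumptions: the class name PhaseTimer and its fields are made up for illustration, and a plain Double.parseDouble() with a caught exception stands in for NumberUtils.isNumber(), whose implementation is not shown in the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: accumulates wall-clock time per phase of the read loop.
final class PhaseTimer {
    long readNanos;   // time spent inside readLine()
    long splitNanos;  // time spent extracting the first field
    long parseNanos;  // time spent parsing and storing the value
    final List<Double> values = new ArrayList<>();

    void run(Reader source) {
        try (BufferedReader in = new BufferedReader(source)) {
            while (true) {
                long t0 = System.nanoTime();
                String line = in.readLine();
                long t1 = System.nanoTime();
                readNanos += t1 - t0;
                if (line == null) {
                    break;
                }

                String first = line.split(",", 2)[0];
                long t2 = System.nanoTime();
                splitNanos += t2 - t1;

                try {
                    values.add(Double.parseDouble(first));
                } catch (NumberFormatException e) {
                    // Header or other non-numeric line: skip it.
                }
                parseNanos += System.nanoTime() - t2;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Pass it a FileReader for the real file and print the three counters afterwards. If none of them comes anywhere near 15 minutes, the time is presumably being spent outside this loop, and a real profiler is the next step.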

Question 7

As per Bobby's comment, Vector is your problem, but not for the reason he says...

Vector is a synchronized class. Every method call on a Vector acquires the object's lock, which is overhead that buys you nothing when the Vector is only ever used from a single thread.

The fact that you use Vector indicates that you are running some really old code, or you have not properly read the JavaDoc for it.

A secondary performance problem is that each value is being converted to a Double object. In cases where you have large amounts of data, and where there is a primitive available for you to use, it is generally faster to use the primitive (in this case, double instead of Double).

You should also be using the Java 7 try-with-resources mechanism for your stream.

My recommendation is to change the signature of your method to return a List... actually, no, my recommendation is to return a primitive array, double[]. If you are interested in speed, this will be a significant improvement:

private static double[] readTXTFileSingle(String csvFileName) throws IOException {
    double[] csvData = new double[4096]; // arbitrary starting size.
    int dcnt = 0;
    try (BufferedReader stream = new BufferedReader(new FileReader(csvFileName))) {
        String line = null;
        while ((line = stream.readLine()) != null) {
            String[] splitted = line.split(",");
            if (!NumberUtils.isNumber(splitted[0])) {
                continue;
            }
            double dataLine = Double.parseDouble(splitted[0]);
            if (dcnt >= csvData.length) {
                // add 50% to array size.
                csvData = Arrays.copyOf(csvData, dcnt + (dcnt / 2));
            }
            csvData[dcnt++] = dataLine;
        }
    }
    return Arrays.copyOf(csvData, dcnt);
}
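If other parts of the application really must keep receiving a Vector<Double>, one option is to do the fast work on the primitive array and convert once at the boundary. This is a sketch only; LegacyBridge and toVector are hypothetical names, not part of the original code.

```java
import java.util.Vector;

// Hypothetical adapter for callers that still expect Vector<Double>.
final class LegacyBridge {
    static Vector<Double> toVector(double[] values) {
        // Pre-size the Vector so it never has to grow while being filled.
        Vector<Double> out = new Vector<>(values.length);
        for (double v : values) {
            out.add(v); // boxing happens once per element, here only
        }
        return out;
    }
}
```

This still pays the boxing cost for every element, so it only makes sense as a bridge until the callers can accept double[] or a List.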

Edit:

One other thing: if you want another performance tweak, use:

String[] splitted = line.split(",", 2);

Since you never access more than the first field in the record, you do not need to look for commas beyond the first one.
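To make the effect of the limit argument concrete, here is a small sketch. FirstField and its two methods are names invented for this example; the indexOf variant is an alternative I am adding for comparison, not something from the original code.

```java
// Hypothetical helper showing two ways to get the first CSV field.
final class FirstField {
    // With a limit of 2, split() stops after the first comma:
    // "1.5,a,b" becomes ["1.5", "a,b"].
    static String viaSplit(String line) {
        return line.split(",", 2)[0];
    }

    // Same result without regex machinery or array allocation.
    static String viaIndexOf(String line) {
        int comma = line.indexOf(',');
        return comma < 0 ? line : line.substring(0, comma);
    }
}
```

A side effect worth knowing: without a limit, split() drops trailing empty strings, so a line consisting only of "," yields an empty array and splitted[0] would throw. With a limit of 2 the array always has at least one element.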

Question 8

"Vector is a synchronized class." Ah, now I understand that odd "You shouldn't use that but I can't remember why" feeling I had.

Question 9

I doubt Vector's synchronization is responsible for a 15 minute run on an 11M file. I've tried this code (with Vector in place) on a local file and it consistently returned in under 1 second.
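For anyone who wants to repeat this kind of check, a synthetic input is easy to generate. The sketch below is an assumption about what a comparable file looks like (one header line, then one double per line); SampleCsv, the header text "value", and the seed parameter are all made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Random;

// Hypothetical generator for a single-column CSV of doubles with a header.
final class SampleCsv {
    static void write(Writer out, int rows, long seed) {
        Random rnd = new Random(seed); // fixed seed gives repeatable files
        try {
            out.write("value\n");
            for (int i = 0; i < rows; i++) {
                out.write(Double.toString(rnd.nextDouble()));
                out.write('\n');
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Wrapping a FileWriter in a BufferedWriter and asking for around a million rows should give a file in the same general size range as the one in the question, though the exact size depends on how many digits each value prints with.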

Question 10

@bowmore you are right... by the way, did you run it with a pre-set vector size or did you let it grow? The OP must have something else going on

Question 11

I've used the code as is in OP. I only supplied a very naive isNumber() implementation myself by basically calling Double.parseDouble() and catching the exception.
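A naive check along those lines might look like the sketch below. This is not the Commons Lang implementation of NumberUtils.isNumber(); NaiveNumberCheck is a hypothetical stand-in, and it accepts anything Double.parseDouble() accepts (for example "NaN", "1e3", or values with surrounding whitespace).

```java
// Hypothetical stand-in for NumberUtils.isNumber(), based on parse-and-catch.
final class NaiveNumberCheck {
    static boolean isNumber(String s) {
        if (s == null) {
            return false; // parseDouble(null) would throw NullPointerException
        }
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

With only a header line being non-numeric, the exception path is hit once, so its cost is negligible. It would matter more in a file where many lines fail to parse.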

Question 12

@bowmore: isNumber() could be one of the problems; to me it looks rather heavyweight.

Question 13

You could use the open-source library uniVocity-parsers to parse the CSV data into a vector of doubles; the library provides excellent performance through multi-threading, caching, and optimized code.

Try the following lines of code with the help of this library:

private static Vector<Double> readTXTFileSingle(String csvFileName) throws IOException {
    CsvParser parser = new CsvParser(new CsvParserSettings());
    List<String[]> resolvedData = parser.parseAll(new FileReader(csvFileName));
    Vector<Double> csvData = new Vector<Double>();
    for (String[] row : resolvedData) {
        if (!NumberUtils.isNumber(row[0])) {
            continue;
        }
        csvData.add(Double.valueOf(row[0]));
    }
    return csvData;
}

The accepted answer (Question 7 above) was posted by rolfl on 2014-02-04 15:10:05Z.

