I am trying to read a single-column CSV of doubles, with a string header, into Java. The file is 11 megabytes and takes over 15 minutes to read, which is clearly unacceptable. In R, this CSV would take about 3 seconds to load.
The CSV file may contain strings, so I am parsing it with this in mind.
The CSV reading method needs to return Vector<Double> because other parts of the application rely on this output.
The issue is not due to the isNumber static method: each call takes about 200 nanoseconds, which contributes approximately 0.2 seconds to the 15 minutes of parsing time.
Double.valueOf() only takes about 500 nanoseconds per call, so it is not that either.
csvData.add() only takes about 80 nanoseconds per call, so it is not that.
private static Vector<Double> readTXTFileSingle(String csvFileName) throws IOException {
    String line = null;
    BufferedReader stream = null;
    Vector<Double> csvData = new Vector<Double>();
    try {
        stream = new BufferedReader(new FileReader(csvFileName));
        while ((line = stream.readLine()) != null) {
            String[] splitted = line.split(",");
            if (!NumberUtils.isNumber(splitted[0])) {
                continue;
            }
            Double dataLine = Double.valueOf(splitted[0]);
            csvData.add(dataLine);
        }
    } finally {
        if (stream != null)
            stream.close();
    }
    return csvData;
}
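One way to check where the time actually goes is to accumulate the elapsed time of each phase over the whole file instead of timing single calls. The following is only a minimal sketch under assumptions: PhaseTimer, Result and timePhases are hypothetical names, the file is assumed to be UTF-8, and a plain Double.parseDouble stands in for the NumberUtils.isNumber check so that the example has no external dependency.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original code.
class PhaseTimer {
    /** Nanoseconds accumulated in each phase, plus the number of values parsed. */
    static final class Result {
        long readNanos;
        long parseNanos;
        int parsed;
    }

    static Result timePhases(Path csv) throws IOException {
        Result r = new Result();
        // Assumption: the file is UTF-8 encoded.
        try (BufferedReader in = Files.newBufferedReader(csv, StandardCharsets.UTF_8)) {
            while (true) {
                long t0 = System.nanoTime();
                String line = in.readLine();
                long t1 = System.nanoTime();
                r.readNanos += t1 - t0;
                if (line == null) {
                    break;
                }
                try {
                    // Only the first field is of interest.
                    Double.parseDouble(line.split(",", 2)[0]);
                    r.parsed++;
                } catch (NumberFormatException e) {
                    // Header or other non-numeric row: skip it.
                }
                r.parseNanos += System.nanoTime() - t1;
            }
        }
        return r;
    }
}
```

If both totals come out far below the observed 15 minutes, the slowdown is more likely outside this loop (for example in the caller, or in how the file is accessed) than in any of the individual calls.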
2 Answers
As per Bobby's comment, Vector is your problem, but not for the reason he says...
Vector is a synchronized class. Each call to any method on Vector acquires a lock, forces memory synchronization, and generally wastes a lot of time (in a situation where its usage is in a single thread only).
The fact that you use Vector indicates that you are running some really old code, or that you have not properly read the JavaDoc for it.
A secondary performance problem is that each value is being converted to a Double object. When you have large amounts of data and a primitive is available, it is generally faster to use the primitive (in this case, double instead of Double).
You should also be using the Java 7 try-with-resources mechanism for your stream.
My recommendation is to change the signature of your method to return a List... actually, no, my recommendation is to return a primitive array, double[]. If you are interested in speed, this will be a significant improvement:
private static double[] readTXTFileSingle(String csvFileName) throws IOException {
    double[] csvData = new double[4096]; // arbitrary starting size.
    int dcnt = 0;
    try (BufferedReader stream = new BufferedReader(new FileReader(csvFileName))) {
        String line = null;
        while ((line = stream.readLine()) != null) {
            String[] splitted = line.split(",");
            if (!NumberUtils.isNumber(splitted[0])) {
                continue;
            }
            double dataLine = Double.parseDouble(splitted[0]);
            if (dcnt >= csvData.length) {
                // add 50% to array size.
                csvData = Arrays.copyOf(csvData, dcnt + (dcnt / 2));
            }
            csvData[dcnt++] = dataLine;
        }
    }
    return Arrays.copyOf(csvData, dcnt);
}
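If other parts of the application still require a Vector<Double>, as the question states, the primitive array can be converted once at the boundary. This is only a sketch: VectorAdapter and toVector are hypothetical names, not something from the original code.

```java
import java.util.Vector;

// Hypothetical adapter for callers that still expect Vector<Double>.
class VectorAdapter {
    /** Boxes each element once into a Vector sized up front, so it never has to grow. */
    static Vector<Double> toVector(double[] values) {
        Vector<Double> out = new Vector<>(values.length);
        for (double v : values) {
            out.add(v); // autoboxing happens here, once per element
        }
        return out;
    }
}
```

A call such as VectorAdapter.toVector(readTXTFileSingle(csvFileName)) keeps the old return type for existing callers while the reading loop itself works on primitives.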
Edit:
One other thing: if you want another tweak in performance, use:
String[] splitted = line.split(",", 2);
Since you never access more than the first field in the record, you do not need to look for commas beyond the first one.
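To illustrate what the limit argument does, here is a small sketch. FirstField, viaSplit and viaIndexOf are hypothetical names. The second method is a different technique, not the one suggested above: it cuts the line at the first comma with indexOf and so avoids allocating an array at all.

```java
// Hypothetical helper showing two ways to get the first field of a line.
class FirstField {
    /** First comma-separated field, using split with a limit of 2. */
    static String viaSplit(String line) {
        // With a limit of 2 the result has at most two elements:
        // the first field and the unsplit remainder of the line.
        return line.split(",", 2)[0];
    }

    /** Same result without an array: cut the line at the first comma, if any. */
    static String viaIndexOf(String line) {
        int comma = line.indexOf(',');
        return comma < 0 ? line : line.substring(0, comma);
    }
}
```

Both methods return the whole line when it contains no comma, and an empty string when the line starts with a comma.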
- "Vector is a synchronized class." Ah, now I understand that odd "You shouldn't use that but I can't remember why" feeling I had. – Bobby, Feb 4, 2014 at 15:32
- I doubt Vector's synchronization is responsible for a 15 minute run on an 11M file. I've tried this code (with Vector in place) on a local file and it consistently returned in under 1 second. – bowmore, Feb 5, 2014 at 7:42
- @bowmore you are right... by the way, did you run it with a pre-set vector size or did you let it grow? The OP must have something else going on – rolfl, Feb 5, 2014 at 11:20
- I've used the code as is in OP. I only supplied a very naive isNumber() implementation myself by basically calling Double.parseDouble() and catching the exception. – bowmore, Feb 5, 2014 at 12:06
- @bowmore: IsNumber() could be one of the problems, to me it looks rather heavy weight. – Bobby, Feb 5, 2014 at 15:04
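The naive isNumber() that bowmore describes could look roughly like the sketch below. NaiveNumberCheck is a hypothetical name, and this is not the Apache Commons NumberUtils implementation: it accepts whatever Double.parseDouble accepts, including "NaN", "Infinity" and surrounding whitespace.

```java
// Hypothetical stand-in for NumberUtils.isNumber, as described in the comment above.
class NaiveNumberCheck {
    /** Returns true if Double.parseDouble can parse the given text. */
    static boolean isNumber(String s) {
        if (s == null) {
            return false; // parseDouble would throw NullPointerException here
        }
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

Used as a pre-check it parses every numeric line twice, which is acceptable for a quick experiment; if the exception-based check is kept, the check and the conversion can be merged into a single parseDouble call.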
You could use the open-source library uniVocity-parsers to parse the CSV data into a Vector of doubles; the library provides excellent performance through multi-threading, caching and optimized code.
Try the following code, which uses this library:
private static Vector<Double> readTXTFileSingle(String csvFileName) throws IOException {
    CsvParser parser = new CsvParser(new CsvParserSettings());
    List<String[]> resolvedData = parser.parseAll(new FileReader(csvFileName));
    Vector<Double> csvData = new Vector<Double>();
    for (String[] row : resolvedData) {
        if (!NumberUtils.isNumber(row[0])) {
            continue;
        }
        csvData.add(Double.valueOf(row[0]));
    }
    return csvData;
}
NumberUtils.isNumber implemented?