Parsing CSV file with BufferedReader vs Scanner

Question 1

I have to parse a file line-by-line and in single line I have split by ",". The first string is the name and the second is the count.

Finally, I have to display the Key and Count, for example:

Peter,2
Smith,3
Peter,3
Smith,5

I should display as Peter 5 and Smith 8.

I confused about choosing between BufferedReader vs Scanner. I went through this link and came up with these two approaches. I would like to get your reviews.

`BufferedReader`

private HashMap<String, MutableLong> readFile(File file) throws IOException {
 final HashMap<String, MutableLong> keyHolder = new HashMap<>();
 try (BufferedReader br = new BufferedReader(new InputStreamReader(
 new FileInputStream(file), "UTF-8"))) {
 for (String line; (line = br.readLine()) != null;) {
 // processing the line.
 final String[] keyContents = line
 .split(KeyCountExam.COMMA_DELIMETER);
 if (keyContents.length == 2) {
 final String keyName = keyContents[0];
 final long count = Long.parseLong(keyContents[1]);
 final MutableLong keyCount = keyHolder.get(keyName);
 if (keyCount != null) {
 keyCount.add(count);
 keyHolder.put(keyName, keyCount);
 } else {
 keyHolder.put(keyName, new MutableLong(count));
 }
 }
 }
 }
 return keyHolder;
}
private static final String COMMA_DELIMETER = ",";
private static volatile Pattern commaPattern = Pattern.compile(COMMA_DELIMETER);

I have used MutableLong, since I don't want to create a BigInteger each time. Again, it may be a very big file and I don't have control on how max a key can occur.

`Scanner` and two delimiters

private static final String LINE_SEPARATOR_PATTERN = "\r\n|[\n\r\u2028\u2029\u0085]";
private static final String LINE_PATTERN = ".*(" + LINE_SEPARATOR_PATTERN
 + ")|.+$";
private static volatile Pattern linePattern = Pattern.compile(LINE_PATTERN);

I have went through the hasNext in Scanner, and to me, there is no harm in switching the Pattern. And I believe from Java 7, Scanner does have a limited buffer and can be enough for this kind of file.

Does anyone prefer Approach 2 over Approach 1, or do we have any other option than this? I just did sop for testing purpose. Obviously the same code in approach 1 would replace here. Using split in Approach 1 would create multiple String instances, which can be avoided here, by scanning a char sequence.

private HashMap<String, BigInteger> readFileScanner(File file)
 throws IOException {
 final HashMap<String, BigInteger> keyHolder = new HashMap<>();
 try (Scanner br = new Scanner(file, "UTF-8")) {
 while (br.hasNext()) {
 br.useDelimiter(commaPattern);
 System.out.println(br.next());
 System.out.println(br.next());
 br.useDelimiter(linePattern);
 }
 }
 return keyHolder;
 }

Question 2

I think that splitting a string on commas is the worse approach for serious CSV parsing.

There is no single standard csv format, but RFC 4180 comes close to being a de facto standard. As described in Section 2, CSV fields may be double-quoted (and if so, a literal " would be represented as a pair of " characters). Simply splitting a line on commas does not give you the ability to ever implement support for interpreting quoting.

With a Scanner, you can use tools such as useDelimiter(Pattern) and findInLine(Pattern) to simultaneously look for " and , characters, giving you a chance to write a real parser based on the ABNF grammar in RFC 4180.

Of course, unless are deliberately reinventing-the-wheel, you should just use an existing library, such as Apache Commons CSV or opencsv.

Question 3

Thanks , i may be misguided by adding CSV in the Question. It is just comma Separated and no Quotes involved. I agreed that i should avoid reinvent and use opencsv / Apache Commons. Assume i am given only to use native API . in that case if i understood correctly you prefer to use Scanner and findInLine(Pattern)- correct me if i wrong

Question 4

I particularily agree with the suggestion to use the apache commons CSV library; after all, he already is using apache commons with the MutableLong anyway.

Question 5

SuperCSV is also a good choice.

Question 6

@Mani But MutableLong is also not part of native API.

Question 7

Which iterator to use
I believe it is a matter of taste which iterator to use, though I find the first option is better because:

It is more readable - you understand what is going on at every line - your readLine() each time, and then split() it. In the second option you keep changing the delimiter which is not very intuitive or readable, and the code reader would have to be familiar with Scanner to understand what is going on.
It is less brittle - in the first option you actually check the validity of your input (that it has two elements in each line). In the second option, you can't really do that, and if you get a corrupt file, you are going to get unexpected results.

Mutable Long?
I did not follow you explanation about MutableLong, for one thing since I can't see that it even supports BigInteger... Also, when in each line you have numbers in the range of 2 and 5, how big you expect the files to be? The max long number is 9,223,372,036,854,775,807...

Magic Constants
A constant like COMMA_DELIMETER = "," is not very useful. Using the literal is readable enough, and resolves any ambiguity it might have by hiding it in a constant:

keyContents = line.split(",");

Question 8

Agreed with magic Constants. Thanks for the your view on Option 1 and Option 2. I do even prefer the first one. Only thing i worried is , readLine would create new String and Split would create two more Strings . so for a line it would create 3 Strings. If the file is huge it would be expensive spending cycle in GC. How ever will it do always new String / intern from pool ( i dont know)

Question 9

And for the Mutable Long. Same Reason. BigInteger is not Mutable, so each time when the same key found it would create another one. That why went to Mutable Long. I could have used AtomicLong but it has more than what i needed. so i choosed MutableLong from apache it just wraps the long

Question 10

Scanner will create two strings too. Anyway native long what is wrong with it? It's 64bit and you avoid object overhead and the part to check if it's null. (P.S AtomicLong is a total different thing!)

Question 11

Yes. but i cant use native long since i have to keep the count in map (any storage) . so i have to use Object wrapped by long any way

200_success 200_success 145k22 gold badges190 silver badges478 bronze badges · Answer 1 · 2014-04-06 17:56:11Z

I think that splitting a string on commas is the worse approach for serious CSV parsing.

There is no single standard csv format, but RFC 4180 comes close to being a de facto standard. As described in Section 2, CSV fields may be double-quoted (and if so, a literal " would be represented as a pair of " characters). Simply splitting a line on commas does not give you the ability to ever implement support for interpreting quoting.

With a Scanner, you can use tools such as useDelimiter(Pattern) and findInLine(Pattern) to simultaneously look for " and , characters, giving you a chance to write a real parser based on the ABNF grammar in RFC 4180.

Of course, unless are deliberately reinventing-the-wheel, you should just use an existing library, such as Apache Commons CSV or opencsv.

Thanks , i may be misguided by adding CSV in the Question. It is just comma Separated and no Quotes involved. I agreed that i should avoid reinvent and use opencsv / Apache Commons. Assume i am given only to use native API . in that case if i understood correctly you prefer to use Scanner and findInLine(Pattern)- correct me if i wrong
I particularily agree with the suggestion to use the apache commons CSV library; after all, he already is using apache commons with the MutableLong anyway.

Uri Agassi Uri Agassi 6,7051 gold badge18 silver badges48 bronze badges · Answer 2 · 2014-04-06 17:19:53Z

Which iterator to use
I believe it is a matter of taste which iterator to use, though I find the first option is better because:

It is more readable - you understand what is going on at every line - your readLine() each time, and then split() it. In the second option you keep changing the delimiter which is not very intuitive or readable, and the code reader would have to be familiar with Scanner to understand what is going on.
It is less brittle - in the first option you actually check the validity of your input (that it has two elements in each line). In the second option, you can't really do that, and if you get a corrupt file, you are going to get unexpected results.

Mutable Long?
I did not follow you explanation about MutableLong, for one thing since I can't see that it even supports BigInteger... Also, when in each line you have numbers in the range of 2 and 5, how big you expect the files to be? The max long number is 9,223,372,036,854,775,807...

Magic Constants
A constant like COMMA_DELIMETER = "," is not very useful. Using the literal is readable enough, and resolves any ambiguity it might have by hiding it in a constant:

keyContents = line.split(",");

Agreed with magic Constants. Thanks for the your view on Option 1 and Option 2. I do even prefer the first one. Only thing i worried is , readLine would create new String and Split would create two more Strings . so for a line it would create 3 Strings. If the file is huge it would be expensive spending cycle in GC. How ever will it do always new String / intern from pool ( i dont know)
And for the Mutable Long. Same Reason. BigInteger is not Mutable, so each time when the same key found it would create another one. That why went to Mutable Long. I could have used AtomicLong but it has more than what i needed. so i choosed MutableLong from apache it just wraps the long
Scanner will create two strings too. Anyway native long what is wrong with it? It's 64bit and you avoid object overhead and the part to check if it's null. (P.S AtomicLong is a total different thing!)
Yes. but i cant use native long since i have to keep the count in map (any storage) . so i have to use Object wrapped by long any way

Stack Exchange Network

Parsing CSV file with BufferedReader vs Scanner

`BufferedReader`

`Scanner` and two delimiters

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Parsing CSV file with BufferedReader vs Scanner

BufferedReader

Scanner and two delimiters

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

`BufferedReader`

`Scanner` and two delimiters