I have to parse a file line-by-line and in single line I have split by ",". The first string is the name and the second is the count.
Finally, I have to display the Key and Count, for example:
Peter,2 Smith,3 Peter,3 Smith,5
I should display as Peter 5 and Smith 8.
I confused about choosing between BufferedReader
vs Scanner
. I went through this link and came up with these two approaches. I would like to get your reviews.
BufferedReader
private HashMap<String, MutableLong> readFile(File file) throws IOException { final HashMap<String, MutableLong> keyHolder = new HashMap<>(); try (BufferedReader br = new BufferedReader(new InputStreamReader( new FileInputStream(file), "UTF-8"))) { for (String line; (line = br.readLine()) != null;) { // processing the line. final String[] keyContents = line .split(KeyCountExam.COMMA_DELIMETER); if (keyContents.length == 2) { final String keyName = keyContents[0]; final long count = Long.parseLong(keyContents[1]); final MutableLong keyCount = keyHolder.get(keyName); if (keyCount != null) { keyCount.add(count); keyHolder.put(keyName, keyCount); } else { keyHolder.put(keyName, new MutableLong(count)); } } } } return keyHolder; } private static final String COMMA_DELIMETER = ","; private static volatile Pattern commaPattern = Pattern.compile(COMMA_DELIMETER);
I have used
MutableLong
, since I don't want to create aBigInteger
each time. Again, it may be a very big file and I don't have control on how max a key can occur.Scanner
and two delimitersprivate static final String LINE_SEPARATOR_PATTERN = "\r\n|[\n\r\u2028\u2029\u0085]"; private static final String LINE_PATTERN = ".*(" + LINE_SEPARATOR_PATTERN + ")|.+$"; private static volatile Pattern linePattern = Pattern.compile(LINE_PATTERN);
I have went through the hasNext
in Scanner
, and to me, there is no harm in switching the Pattern
. And I believe from Java 7, Scanner
does have a limited buffer and can be enough for this kind of file.
Does anyone prefer Approach 2 over Approach 1, or do we have any other option than this? I just did sop for testing purpose. Obviously the same code in approach 1 would replace here. Using split
in Approach 1 would create multiple String
instances, which can be avoided here, by scanning a char
sequence.
private HashMap<String, BigInteger> readFileScanner(File file)
throws IOException {
final HashMap<String, BigInteger> keyHolder = new HashMap<>();
try (Scanner br = new Scanner(file, "UTF-8")) {
while (br.hasNext()) {
br.useDelimiter(commaPattern);
System.out.println(br.next());
System.out.println(br.next());
br.useDelimiter(linePattern);
}
}
return keyHolder;
}
2 Answers 2
I think that splitting a string on commas is the worse approach for serious CSV parsing.
There is no single standard csv format, but RFC 4180 comes close to being a de facto standard. As described in Section 2, CSV fields may be double-quoted (and if so, a literal "
would be represented as a pair of "
characters). Simply splitting a line on commas does not give you the ability to ever implement support for interpreting quoting.
With a Scanner
, you can use tools such as useDelimiter(Pattern)
and findInLine(Pattern)
to simultaneously look for "
and ,
characters, giving you a chance to write a real parser based on the ABNF grammar in RFC 4180.
Of course, unless are deliberately reinventing-the-wheel, you should just use an existing library, such as Apache Commons CSV or opencsv.
-
\$\begingroup\$ Thanks , i may be misguided by adding CSV in the Question. It is just comma Separated and no Quotes involved. I agreed that i should avoid reinvent and use opencsv / Apache Commons. Assume i am given only to use native API . in that case if i understood correctly you prefer to use Scanner and findInLine(Pattern)- correct me if i wrong \$\endgroup\$Mani– Mani2014年04月06日 18:02:51 +00:00Commented Apr 6, 2014 at 18:02
-
2\$\begingroup\$ I particularily agree with the suggestion to use the apache commons CSV library; after all, he already is using apache commons with the
MutableLong
anyway. \$\endgroup\$AJMansfield– AJMansfield2014年04月06日 18:20:05 +00:00Commented Apr 6, 2014 at 18:20 -
-
\$\begingroup\$ @Mani But MutableLong is also not part of native API. \$\endgroup\$convert– convert2022年02月09日 14:26:59 +00:00Commented Feb 9, 2022 at 14:26
Which iterator to use
I believe it is a matter of taste which iterator to use, though I find the first option is better because:
- It is more readable - you understand what is going on at every line - your
readLine()
each time, and thensplit()
it. In the second option you keep changing the delimiter which is not very intuitive or readable, and the code reader would have to be familiar withScanner
to understand what is going on. - It is less brittle - in the first option you actually check the validity of your input (that it has two elements in each line). In the second option, you can't really do that, and if you get a corrupt file, you are going to get unexpected results.
Mutable Long?
I did not follow you explanation about MutableLong
, for one thing since I can't see that it even supports BigInteger
... Also, when in each line you have numbers in the range of 2
and 5
, how big you expect the files to be? The max long
number is 9,223,372,036,854,775,807
...
Magic Constants
A constant like COMMA_DELIMETER = ","
is not very useful. Using the literal is readable enough, and resolves any ambiguity it might have by hiding it in a constant:
keyContents = line.split(",");
-
\$\begingroup\$ Agreed with magic Constants. Thanks for the your view on Option 1 and Option 2. I do even prefer the first one. Only thing i worried is , readLine would create new String and Split would create two more Strings . so for a line it would create 3 Strings. If the file is huge it would be expensive spending cycle in GC. How ever will it do always new String / intern from pool ( i dont know) \$\endgroup\$Mani– Mani2014年04月06日 17:27:02 +00:00Commented Apr 6, 2014 at 17:27
-
\$\begingroup\$ And for the Mutable Long. Same Reason. BigInteger is not Mutable, so each time when the same key found it would create another one. That why went to Mutable Long. I could have used AtomicLong but it has more than what i needed. so i choosed MutableLong from apache it just wraps the long \$\endgroup\$Mani– Mani2014年04月06日 17:30:52 +00:00Commented Apr 6, 2014 at 17:30
-
1\$\begingroup\$ Scanner will create two strings too. Anyway native long what is wrong with it? It's 64bit and you avoid object overhead and the part to check if it's null. (P.S AtomicLong is a total different thing!) \$\endgroup\$Marco Acierno– Marco Acierno2014年04月06日 17:38:04 +00:00Commented Apr 6, 2014 at 17:38
-
\$\begingroup\$ Yes. but i cant use native long since i have to keep the count in map (any storage) . so i have to use Object wrapped by long any way \$\endgroup\$Mani– Mani2014年04月06日 17:41:41 +00:00Commented Apr 6, 2014 at 17:41
Explore related questions
See similar questions with these tags.