Please take a look at my method for parsing a CSV string.
I am looking for a simple approach without using an external library.
Is throwing a RuntimeException
inside the stream appropriate?
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
public class MyCsvParser {
public static String[][] parse(final String str, final int cols) {
List<String[]> list = new ArrayList<>();
str.lines().forEach(l -> {
String[] split = l.split(",");
if (split.length != cols) {
throw new RuntimeException();
}
list.add(split);
});
return list.toArray(new String[0][]);
}
public static void main(String[] args) {
System.out.println(Arrays.deepToString(parse("bla,blaa\nblub,blubb", 2)));
}
}
4 Answers 4
Side-effects
Stream.forEach()
operation should be utilized with care since it operates via side-effects and should not be used as a substitution of a proper reduction operation.
The way you've written this stream is discouraged by the Stream API documentation because it makes code more cluttered and difficult to follow and more importantly your solution is broken with parallel streams (there should be no assumptions on the nature of the stream in your code).
In this particular case, you should be using Stream.toArray()
operation instead.
Multiline lambdas
Try to void them. They bear a lot of cognitive load. If a lambda expression requires several lines or when you have a complex single-line lambda (e.g. with a nested stream in it), consider introducing a method.
Exceptions
In short, the purpose of exceptions is to indicate cases when it's not possible to proceed with the normal execution flow.
If you stumbled on a corrupt piece of data which violates invariant that are important for your business logic, usually you don't want to proceed processing it. That's a valid case to throw. You asked can you "throw from a stream"? Sure, it's just a means iteration.
I've seen some bickering over whether it's appropriate to use exceptions for the purpose of validation. Sure it is, we do employ Exceptions for Validation for decades.
Unless you're using exceptions to avoid conditional logic, or to make weird hacks like throwing in order to break from a recursive method, and you have a genuine invalid piece of data on your hands you can and should throw.
Another, important note: exceptions should be informative. If standard exception types can describe the case at hand, fine, if not introduce your own exception type.
Also, use proper exception messages that will be helpful in investigating the issue.
Static routines
Don't treat everything as util classes, use the Power of object-orientation to make more clean, cohesive and testable.
Refactored version
public class ArrayParser {
private final String separator;
private final int columnCount;
public ArrayParser(String separator, int columnCount) {
this.separator = separator;
this.columnCount = columnCount;
}
public String[][] parse(final String str) {
return str.lines()
.map(this::parseLine)
.toArray(String[][]::new);
}
private String[] parseLine(String toParse) {
String[] line = toParse.split(separator);
validateLine(line);
return line;
}
private void validateLine(String[] line) {
if (line.length != columnCount) {
throw new LineParsingException(line, columnCount);
}
}
}
Exception example:
private class LineParsingException extends RuntimeException {
private static final String MESSAGE_TEMPLATE = """
The actual number of columns in the line
%s
doesn't match the expected number of columns %d""";
public LineParsingException(String[] line, int columnsExpected) {
super(MESSAGE_TEMPLATE.formatted(Arrays.toString(line), columnsExpected));
}
}
-
1\$\begingroup\$ @tgrothe
ParseException
has a couple of downsides: it's a checked exception and it's very general. It's common practice to avoid checked exceptions while describing cases of abnormal behavior in your domain space because they lead to leaky abstractions and pollute your code. Some people here might disagree, but it's a general consensus is that introduction of checked exceptions was an unsuccessful experiment in language modeling (that's the opinion expressed by Java Champions and established Java community members). \$\endgroup\$Alexander Ivanchenko– Alexander Ivanchenko2024年09月15日 11:25:36 +00:00Commented Sep 15, 2024 at 11:25 -
1\$\begingroup\$ @tgrothe
columnLength
this name is misleading, it might suggest a column is expected to contain a certain number of characters, or something along these lines. Something likecolumnCount
will do a better job at communicating that the purpose of the field is to describe the expected number columns. \$\endgroup\$Alexander Ivanchenko– Alexander Ivanchenko2024年09月15日 11:32:19 +00:00Commented Sep 15, 2024 at 11:32 -
2\$\begingroup\$
ArrayParser
=> Ach Du Lieber! What such a anti-PoOO class name. The eponymous method is tolerable only because it a constructor. Double-plus good for having one. From the last few years of CodeReview reviewing I thought constructors were banned. PoOO is merely a way to quality code. The WAY is shut without constructors. P.S. plus one. \$\endgroup\$radarbob– radarbob2024年09月15日 20:37:44 +00:00Commented Sep 15, 2024 at 20:37 -
1\$\begingroup\$ @radarbob And if an instance of a class requires external data for initialization, it should be provided explicitly through a constructor (not via setters). Because it's a responsibility of constructors to produce a fully initialized class instance. \$\endgroup\$Alexander Ivanchenko– Alexander Ivanchenko2024年09月18日 02:14:09 +00:00Commented Sep 18, 2024 at 2:14
-
1\$\begingroup\$ provided explicitly through a constructor. Yes. I always imagine some coder who is not me using that class. If you're not constructor-ing in these simple cases what will happen when it's a hierarchal composition of a dozen related classes. Makes all the effort of evolving beyond assembler seem like a waste. Well, I still have my IBM 360 assembler coding card just in case. \$\endgroup\$radarbob– radarbob2024年09月18日 07:51:16 +00:00Commented Sep 18, 2024 at 7:51
conservative design
Since this is billed as "a CSV parser", a caller may reasonably
believe they could send it any *.csv
file produced by Excel.
Better to advertise it as MyRestrictedCsvParser.
The /** javadoc */ comments should explain the restrictions.
- Each field may or may not be enclosed in double quotes
This library should probably throw a fatal error upon
encountering an ASCII 34 "
double quote anywhere in an input line.
Then a caller would not accidentally consume a data file in the
belief that it had been parsed one way when in fact the library
parsed it another way.
That is, part of scoping down requirements is
reducing the space of inputs you're willing to claim you successfully processed.
informative diagnostic
Throwing an unchecked exception within the JVM is great. It makes your library easier for callers to consume.
throw new RuntimeException();
This is not a very diagnostic error. It needs two improvements:
- Subclass RuntimeException to create a library-specific error, perhaps CsvParseException.
- Mention the values of
split.length
andcols
in the message, to save a maintenance engineer a little effort in diagnosing and repairing buggy inputs.
Consider keeping track of which line number we're on, so that can be included in the diagnostic message.
A caller should not be forced to catch a generic RuntimeException to recover from an error it knows how to deal with. We define new app-specific exception types to permit fine-grained catching. Lumping "wrong column count", "found a quote", and "zero lines" together would be acceptable, at least until you see how callers actually behave. If it turns out that callers really do wish to distinguish between those errors, then a v2 library release could always offer finer granularity on the error types.
signature
Clearly the OP code works.
It seems slightly less convenient for the caller than it might be.
There is redundant information encoded in the str
and cols
parameters.
Consider setting cols
based on number of fields found in the first line of input.
-
1\$\begingroup\$ It is true, I assumed strings without double quotes at all. This deliberately contradicts point 5 of the standard or specification, and should be mentioned in a Java doc. \$\endgroup\$Tobias Grothe– Tobias Grothe2024年09月16日 08:48:48 +00:00Commented Sep 16, 2024 at 8:48
It is fine to raise runtime exception, but it is more helpful to add a specific exception message. I would suggest small simplification removing ArrayList
, replacing the forEach
with map
, and stream directly to array:
public static String[][] parse(final String str, final int cols) {
return str.lines().map(line -> line.split(","))
.map(sa -> {
if (sa.length != cols) {
throw new IllegalArgumentException("Mismatched column size: "+Arrays.deepToString(sa));
}
return sa;
}).toArray(String[][]::new);
}
Adapting the above to return Stream<String[]>
as suggested by user286335 will make memory use easier if you read huge input files, as you can process while reading rather than afterwards.
You could also parse without pre-defining the expected column count by addition of a groupingBy
collector, and raise exception when detecting inconsistent number of columns across all rows:
public static String[][] parse(final String str) {
Map<Integer, List<String[]>> map = str.lines().map(line -> line.split(","))
.collect(Collectors.groupingBy(sa -> Integer.valueOf(sa.length)));
// Could handle empty data here, or treat as error as below
// if (map.size() == 0) return new String[0][];
if (map.size() != 1)
throw new IllegalArgumentException("No data or mismatched row sizes: "+map.keySet());
// Return the first (and only) entry, all rows have same number of columns:
return map.values().iterator().next().toArray(new String[0][]);
}
-
1\$\begingroup\$ @AlexanderIvanchenko Good idea, I overlooked the default groupingBy behaviour, amended \$\endgroup\$DuncG– DuncG2024年09月21日 07:43:39 +00:00Commented Sep 21, 2024 at 7:43
Without throwing an exception for a different number of columns than the give cols
bound the code could get less complex to...
public static String[][] parseIt(final String str, final int cols) {
return str.lines()
.map(line -> line.split(","))
.filter(columns -> columns.length == cols)
.toArray(String[][]::new);
}
...if ignoring the incomplete or exceeded number of columns is an option, otherwise adjust the filtering to throw an exception in case of unfitted number of columns.
Returning an array of arrays of String
bounds the rest of the code using the method to convert the String[][]
into an Iterator
or a Stream
to go over its content, returning a Stream
could improve the overall complexity and reduce the code lines of parse
method...
public static Stream<String[]> parse(final String str, final int cols) {
return str.lines()
.map(line -> line.split(","))
.filter(columns -> columns.length == cols);
}
-
4\$\begingroup\$ "Without throwing an exception for a different number of columns than the give cols bound the code could get less complex" - you conflate altering the requirements and reducing code complexity. \$\endgroup\$Alexander Ivanchenko– Alexander Ivanchenko2024年09月15日 14:43:25 +00:00Commented Sep 15, 2024 at 14:43
-
2\$\begingroup\$ Not giving any indication that the given data is corrupt and silently processing the only records with the valid number of columns is a different problem from the one that OP's code solves. \$\endgroup\$Alexander Ivanchenko– Alexander Ivanchenko2024年09月15日 14:48:36 +00:00Commented Sep 15, 2024 at 14:48
-
\$\begingroup\$ But ... returning a
Stream<String[]>
instead of the entire string array is a valid point. Even if, strictly spoken, this has changed the requirements or problem (also) a bit. \$\endgroup\$Tobias Grothe– Tobias Grothe2024年09月15日 16:13:03 +00:00Commented Sep 15, 2024 at 16:13
Explore related questions
See similar questions with these tags.
RFC 4180
? In which case this code is obviously simplistic and can't parse a CSV. \$\endgroup\$"bla, \"bl,ah\""
and see why rolling your own CSV parser is almost always a mistake. Also, not all CSV files use a comma as the separator, many non-english countries use a semi-colon. \$\endgroup\$