Reading and parsing CSV files

Question 1

This is my first real attempt at a Scala program. I come from a predominantly Java background, so I'd like to know if the program sticks to Scala conventions well.

Is it well readable or should it be formulated differently? To me, there is a lot of lines in the main function which doesn't quite bode right.

It is functioning and I can provide my JUnit tests to prove it. Not all tests pass as those dealing with whitespace are now different.

I didn't want to use any parsing library or support as that doesn't help me learn the core of the language.

package com.wesleyacheson
import scala.io.Source
import scala.annotation.tailrec
class CSVReader2(source: Source) {
 val QuoteChar = '"'
 val LF = '\n'
 val CR = '\r'
 val Separator = ','
 def readAll(lines:List[List[String]] = Nil):List[List[String]] = {
 if (source.hasNext) {
 return readAll(readLine()::lines)
 }
 return lines.reverse;
 }
 @tailrec final def readUntilQuote(buffered: AnyRef with BufferedIterator[Char], partial:String = ""):String = {
 val char = buffered.next()
 val next = if(buffered.hasNext) buffered.head else ""
 (char, next) match {
 case (QuoteChar, QuoteChar) => readUntilQuote(buffered, partial + buffered.next()) //We've read both charachters so skip
 case (QuoteChar, _) => partial
 case _ => readUntilQuote(buffered, partial + char)
 }
 }
 def readLine():List[String] = {
 val buffered = source.buffered
 @tailrec def readLine(tokens:List[String] , partialToken:String):List[String] = {
 val finished = {() => (partialToken::tokens).reverse}
 if (!buffered.hasNext) return finished()
 val char=buffered.next()
 val subsequentChar = if (buffered.hasNext) buffered.head
 (char, subsequentChar) match {
 case (QuoteChar, x) if x != QuoteChar =>
 readLine(tokens, partialToken + readUntilQuote(buffered))
 case (QuoteChar, QuoteChar) =>
 buffered.next();
 readLine(tokens, partialToken + QuoteChar)
 case (Separator, _) =>
 readLine(partialToken::tokens, "")
 case (CR, LF) =>
 buffered.next(); finished()
 case (CR, _) =>
 finished()
 case (LF, _) =>
 finished()
 case _ =>
 readLine(tokens, partialToken + char)
 }
 }
 readLine(Nil, "");
 }
}

If interested, the unit tests are here:

package com.wesleyacheson
import org.junit._
import Assert._
import scala.io.Source
class CsvReader2Test {
 val simpleSource = Source.fromString("""abc,def,ghi
 jkl, zzz""");
 val quotedFields = Source.fromString("foo bar, \"foo bar\"");
 @Test def VerifyReaderPassedMustNotBeNull() {
 val source: Source = null;
 try {
 new CSVReader(source)
 fail("Should have thrown exception")
 } catch {
 case e: IllegalArgumentException => //Expected
 }
 }
 @Test def VerifyReadAllReturnsAStringList() {
 assertTrue("Expected a List[String]", new CSVReader2(simpleSource).readAll().isInstanceOf[List[String]]);
 }
 @Test def VerifyNumberOfReadLinesAreCorrect() {
 assertEquals(2, new CSVReader2(simpleSource).readAll().size)
 }
 @Test def VerifyFirstLineContainsExpectedValues() {
 val firstLine = new CSVReader2(simpleSource).readAll().head
 println(firstLine)
 assertEquals("Checking the number of tokens in the first line", 3, firstLine.size)
 assertTrue("Checking that the line contains " + "abc", firstLine.contains("abc"))
 assertTrue("Checking that the line contains " + "def", firstLine.contains("def"))
 assertTrue("Checking that the line contains " + "ghi", firstLine.contains("ghi"))
 assertFalse("Checking that the line does not contain" + "jkl", firstLine.contains("jkl"))
 }
 @Test def verifyBothQuotedFieldsAreTheSame() {
 val line = new CSVReader2(quotedFields).readAll().head;
 assertEquals("Checking that the number of tokens in the first line", 2, line.size);
 assertEquals("foo bar", line(0));
 assertEquals("foo bar", line(1));
 }
 @Test def verifySimpleQuotedValueIsUnchanged {
 println(new CSVReader2(Source.fromString("\"foo bar\",")).readLine()(0))
 assertEquals("foo bar", new CSVReader2(Source.fromString("\"foo bar\",")).readLine()(0))
 }
 @Test def verifyDoubleQuotesAreConvertedToQuote {
 assertEquals("\"foo bar\"", new CSVReader2(Source.fromString("\"\"foo bar\"\",")).readLine().head)
 }
 @Test def verify3QuotesAreTreatedAsDoubleQuotesWithinSection {
 assertEquals("\"foo bar\"", new CSVReader2(Source.fromString("\"\"\"foo bar\"\"\",")).readLine().head)
 }
 @Test def verifyQuotedCommasAreReturned {
 assertEquals(",", new CSVReader2(Source.fromString("\",\",")).readLine().head)
 }
 @Test def verifyLeadingWhiteSpaceIsRemoved {
 assertEquals("abc def", new CSVReader2(Source.fromString(" abc def")).readLine().head)
 }
 @Test def verifyTailingWhiteSpaceIsRemoved {
 assertEquals("1", new CSVReader2(Source.fromString("1 ")).readLine().head)
 }
 @Test def verifyQuotedLeadingWhitespaceIsPreserved {
 assertEquals(" 1", new CSVReader2(Source.fromString("\" 1\"")).readLine().head)
 }
 @Test def verifyQuotedTailingWhitespaceIsPreserved {
 assertEquals("1 ", new CSVReader2(Source.fromString("\"1 \"")).readLine().head)
 }
 @Test def verifyQuotedBlankLinesArePreserved{
 assertEquals("first\n\rsecond", new CSVReader2(Source.fromString("\"first\n\rsecond\"")).readLine().head)
 }
 @Test def verifyCellsAreInTheRightOrder{
 val returned = new CSVReader2(Source.fromString("first,second")).readLine()
 assertEquals("first", returned(0));
 assertEquals("second", returned(1));
 }
 @Test def verifyRowsAreInTheRightOrder{
 val returned = new CSVReader2(Source.fromString("first\nsecond")).readAll()
 assertEquals("first", returned(0)(0));
 assertEquals("second", returned(1)(0));
 }
 @Test def verifyNewLineCarriageReturnIsOnlyTreatedAsOneBlankLine{
 val returned = new CSVReader2(Source.fromString("first\n\rsecond")).readAll()
 assertEquals("first", returned(0)(0));
 assertEquals("second", returned(1)(0));
 }
 @Test def verifyReadsUntilFirstQuote {
 assertEquals("abc\"defg", new CSVReader2(Source.fromString("")).readUntilQuote(Source.fromString("abc\"\"defg\"hi\"jklmnop").buffered ))
 }
 @Test def verifyThrowsExceptionIfNoQuoteFound {
 try {
 new CSVReader2(Source.fromString("")).readUntilQuote(Source.fromString("a").buffered)
 fail()
 } catch {
 case e => //expected
 }
 }
 }

What I see when I look at this coming from a Java viewpoint. String concatenation isn't usually done with +. However I don't see how to use a string buffer and keep it semi functional.

The program isn't a functional program. I don't think this could have been avoided using a source. Pure functional programming would have meant that I'd have to make concessions like converting the entire input to a string outside and passing that in which isn't practical.

I don't like the argument-less anonymous function but I don't know what else to do for it.

val finished = {() => (partialToken::tokens).reverse}

I've been told that a lazy val may be more appropriate for this, and I tend to agree.

It may be better for extendability if I returned a list of token objects (or is this my Java head interfering?)

I wonder if any traits could be mixed in to make it more rich. If there was a trait dealing with 2-dimension tabular data for instance.

The buffered iterator was a bit of a cheat. My original was far longer until I added that. Feels like maybe I've skipped a bit of learning for the sake of convenience.

Question 2

@Jamal I can't agree with the tag edits. This may be a case of reinventing the wheel but the purpose of it is learning. The focus of the question isn't about how to prevent that. Similarly for csv tag, and the unit-testing. The question isn't about that. The purpose isn't to read a CSV. The tests were only in the question to prove the criteria of code review. The body changes are fine, but I'd like to revert the title and tags.

Question 3

Then you may make these changes.

Question 4

Here are a few things, going from smallest to the more significant.

Try to avoid using return. It isn't idiomatic, and for good reason. So instead of writing if (cond) return a; return b you should prefer if (cond) a else b
Unless I'm missing anything, in readUntilQuote's signature buffered should be of type BufferedIterator[Char] instead of AnyRef with BufferedIterator[Char]. The AnyRef there doesn't do anything useful.
You can use Source and still have functional code. Specifically, try looking at Source.getLines(). It returns an Iterator[String] that is lazily-evaluated but can still be used functionally almost like other collections (you can map, fold, filter, etc.)

Question 5

Below you can see that I've rewritten a bit of your code. For readabilities sake my first suggestion is to attempt to make your function definitions and their parameter lists consistent with the Scala Style Guide. Next, I would also move the @tailrec annotation one line above the function for which it is intended. Lastly, I would rename your inner function such that people reading your code can quickly see that it is an inner 'Helper' function.

 def readLine(): List[String] = {
 // ...
 @tailrec
 def readLineHelper(tokens: List[String], partialToken: String): List[String] = {
 // function body
 }
 readLineHelper(Nil, "")
 }

You also mentioned that you didn't like your finished function. It appears to me that you can improve its implementation in one of two ways. At first glance it seems that readLineHelpers parameters partialToken and tokens are never altered throughout the course of the function, in other words they are immutable values. If this is the case than finished can be declared as a reference to data:

 val finished = (partialToken :: tokens) reverse

If, however, there is some state change to partialToken or tokens then you will need to declare finished as you originally did, as a function value but with the following syntax:

 val finished = () => (partialToken :: tokens) reverse

Tal Pressman Tal Pressman 1212 bronze badges · Answer 1 · 2013-10-03 08:46:07Z

Here are a few things, going from smallest to the more significant.

Try to avoid using return. It isn't idiomatic, and for good reason. So instead of writing if (cond) return a; return b you should prefer if (cond) a else b
Unless I'm missing anything, in readUntilQuote's signature buffered should be of type BufferedIterator[Char] instead of AnyRef with BufferedIterator[Char]. The AnyRef there doesn't do anything useful.
You can use Source and still have functional code. Specifically, try looking at Source.getLines(). It returns an Iterator[String] that is lazily-evaluated but can still be used functionally almost like other collections (you can map, fold, filter, etc.)

t. fochtman t. fochtman 1,5978 silver badges5 bronze badges · Answer 2 · 2014-08-14 23:49:00Z

Below you can see that I've rewritten a bit of your code. For readabilities sake my first suggestion is to attempt to make your function definitions and their parameter lists consistent with the Scala Style Guide. Next, I would also move the @tailrec annotation one line above the function for which it is intended. Lastly, I would rename your inner function such that people reading your code can quickly see that it is an inner 'Helper' function.

 def readLine(): List[String] = {
 // ...
 @tailrec
 def readLineHelper(tokens: List[String], partialToken: String): List[String] = {
 // function body
 }
 readLineHelper(Nil, "")
 }

You also mentioned that you didn't like your finished function. It appears to me that you can improve its implementation in one of two ways. At first glance it seems that readLineHelpers parameters partialToken and tokens are never altered throughout the course of the function, in other words they are immutable values. If this is the case than finished can be declared as a reference to data:

 val finished = (partialToken :: tokens) reverse

If, however, there is some state change to partialToken or tokens then you will need to declare finished as you originally did, as a function value but with the following syntax:

 val finished = () => (partialToken :: tokens) reverse

Stack Exchange Network

Reading and parsing CSV files

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Reading and parsing CSV files

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions