CSV File Parser in Ruby - Attempt No. 2

Question 1

The following is my second attempt at a basic CSV parser in Ruby. I've addressed suggestions that I've got for my previous version here.

I have omitted all test cases - the code listing would be too long otherwise. You can look at the full code-base at GitHub if you want to.

As opposed to the last version, the current solution supports:

multiple different value delimiters,
values surrounded by quotes,
delimiters as a part of quoted values,
quotes as a part of quoted values.

Please note that my intentions are not to implement a full-featured CSV parser, but to improve my Ruby programming skills. I'll appreciate any suggestions.

csv_parser.rb

require_relative 'line_parser'
# An exception to report that the given delimiter is not supported.
#
class UnsupportedDelimiterException < Exception
 attr_reader :delimiter
 def initialize delimiter
 @delimiter = delimiter
 end
 def to_s
 "Unsupported delimiter [#{@delimiter}]."
 end
end
# CSV file parser.
#
# USAGE:
# CsvFile.new(';').parse('data.csv').each { |row| puts row.firstname }
#
class CsvFile
 SUPPORTED_DELIMITERS = [ ",", "|", ";", " ", " "]
 DEFAULT_DELIMITER = ","
 # Initializes the parser.
 # 
 # * *Args* :
 # - +delimiter+ -> The character to be used as delimiter. Defaults to
 # DEFAULT_DELIMITER. Must be one of SUPPORTED_DELIMITERS, otherwise
 # an UnsupportedDelimiterException is raised.
 #
 def initialize delimiter = DEFAULT_DELIMITER
 if not SUPPORTED_DELIMITERS.include? delimiter
 raise UnsupportedDelimiterException.new delimiter
 end
 @line_parser = LineParser.new(Lexer.new delimiter)
 end
 # Parses the given CSV file and returns the result as an array of rows.
 #
 def parse file
 rows = []
 headers = @line_parser.parse file.gets.chomp
 file.each do |line|
 values = {}
 headers.zip(@line_parser.parse line.chomp).each do |key, value|
 values[key] = value
 end
 rows << CsvRow.new(values)
 end
 rows
 end
end
# CSV row.
#
class CsvRow
 # Creates a new CSV row with the given values.
 # 
 # * *Args* :
 # - +values+ -> a hash containing the column -> value mapping
 #
 def initialize values
 @values = values
 end
 # Returns the value in the column given as method name, or null if
 # no such value exists.
 #
 def method_missing name, *args 
 @values[name.to_s]
 end
end

line_parser.rb

require_relative 'lexer'
class ParseError < RuntimeError
end
# CSV line parser.
#
class LineParser
 # Initializes the parser with the given lexer instance.
 #
 def initialize lexer
 @lexer = lexer
 end
 # Parses the given CSV line into a collection of values.
 # 
 def parse line
 values = []
 last_seen_identifier = false
 tokens = @lexer.tokenize line
 tokens.each do |token|
 case token
 when EOFToken
 if not last_seen_identifier
 values << ""
 end
 break
 when DelimiterToken
 if not last_seen_identifier
 values << ""
 next
 else
 last_seen_identifier = false
 end
 when IdentifierToken
 if last_seen_identifier
 raise ParseError, "Unexpected identifier - a delimiter was expected."
 end
 last_seen_identifier = true
 values << token.lexem
 end
 end
 values
 end
end

lexer.rb

require_relative 'assertions'
class LexicalError < RuntimeError
end
class Token
end
class EOFToken < Token
 def to_s
 "EOF"
 end
end
class DelimiterToken < Token
 attr_reader :lexem
 def initialize lexem
 @lexem = lexem
 end
 def to_s
 "DELIMITER(#{@lexem})"
 end
end
class IdentifierToken < Token
 attr_reader :lexem
 def initialize lexem
 @lexem = lexem
 end
 def to_s
 "IDENTIFIER(#{@lexem})"
 end
end
# CSV line lexical analyzer.
#
class Lexer
 # Initialzes the lexer.
 # 
 # * *Args* :
 # - +delimiter+ -> The character to be used as delimiter.
 #
 def initialize delimiter
 @delimiter = delimiter
 end
 # Breaks the given CSV line into a sequence of tokens.
 #
 def tokenize stream
 stream = stream.chars.to_a
 tokens = []
 while true
 tokens << next_token(stream)
 if tokens.last.is_a? EOFToken
 break
 end
 end
 tokens
 end
 private
 def next_token stream
 char = stream.shift
 case char 
 when eof
 EOFToken.new
 when delimiter
 DelimiterToken.new delimiter
 when quotes
 stream.unshift char
 get_quoted_identifier stream
 else
 stream.unshift char
 get_unquoted_identifier stream
 end
 end
 def get_unquoted_identifier stream
 lexem = ""
 while true do
 char = stream.shift
 case char
 when delimiter
 stream.unshift char
 return IdentifierToken.new lexem
 when eof
 return IdentifierToken.new lexem
 else
 lexem << char
 end
 end
 end
 def get_quoted_identifier stream
 char = stream.shift
 assert { char == quotes }
 lexem = ""
 while true do
 char = stream.shift
 case char
 when eof
 raise LexicalError, "Unexpected EOF within a quoted string."
 when quotes
 if stream.first == quotes
 lexem << quotes
 stream.shift
 else
 return IdentifierToken.new lexem
 end
 else
 lexem << char
 end
 end
 end
 def eof
 nil
 end
 def delimiter
 @delimiter
 end
 def quotes
 '"'
 end
end

assertions.rb

# An exception to represent assertion violations.
#
class AssertionError < RuntimeError
end
# Evaluates the given block as a boolean expression and throws an AssertionError
# in case it results to false.
#
def assert &block
 raise AssertionError unless yield
end

Question 2

if not

It is very 'unruby' to use if not. Ruby has a special unless just for these cases:

 unless last_seen_identifier
 values << ""
 end
 break

You can even shorten this further by using the one-liner shorthand:

values << "" unless last_seen_identifier
break

One more general guideline - although you can have an unless .. else .. end construct, it is not recommended. Better to reverse the conditional:

# not good
if not last_seen_identifier
 values << ""
 next
else
 last_seen_identifier = false
end
# not better
unless last_seen_identifier
 values << ""
 next
else
 last_seen_identifier = false
end
# better
if last_seen_identifier
 last_seen_identifier = false
else
 values << ""
 next
end
# best
values << "" unless last_seen_identifier
last_seen_identifier = false

Too Many Classes

Contrary to languages like Java or C#, Ruby is not statically typed, but rather duck typed, which means that you can pass any type anywhere, and the objects don't have to inherit from the same base class.

For example, the class Token adds no functionality, and it is not used anywhere in the code besides being inherited by a bunch of classes. It is totally redundant, and deleting it will not affect the code.

Furthermore, EOFToken and DelimiterToken don't contain any intrinsic functionality, and according to ruby's idioms, a better solution would be to simply pass symbols (:eof and :delimiter) rather than classes.

The last token (IdentifierToken) holds the current value, but you can even skip that, simply passing the value itself instead.

This will make the token iteration look like this:

tokens.each do |token|
 case token
 when :eof
 values << "" unless last_seen_identifier
 break
 when :delimiter
 values << "" unless last_seen_identifier
 last_seen_identifier = false
 else
 if last_seen_identifier
 raise ParseError, "Unexpected identifier - a delimiter was expected."
 end
 last_seen_identifier = true
 values << token
 end
end

Assertions?

Assertions are used by some languages to make some development checks where a condition is assumed to be always true. This methodology is not longer en-vogue in most places, and even where it is - it is also disabled in production:

Most languages allow assertions to be enabled or disabled globally, and sometimes independently. Assertions are often enabled during development and disabled during final testing and on release to the customer. Not checking assertions avoids the cost of evaluating the assertions while (assuming the assertions are free of side effects) still producing the same result under normal conditions. Under abnormal conditions, disabling assertion checking can mean that a program that would have aborted will continue to run. This is sometimes preferable.

In your code the need for the assertion seems a bit artificial, since it checks that the first character is exactly the character pushed back by the caller:

...
 when quotes
 stream.unshift char
 get_quoted_identifier stream
...
char = stream.shift
assert { char == quotes }

that being the case, I would argue that you can remove the unshift/shift tango, and simply assume that the first char is quotes:

...
 when quotes
 get_quoted_identifier stream
...
char = quotes

Question 3

Your UnsupportedDelimiterException class is very clean. I particularly like that you over rode the to_s method so that it returns something meaningful. It's a little detail that often gets over looked. Well done.

That said, I'm not sure what you're really getting from the ParseError class.

class ParseError < RuntimeError
end

Yes, it's nice to get a parse error rather than a runtime error, but that's all you're gaining. It seems a little silly to inherit from a class and then have the exact same functionality.

Question 4

I disagree. It is a very good idea to have separate error classes. It allows you to rescue parse errors only. If you didn't do this, you'd have to rescue all StandardErrors, check the error message and re-raise the exception if you cannot handle it. It's not about functionality in this case, but about providing context and semantics.

Question 5

That's a very valid point @britishtea. Feel free to downvote my answer if you think it's bad advice.

Question 6

This is indeed bad advice. If someone would be using this code and receive only RuntimeError it would be harder to find what the problem is. I understand your point and would agree if would have been something like class SpecificParseError < ParseError that would maybe not have helped, but giving context to the caller is important.

Uri Agassi Uri Agassi 6,7051 gold badge18 silver badges48 bronze badges · Accepted Answer · 2014-11-29 10:49:43Z

if not

It is very 'unruby' to use if not. Ruby has a special unless just for these cases:

 unless last_seen_identifier
 values << ""
 end
 break

You can even shorten this further by using the one-liner shorthand:

values << "" unless last_seen_identifier
break

One more general guideline - although you can have an unless .. else .. end construct, it is not recommended. Better to reverse the conditional:

# not good
if not last_seen_identifier
 values << ""
 next
else
 last_seen_identifier = false
end
# not better
unless last_seen_identifier
 values << ""
 next
else
 last_seen_identifier = false
end
# better
if last_seen_identifier
 last_seen_identifier = false
else
 values << ""
 next
end
# best
values << "" unless last_seen_identifier
last_seen_identifier = false

Too Many Classes

Contrary to languages like Java or C#, Ruby is not statically typed, but rather duck typed, which means that you can pass any type anywhere, and the objects don't have to inherit from the same base class.

For example, the class Token adds no functionality, and it is not used anywhere in the code besides being inherited by a bunch of classes. It is totally redundant, and deleting it will not affect the code.

Furthermore, EOFToken and DelimiterToken don't contain any intrinsic functionality, and according to ruby's idioms, a better solution would be to simply pass symbols (:eof and :delimiter) rather than classes.

The last token (IdentifierToken) holds the current value, but you can even skip that, simply passing the value itself instead.

This will make the token iteration look like this:

tokens.each do |token|
 case token
 when :eof
 values << "" unless last_seen_identifier
 break
 when :delimiter
 values << "" unless last_seen_identifier
 last_seen_identifier = false
 else
 if last_seen_identifier
 raise ParseError, "Unexpected identifier - a delimiter was expected."
 end
 last_seen_identifier = true
 values << token
 end
end

Assertions?

Assertions are used by some languages to make some development checks where a condition is assumed to be always true. This methodology is not longer en-vogue in most places, and even where it is - it is also disabled in production:

Most languages allow assertions to be enabled or disabled globally, and sometimes independently. Assertions are often enabled during development and disabled during final testing and on release to the customer. Not checking assertions avoids the cost of evaluating the assertions while (assuming the assertions are free of side effects) still producing the same result under normal conditions. Under abnormal conditions, disabling assertion checking can mean that a program that would have aborted will continue to run. This is sometimes preferable.

In your code the need for the assertion seems a bit artificial, since it checks that the first character is exactly the character pushed back by the caller:

...
 when quotes
 stream.unshift char
 get_quoted_identifier stream
...
char = stream.shift
assert { char == quotes }

that being the case, I would argue that you can remove the unshift/shift tango, and simply assume that the first char is quotes:

...
 when quotes
 get_quoted_identifier stream
...
char = quotes

Stack Exchange Network

CSV File Parser in Ruby - Attempt No. 2

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

CSV File Parser in Ruby - Attempt No. 2

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions