The following is my second attempt at a basic CSV parser in Ruby. I've addressed suggestions that I've got for my previous version here.
I have omitted all test cases - the code listing would be too long otherwise. You can look at the full code-base at GitHub if you want to.
As opposed to the last version, the current solution supports:
- multiple different value delimiters,
- values surrounded by quotes,
- delimiters as a part of quoted values,
- quotes as a part of quoted values.
Please note that my intentions are not to implement a full-featured CSV parser, but to improve my Ruby programming skills. I'll appreciate any suggestions.
csv_parser.rb
require_relative 'line_parser'
# An exception to report that the given delimiter is not supported.
#
class UnsupportedDelimiterException < Exception
attr_reader :delimiter
def initialize delimiter
@delimiter = delimiter
end
def to_s
"Unsupported delimiter [#{@delimiter}]."
end
end
# CSV file parser.
#
# USAGE:
# CsvFile.new(';').parse('data.csv').each { |row| puts row.firstname }
#
class CsvFile
SUPPORTED_DELIMITERS = [ ",", "|", ";", " ", " "]
DEFAULT_DELIMITER = ","
# Initializes the parser.
#
# * *Args* :
# - +delimiter+ -> The character to be used as delimiter. Defaults to
# DEFAULT_DELIMITER. Must be one of SUPPORTED_DELIMITERS, otherwise
# an UnsupportedDelimiterException is raised.
#
def initialize delimiter = DEFAULT_DELIMITER
if not SUPPORTED_DELIMITERS.include? delimiter
raise UnsupportedDelimiterException.new delimiter
end
@line_parser = LineParser.new(Lexer.new delimiter)
end
# Parses the given CSV file and returns the result as an array of rows.
#
def parse file
rows = []
headers = @line_parser.parse file.gets.chomp
file.each do |line|
values = {}
headers.zip(@line_parser.parse line.chomp).each do |key, value|
values[key] = value
end
rows << CsvRow.new(values)
end
rows
end
end
# CSV row.
#
class CsvRow
# Creates a new CSV row with the given values.
#
# * *Args* :
# - +values+ -> a hash containing the column -> value mapping
#
def initialize values
@values = values
end
# Returns the value in the column given as method name, or null if
# no such value exists.
#
def method_missing name, *args
@values[name.to_s]
end
end
line_parser.rb
require_relative 'lexer'
class ParseError < RuntimeError
end
# CSV line parser.
#
class LineParser
# Initializes the parser with the given lexer instance.
#
def initialize lexer
@lexer = lexer
end
# Parses the given CSV line into a collection of values.
#
def parse line
values = []
last_seen_identifier = false
tokens = @lexer.tokenize line
tokens.each do |token|
case token
when EOFToken
if not last_seen_identifier
values << ""
end
break
when DelimiterToken
if not last_seen_identifier
values << ""
next
else
last_seen_identifier = false
end
when IdentifierToken
if last_seen_identifier
raise ParseError, "Unexpected identifier - a delimiter was expected."
end
last_seen_identifier = true
values << token.lexem
end
end
values
end
end
lexer.rb
require_relative 'assertions'
class LexicalError < RuntimeError
end
class Token
end
class EOFToken < Token
def to_s
"EOF"
end
end
class DelimiterToken < Token
attr_reader :lexem
def initialize lexem
@lexem = lexem
end
def to_s
"DELIMITER(#{@lexem})"
end
end
class IdentifierToken < Token
attr_reader :lexem
def initialize lexem
@lexem = lexem
end
def to_s
"IDENTIFIER(#{@lexem})"
end
end
# CSV line lexical analyzer.
#
class Lexer
# Initialzes the lexer.
#
# * *Args* :
# - +delimiter+ -> The character to be used as delimiter.
#
def initialize delimiter
@delimiter = delimiter
end
# Breaks the given CSV line into a sequence of tokens.
#
def tokenize stream
stream = stream.chars.to_a
tokens = []
while true
tokens << next_token(stream)
if tokens.last.is_a? EOFToken
break
end
end
tokens
end
private
def next_token stream
char = stream.shift
case char
when eof
EOFToken.new
when delimiter
DelimiterToken.new delimiter
when quotes
stream.unshift char
get_quoted_identifier stream
else
stream.unshift char
get_unquoted_identifier stream
end
end
def get_unquoted_identifier stream
lexem = ""
while true do
char = stream.shift
case char
when delimiter
stream.unshift char
return IdentifierToken.new lexem
when eof
return IdentifierToken.new lexem
else
lexem << char
end
end
end
def get_quoted_identifier stream
char = stream.shift
assert { char == quotes }
lexem = ""
while true do
char = stream.shift
case char
when eof
raise LexicalError, "Unexpected EOF within a quoted string."
when quotes
if stream.first == quotes
lexem << quotes
stream.shift
else
return IdentifierToken.new lexem
end
else
lexem << char
end
end
end
def eof
nil
end
def delimiter
@delimiter
end
def quotes
'"'
end
end
assertions.rb
# An exception to represent assertion violations.
#
class AssertionError < RuntimeError
end
# Evaluates the given block as a boolean expression and throws an AssertionError
# in case it results to false.
#
def assert &block
raise AssertionError unless yield
end
2 Answers 2
if not
It is very 'unruby' to use if not
. Ruby has a special unless
just for these cases:
unless last_seen_identifier
values << ""
end
break
You can even shorten this further by using the one-liner shorthand:
values << "" unless last_seen_identifier
break
One more general guideline - although you can have an unless .. else .. end
construct, it is not recommended. Better to reverse the conditional:
# not good
if not last_seen_identifier
values << ""
next
else
last_seen_identifier = false
end
# not better
unless last_seen_identifier
values << ""
next
else
last_seen_identifier = false
end
# better
if last_seen_identifier
last_seen_identifier = false
else
values << ""
next
end
# best
values << "" unless last_seen_identifier
last_seen_identifier = false
Too Many Classes
Contrary to languages like Java or C#, Ruby is not statically typed, but rather duck typed, which means that you can pass any type anywhere, and the objects don't have to inherit from the same base class.
For example, the class Token
adds no functionality, and it is not used anywhere in the code besides being inherited by a bunch of classes. It is totally redundant, and deleting it will not affect the code.
Furthermore, EOFToken
and DelimiterToken
don't contain any intrinsic functionality, and according to ruby's idioms, a better solution would be to simply pass symbols (:eof
and :delimiter
) rather than classes.
The last token (IdentifierToken
) holds the current value, but you can even skip that, simply passing the value itself instead.
This will make the token iteration look like this:
tokens.each do |token|
case token
when :eof
values << "" unless last_seen_identifier
break
when :delimiter
values << "" unless last_seen_identifier
last_seen_identifier = false
else
if last_seen_identifier
raise ParseError, "Unexpected identifier - a delimiter was expected."
end
last_seen_identifier = true
values << token
end
end
Assertions?
Assertions are used by some languages to make some development checks where a condition is assumed to be always true. This methodology is not longer en-vogue in most places, and even where it is - it is also disabled in production:
Most languages allow assertions to be enabled or disabled globally, and sometimes independently. Assertions are often enabled during development and disabled during final testing and on release to the customer. Not checking assertions avoids the cost of evaluating the assertions while (assuming the assertions are free of side effects) still producing the same result under normal conditions. Under abnormal conditions, disabling assertion checking can mean that a program that would have aborted will continue to run. This is sometimes preferable.
In your code the need for the assertion seems a bit artificial, since it checks that the first character is exactly the character pushed back by the caller:
...
when quotes
stream.unshift char
get_quoted_identifier stream
...
char = stream.shift
assert { char == quotes }
that being the case, I would argue that you can remove the unshift/shift tango, and simply assume that the first char is quotes
:
...
when quotes
get_quoted_identifier stream
...
char = quotes
Your UnsupportedDelimiterException
class is very clean. I particularly like that you over rode the to_s
method so that it returns something meaningful. It's a little detail that often gets over looked. Well done.
That said, I'm not sure what you're really getting from the ParseError
class.
class ParseError < RuntimeError end
Yes, it's nice to get a parse error rather than a runtime error, but that's all you're gaining. It seems a little silly to inherit from a class and then have the exact same functionality.
-
1\$\begingroup\$ I disagree. It is a very good idea to have separate error classes. It allows you to rescue parse errors only. If you didn't do this, you'd have to rescue all StandardErrors, check the error message and re-raise the exception if you cannot handle it. It's not about functionality in this case, but about providing context and semantics. \$\endgroup\$britishtea– britishtea2014年12月16日 15:03:18 +00:00Commented Dec 16, 2014 at 15:03
-
\$\begingroup\$ That's a very valid point @britishtea. Feel free to downvote my answer if you think it's bad advice. \$\endgroup\$RubberDuck– RubberDuck2014年12月16日 15:05:39 +00:00Commented Dec 16, 2014 at 15:05
-
\$\begingroup\$ This is indeed bad advice. If someone would be using this code and receive only
RuntimeError
it would be harder to find what the problem is. I understand your point and would agree if would have been something likeclass SpecificParseError < ParseError
that would maybe not have helped, but giving context to the caller is important. \$\endgroup\$Marc-Andre– Marc-Andre2015年08月12日 12:40:45 +00:00Commented Aug 12, 2015 at 12:40