I want to determine which separator is used in a CSV file. CSV.foreach will return something like this:
["something1;something2;something3"]
The code below does the trick, but there must be something better. I find the need for sep_count annoying. Do you know of a method that returns the most frequent of the characters in SEPERATORS?
SEPERATORS = [";", ","]
CSV.foreach(@file, @config) do |header|
  sep_count = 0
  SEPERATORS.each do |seperator|
    if header.first.scan(/#{seperator}/).count > sep_count
      @config[:col_sep] = seperator
      sep_count = header.first.scan(/#{seperator}/).count
    end
  end
  break
end
EDIT:
Based on your awesome answers, I got the one-liner that I asked for:
@config[:col_sep] = %w(; ,).sort_by { |separator| File.open(@file).first(1).join.count(separator) }.last
I have also come up with this piece of code that determines both col_sep and row_sep:
first_line = ""
File.open(@file) do |file|
  file.each_char do |char|
    first_line << char
    if "\r\n".include?(char)
      @config[:row_sep] = first_line.scan(/\n$|\r$/).first
      break
    end
  end
end
@config[:col_sep] = %w(; ,).sort_by { |separator| first_line.count(separator) }.last
By using the full code we ensure that it is always the first line that gets used, and we also set the row_sep. Feel free to comment if you think anything could be improved further.
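One case the loop above does not cover is a file with Windows-style "\r\n" line endings: it stops at the "\r", so row_sep ends up as "\r". Below is a minimal sketch of a variant that checks for "\r\n" first. The method name detect_separators, the candidates list, and the 4096-byte sample size are assumptions made for illustration, and it presumes the first line ends within that sample.

```ruby
require "stringio"

# Hypothetical helper: guesses row_sep and col_sep from the start of an IO.
# Assumes the first line ends within the first `sample_size` bytes.
def detect_separators(io, candidates: [";", ","], sample_size: 4096)
  sample = io.read(sample_size).to_s
  # "\r\n" is tried before "\r" so CRLF files are not mistaken for CR-only files.
  row_sep = sample[/\r\n|\r|\n/]
  first_line = row_sep ? sample[0...sample.index(row_sep)] : sample
  # Pick the candidate that occurs most often in the first line.
  col_sep = candidates.max_by { |separator| first_line.count(separator) }
  { row_sep: row_sep, col_sep: col_sep }
end

# Works with any IO; a StringIO stands in for File.open(@file) here.
detected = detect_separators(StringIO.new("a;b;c\r\n1,5;2;3\r\n"))
```

Note that row_sep is nil when the sample contains no line break at all, so it may be safer to copy it into @config only when it is set.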
3 Answers
You can get the most common separator in header with a one-liner like this:
most_common = SEPARATORS.sort_by{|separator| header.count(separator)}.last
But as you have noticed, CSV.foreach attempts to split up the rows, assuming by default that the separator is a comma.
You probably need to determine the separator in a preprocessing step before actually doing the CSV processing.
You could just do something like
contents = File.read(@file)
@config[:col_sep] = %w(; ,).sort_by{|separator| contents.count(separator)}.last
CSV.parse(contents, @config) do |row|
  ...
end
# or use the returned array of arrays
rows = CSV.parse(contents, @config)
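As a self-contained illustration of the same idea, here is a small sketch that runs on an in-memory string instead of a file. The sample data is made up, max_by is swapped in for sort_by ... last (it should pick the same separator here), and the local config hash stands in for @config. On Ruby 3 and later the options have to be passed as keyword arguments, hence the double splat.

```ruby
require "csv"

# Made-up sample data standing in for File.read(@file).
contents = "name;comment\nAnn;likes tea, not coffee\nBob;none\n"

config = {}
# Pick whichever candidate occurs most often in the contents.
config[:col_sep] = %w(; ,).max_by { |separator| contents.count(separator) }

# The double splat passes the hash as keyword arguments (needed on Ruby 3+).
rows = CSV.parse(contents, **config)
```

Here the semicolon wins even though one field contains a comma, because the count is taken over the whole string.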
This might be quite slow if your file is large, because you have to read the whole thing into memory. In that case you might want to just look at the first line of the file and guess the separator from that. To do this, assuming \n is your line separator:
first_line = File.open(@file) do |file|
  file.first
end
Note that you should use the block form to ensure that the file gets closed.
If you need to be line-separator agnostic, I don't think there's a built-in way to do so (although you can change the line separator, that assumes you know it in advance). You might try something like
first_line = ""
File.open(@file) do |file|
  file.each_char do |char|
    break if "\r\n".include?(char)
    first_line << char
  end
end
- That is just awesome. It works. Thanks. The only issue is that the search for separators is carried out over the entire file. This means that a file with many ,'s in sentences could have more ,'s than ;'s. But all 53 of our tests pass, so all seems fine. – Christoffer, Dec 13, 2012 at 15:51
- @Christoffer: File.open('path_to_file.csv').first(10).join results in a string containing the first 10 lines. Should speed things up without affecting your test results. – steenslag, Dec 13, 2012 at 22:15
- Awesome line. I can see it assumes that \n is the line break. Some files use \r. Is it possible to pass an argument that makes it stop at the first \r or \n? – Christoffer, Dec 14, 2012 at 11:58
- @Christoffer: I've updated my answer with a possible approach. – Andrew Haines, Dec 14, 2012 at 13:39
maybe this?
SEPERATORS = [";", ","]
CSV.foreach(@file, @config) do |header|
  sep_counts = Hash.new(0)
  header.each_char {|c| sep_counts[c] += 1 if SEPERATORS.include? c }
  # The sort is ascending by count, so the last pair holds the most frequent separator.
  @config[:col_sep] = sep_counts.sort { |a,b| a[1] <=> b[1] }.last.first
  break
end
- Thanks. I have put in a first in line 3, then it works.
  CSV.foreach(@file, @config) do |header|
    sep_counts = Hash.new(0)
    header.first.each_char { |c| sep_counts[c] += 1 if SEPERATORS.include? c }
    @config[:col_sep] = sep_counts.sort { |a,b| a[1] <=> b[1] }.first.first
    break
  end
  Funny thing though. CSV.foreach on a comma-separated CSV file returns: ["something1", "something2", "something3"] – Christoffer, Dec 13, 2012 at 8:21
I came up with this:
# Guess CSV column separator, based on counts. Tested in Ruby 1.9.3
def guess_columns_separator(string)
  separators = %W(; , \t)
  counts = separators.inject({}){ |hash, separator| hash[string.count(separator)] = separator; hash }
  counts[counts.keys.max]
end
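Because the hash in the method above is keyed by count, two separators with the same count overwrite each other and the later one wins. A possible variant that skips the intermediate hash is sketched below. The name guess_column_separator and the candidates parameter are hypothetical, and on a tie this version returns the earliest candidate rather than the latest.

```ruby
# Hypothetical variant of the method above; on a tie the earliest candidate wins.
def guess_column_separator(string, candidates = [";", ",", "\t"])
  # String#count returns how many times the separator character occurs.
  candidates.max_by { |separator| string.count(separator) }
end

tab_guess   = guess_column_separator("a\tb\tc")
mixed_guess = guess_column_separator("a;b,c;d")
```

For the two sample strings this should pick the tab and the semicolon respectively.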