3
\$\begingroup\$

I want to determine which separator is used in a CSV file. With the default comma separator, CSV.foreach yields something like this for a semicolon-separated file:

["something1;something2;something3"]

The code below does the trick, but there must be a better way. I find it annoying to need sep_count. Do you know of a method that returns the most frequent of the characters in SEPERATORS?

SEPERATORS = [";", ","]
CSV.foreach(@file, @config) do |header|
  sep_count = 0
  SEPERATORS.each do |seperator|
    if header.first.scan(/#{seperator}/).count > sep_count
      @config[:col_sep] = seperator
      sep_count = header.first.scan(/#{seperator}/).count
    end
  end
  break
end

EDIT:

Based on your awesome answers I got the 1-liner that I asked for:

@config[:col_sep] = %w(; ,).sort_by { |separator| File.open(@file).first(1).join.count(separator) }.last

I have also come up with this piece of code that determines both col_sep and row_sep:

first_line = ""
File.open(@file) do |file|
  file.each_char do |char|
    first_line << char
    if "\r\n".include?(char)
      @config[:row_sep] = first_line.scan(/\n$|\r$/).first
      break
    end
  end
end
@config[:col_sep] = %w(; ,).sort_by { |separator| first_line.count(separator) }.last

By using the full code we ensure that it is always the first line that gets used, and we also set the row_sep. Feel free to comment if you think anything could be improved further.
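
One case the character-by-character loop does not cover is a Windows-style "\r\n" ending, where it stops at the "\r" and records that alone as row_sep. A possible refinement, as a sketch only: the helper name detect_separators, its candidates parameter, and the fallback to "\n" are assumptions for illustration, not part of the code above.

```ruby
# Hypothetical helper: guesses row and column separators from a chunk of text.
def detect_separators(text, candidates = [";", ",", "\t"])
  # Try "\r\n" before the single characters so a Windows line ending
  # is not reported as a bare "\r". Fall back to "\n" (an assumption)
  # when the text contains no line ending at all.
  row_sep = text[/\r\n|\n|\r/] || "\n"

  # Everything before the first row separator is the header line.
  first_line = text.split(row_sep, 2).first.to_s

  # String#count treats its argument as a character set; that is fine
  # for single characters such as ";" "," and "\t".
  col_sep = candidates.max_by { |sep| first_line.count(sep) }

  { row_sep: row_sep, col_sep: col_sep }
end

result = detect_separators("a;b;c\r\n1;2;3\r\n")
```

The returned hash could then be merged into @config before parsing.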

asked Dec 12, 2012 at 20:46
\$\endgroup\$

3 Answers

5
\$\begingroup\$

You can get the most common separator in header with a one-liner like this:

most_common = SEPARATORS.sort_by{|separator| header.count(separator)}.last
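
The same idea can also be written with Enumerable#max_by, which avoids sorting the whole list just to take its last element. A minimal sketch, where the sample header string is taken from the question and SEPARATORS is assumed to hold single-character strings:

```ruby
SEPARATORS = [";", ","]

# Sample header line, as a plain String (not the Array that CSV.foreach yields).
header = "something1;something2;something3"

# String#count returns how many times the given character occurs;
# max_by returns the candidate with the highest count.
most_common = SEPARATORS.max_by { |separator| header.count(separator) }
```

Here most_common ends up as ";" because the header contains two semicolons and no commas.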

But as you have noticed, CSV.foreach attempts to split up the rows, assuming by default that the separator is a comma.

You probably need to determine the separator in a preprocessing step before actually doing the CSV processing.
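
To illustrate why the separator has to be known before parsing, here is a small sketch using CSV.parse_line on the sample line from the question. The variable names are made up for the example.

```ruby
require "csv"

line = "something1;something2;something3"

# With the default col_sep (a comma) the whole line is one field.
default_row = CSV.parse_line(line)

# With the right col_sep the line splits into three fields.
semicolon_row = CSV.parse_line(line, col_sep: ";")
```

default_row is a one-element array holding the entire line, while semicolon_row holds the three separate values.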

You could just do something like

contents = File.read(@file)
@config[:col_sep] = %w(; ,).sort_by{|separator| contents.count(separator)}.last
CSV.parse(contents, @config) do |row|
  ...
end
# or use the returned array of arrays
rows = CSV.parse(contents, @config)

This might be quite slow if your file is large because you have to read the whole thing into memory. In that case you might want to just look at the first line of the file, and guess the separator from that. To do this, assuming \n is your line separator:

first_line = File.open(@file) do |file|
  file.first
end

Note that you should use the block form to ensure that the file gets closed.

If you need to be line-separator agnostic, I don't think there's a built-in way to do so (although you can change the line separator, that assumes you know it in advance). You might try something like

first_line = ""
File.open(@file) do |file|
  file.each_char do |char|
    break if "\r\n".include?(char)
    first_line << char
  end
end
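
An alternative to reading character by character, sketched under the assumption that the first line fits in a fixed-size chunk: read the start of the file once and split on whichever line ending comes first. The helper name first_line_of and the 4096-byte default are made up for the example.

```ruby
require "tempfile"

# Hypothetical helper: returns the first line of the file at `path`
# without its line ending. Assumes the first line is shorter than
# `chunk_size` bytes; a longer line would be cut off.
def first_line_of(path, chunk_size = 4096)
  # Binary mode avoids newline translation; read returns nil for an empty file.
  chunk = File.open(path, "rb") { |file| file.read(chunk_size) } || ""
  # Split on "\r\n", "\r" or "\n", keeping only the part before the first one.
  chunk.split(/\r\n|\r|\n/, 2).first.to_s
end

# Usage with a throwaway file that uses "\r" line endings.
Tempfile.create("sample") do |file|
  file.write("a;b;c\rd;e;f\r")
  file.flush
  puts first_line_of(file.path) # prints "a;b;c"
end
```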
answered Dec 13, 2012 at 8:30
\$\endgroup\$
4
  • \$\begingroup\$ That is just awesome. It works. Thanks. The only issue is that the search for separators is carried out over the entire file. This means that a file with many commas in sentences could have more commas than semicolons. But all 53 of our tests pass, so all seems fine. \$\endgroup\$ Commented Dec 13, 2012 at 15:51
  • \$\begingroup\$ @Christoffer: File.open('path_to_file.csv').first(10).join results in a string containing the first 10 lines. Should speed things up without affecting your test results. \$\endgroup\$ Commented Dec 13, 2012 at 22:15
  • \$\begingroup\$ Awesome line. I can see it assumes that \n is the line break. Some files use \r. Is it possible to pass an argument that makes it stop at the first \r or \n? \$\endgroup\$ Commented Dec 14, 2012 at 11:58
  • \$\begingroup\$ @Christoffer: I've updated my answer with a possible approach. \$\endgroup\$ Commented Dec 14, 2012 at 13:39
2
\$\begingroup\$

Maybe this?

SEPERATORS = [";", ","]
CSV.foreach(@file, @config) do |header|
  sep_counts = Hash.new(0)
  header.each_char {|c| sep_counts[c] += 1 if SEPERATORS.include? c }
  @config[:col_sep] = sep_counts.sort { |a,b| a[1] <=> b[1] }.first.first
  break
end
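
One caveat with the snippet above: sorting in ascending order and taking .first selects the least frequent separator among those that occur, which only gives the intended result when a single candidate is present in the line. A possible variant that picks the highest count, as a sketch only; the helper name and its arguments are hypothetical, and it expects the line as a String.

```ruby
# Hypothetical helper: tallies candidate separators in a single line
# and returns the one that occurs most often, or nil if none occur.
def most_frequent_separator(line, candidates = [";", ","])
  counts = Hash.new(0)
  line.each_char { |c| counts[c] += 1 if candidates.include?(c) }

  # max_by yields [separator, count] pairs and returns the pair with
  # the highest count; on an empty hash it returns nil.
  separator, _count = counts.max_by { |_sep, count| count }
  separator
end

guess = most_frequent_separator("a;b;c,d")
```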
answered Dec 12, 2012 at 22:16
\$\endgroup\$
1
  • \$\begingroup\$ Thanks. I added a .first on line 3, and then it works: CSV.foreach(@file, @config) do |header| sep_counts = Hash.new(0) header.first.each_char { |c| sep_counts[c] += 1 if SEPERATORS.include? c } @config[:col_sep] = sep_counts.sort { |a,b| a[1] <=> b[1] }.first.first break end Funny thing, though: CSV.foreach on a comma-separated CSV file returns: ["something1", "something2", "something3"] \$\endgroup\$ Commented Dec 13, 2012 at 8:21
0
\$\begingroup\$

I came up with this:

# Guess CSV column separator, based on counts. Tested in Ruby 1.9.3
def guess_columns_separator(string)
  separators = %W(; , \t)
  counts = separators.inject({}){ |hash, separator| hash[string.count(separator)] = separator; hash }
  counts[counts.keys.max]
end
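
A guess like this could be fed straight into the parser. The following is only a sketch of that wiring: the wrapper name parse_with_guessed_separator is hypothetical, it counts separators in the first line only, and it passes col_sep as a keyword argument, which is how recent versions of the csv library expect options.

```ruby
require "csv"

# Hypothetical wrapper: guesses the column separator from the first
# line of `contents`, then parses the whole string with it.
def parse_with_guessed_separator(contents, candidates = [";", ",", "\t"])
  # String#lines splits on "\n", so this assumes "\n" or "\r\n" endings.
  first_line = contents.lines.first.to_s
  col_sep = candidates.max_by { |sep| first_line.count(sep) }
  CSV.parse(contents, col_sep: col_sep)
end

rows = parse_with_guessed_separator("name\tage\nAda\t36\n")
```

For this tab-separated sample, rows holds two arrays of two fields each.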
answered Oct 1, 2013 at 10:53
\$\endgroup\$
