Look through a string and return the most frequent character (Ruby)

Question 1

I want to determine which separator is used in a csv file. CSV.foreach will return something like this:

["something1;something2;something3"]

The code below does the trick, but there must be a better way. I find it annoying to need sep_count. Do you know of a method that returns the most frequent of the characters in SEPERATORS?

SEPERATORS = [";", ","]
CSV.foreach(@file, @config) do |header|
  sep_count = 0
  SEPERATORS.each do |seperator|
    if header.first.scan(/#{seperator}/).count > sep_count
      @config[:col_sep] = seperator
      sep_count = header.first.scan(/#{seperator}/).count
    end
  end
  break
end

EDIT:

Based on your awesome answers I got the one-liner that I asked for:

@config[:col_sep] = %w(; ,).sort_by { |separator| File.open(@file).first(1).join.count(separator) }.last

I have also come up with this piece of code that determines both col_sep and row_sep:

first_line = ""
File.open(@file) do |file|
  file.each_char do |char|
    first_line << char
    if "\r\n".include?(char)
      @config[:row_sep] = first_line.scan(/\n$|\r$/).first
      break
    end
  end
end
@config[:col_sep] = %w(; ,).sort_by { |separator| first_line.count(separator) }.last

By using the full code we ensure that it is always the first line that gets used, and we also set the row_sep. Feel free to comment if you think anything could be improved further.
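One caveat with the loop above: it stops at the first \r or \n, so a file with \r\n line endings ends up with "\r" as its row_sep. A minimal sketch of one way to handle that case is to peek at the character after a \r. The helper name read_first_line_and_row_sep is hypothetical, and a StringIO stands in for the opened file. Unlike the code above, the returned line does not include the terminator.

```ruby
require "stringio"

# Hypothetical helper (not from the original post): reads characters from
# any IO until the first line ending and returns [first_line, row_sep].
def read_first_line_and_row_sep(io)
  first_line = +""
  row_sep = nil
  io.each_char do |char|
    if char == "\n"
      row_sep = "\n"
      break
    elsif char == "\r"
      # Peek at the next character to tell "\r\n" apart from a bare "\r".
      following = io.getc
      if following == "\n"
        row_sep = "\r\n"
      else
        row_sep = "\r"
        io.ungetc(following) if following
      end
      break
    end
    first_line << char
  end
  [first_line, row_sep]
end

# StringIO stands in for File.open(@file) here.
line, row_sep = read_first_line_and_row_sep(StringIO.new("a;b;c\r\n1;2;3\r\n"))
col_sep = %w(; ,).max_by { |separator| line.count(separator) }
```

With a real file, the same helper could be called inside File.open(@file) { |file| ... } so the handle is closed afterwards.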

Question 2

You can get the most common separator in header with a one-liner like this:

most_common = SEPARATORS.sort_by{|separator| header.count(separator)}.last

But as you have noticed CSV.foreach attempts to split up the rows, assuming by default that the separator is a comma.

You probably need to determine the separator in a preprocessing step before actually doing the CSV processing.

You could just do something like

contents = File.read(@file)
@config[:col_sep] = %w(; ,).sort_by{|separator| contents.count(separator)}.last
CSV.parse(contents, @config) do |row|
  ...
end
# or use the returned array of arrays
rows = CSV.parse(contents, @config)
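As a rough, self-contained illustration of the preprocessing idea, the sketch below guesses the separator from the raw text and then hands it to CSV.parse. The helper name guess_col_sep and the sample data are made up for the example. The option is passed as a keyword (col_sep:) on the assumption that newer Ruby versions may not accept an options hash like @config positionally.

```ruby
require "csv"

# Hypothetical helper: pick the candidate that occurs most often in the text.
def guess_col_sep(text, candidates = [";", ","])
  candidates.max_by { |separator| text.count(separator) }
end

# Sample data invented for this sketch; note the comma inside a field.
contents = "name;note\nAnn;likes tea, coffee\n"
col_sep = guess_col_sep(contents)
rows = CSV.parse(contents, col_sep: col_sep)
```

max_by does the same job as sort_by { ... }.last without sorting the whole list.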

This might be quite slow if your file is large because you have to read the whole thing into memory. In that case you might want to just look at the first line of the file, and guess the separator from that. To do this, assuming \n is your line separator:

first_line = File.open(@file) do |file|
  file.first
end

Note that you should use the block form to ensure that the file gets closed.

If you need to be line-separator agnostic, I don't think there's a built-in way to do so (although you can change the line separator, that assumes you know it in advance). You might try something like

first_line = ""
File.open(@file) do |file|
  file.each_char do |char|
    break if "\r\n".include?(char)
    first_line << char
  end
end

Question 3

That is just awesome. It works. Thanks. The only issue is that the search for separators is carried out over the entire file. This means that a file with many ,'s in sentences could have more ,'s than ;'s. But all 53 of our tests pass, so all seems fine.

Question 4

@Christoffer: File.open('path_to_file.csv').first(10).join results in a string containing the first 10 lines. Should speed things up without affecting your test results.
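A small sketch of the same idea using the block form of File.open, so the file is closed once the lines have been read. The helper name head_of_file is hypothetical, and a temporary file stands in for the real CSV.

```ruby
require "tempfile"

# Hypothetical helper: return the first `limit` lines of a file as one string.
# The block form of File.open closes the file when the block returns.
def head_of_file(path, limit = 10)
  File.open(path) { |file| file.first(limit).join }
end

# A temporary file stands in for the real CSV in this sketch.
sample = Tempfile.create("sample") do |tmp|
  tmp.write("a;b\n1;2\n3;4\n")
  tmp.flush
  head_of_file(tmp.path, 2)
end
```

The resulting string can then be passed to the separator count in place of the whole file contents.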

Question 5

Awesome line. I can see it assumes that \n is the line break. Some files use \r. Is it possible to pass an argument that makes it stop at the first \r or \n?

Question 6

@Christoffer: I've updated my answer with a possible approach.

Question 7

maybe this?

SEPERATORS = [";", ","]
CSV.foreach(@file, @config) do |header|
  sep_counts = Hash.new(0)
  header.each_char {|c| sep_counts[c] += 1 if SEPERATORS.include? c }
  @config[:col_sep] = sep_counts.sort { |a,b| a[1] <=> b[1] }.first.first
  break
end

Question 8

Thanks. After adding .first in line 3, it works:

CSV.foreach(@file, @config) do |header|
  sep_counts = Hash.new(0)
  header.first.each_char { |c| sep_counts[c] += 1 if SEPERATORS.include? c }
  @config[:col_sep] = sep_counts.sort { |a,b| a[1] <=> b[1] }.first.first
  break
end

Funny thing, though: CSV.foreach on a comma-separated CSV file returns: ["something1", "something2", "something3"]
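The "funny thing" follows from the default col_sep being a comma: a comma-separated line is split before the counting code ever sees it, so the commas are no longer in the fields. The sketch below shows that with CSV.parse_line and then counts on a raw line instead. It also uses max_by, because an ascending sort followed by .first picks the entry with the lowest count when both separators occur. The sample strings are made up for illustration.

```ruby
require "csv"

# With the default col_sep (","), a comma-separated line is already split,
# so the separators themselves are gone from the fields.
comma_row = CSV.parse_line("something1,something2,something3")

# A semicolon-separated line stays together in a single field.
semicolon_row = CSV.parse_line("something1;something2;something3")

# Counting on the raw, unparsed line avoids the problem.
raw_line = "something1;something2,x;something3"
sep_counts = Hash.new(0)
raw_line.each_char { |c| sep_counts[c] += 1 if [";", ","].include?(c) }

# max_by returns the [separator, count] pair with the highest count;
# &. guards against a line that contains no candidate at all.
most_frequent = sep_counts.max_by { |_separator, count| count }&.first
```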

Question 9

I came up with this:

# Guess CSV column separator, based on counts. Tested in Ruby 1.9.3
def guess_columns_separator(string)
  separators = %W(; , \t)
  counts = separators.inject({}){ |hash, separator| hash[string.count(separator)] = separator; hash }
  counts[counts.keys.max]
end
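One thing to be aware of in the method above: the hash is keyed by count, so when two separators occur equally often the later one overwrites the earlier, and a string with no separators at all still returns one. A possible variant, sketched under the assumption that nil is an acceptable "no guess" result, keys the counts by separator instead. The method name is hypothetical.

```ruby
# Hypothetical variant: pair each separator with its count so equal counts
# cannot overwrite each other, and return nil when no candidate occurs.
def guess_columns_separator_or_nil(string, separators = [";", ",", "\t"])
  counts = separators.map { |separator| [separator, string.count(separator)] }
  best, best_count = counts.max_by { |_separator, count| count }
  best_count.zero? ? nil : best
end

tab_guess = guess_columns_separator_or_nil("a\tb\tc")
no_guess = guess_columns_separator_or_nil("abc")
```

Ties are still not resolved in any principled way here; the sketch only makes the no-separator case explicit.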

Accepted answer (the text under "Question 2" above): Andrew Haines · 2012-12-13 08:30:33Z

