I (ruby beginner) have a performance issue with the following ruby code. The idea is to aggregate a CSV into a smaller, better-to-handle file and calculating the average of some values. Basically the same code in groovy runs about 30 times faster. And I can't believe that Ruby is that slow! ;-)
Some background: The file consists of lines like this (it is an output file of a JMeter performance test):
timeStamp|elapsed|label|responseCode|responseMessage|threadName|dataType|success|failureMessage|bytes|Latency
2013年05月17日_16:30:11.261|4779|Session_Cookie|200|OK|Thread-Gruppe 1-1|text|true||21647|4739
All values of e.g. a certain minute are selected by looking at the first 16 characters of the timestamp:
2013年05月17日_16:30:11.261 => 2013年05月17日_16:30
I wanted to collect values in buckets/slices which are represented by the shortened timestamp and the label (third column). The value of the second column ("elapsed") is summed up.
require 'csv'
require_relative 'slice'
# Parameters: <precision> <input file> [<output file>]
precision = ARGV[0].to_i
input_file = ARGV[1]
output_file = ARGV[2]
time_slices = Hash.new
CSV.foreach(input_file, {:col_sep => '|', :headers => :first_line}) do |row|
current_time_slice = row['timeStamp'][0, precision]
if time_slices[current_time_slice] == nil
time_slices[current_time_slice] = Hash.new
end
if time_slices[current_time_slice][row['label']]
time_slices[current_time_slice][row['label']].put_line(row)
else
new_slice = Slice.new(current_time_slice, row['label'])
new_slice.put_line(row)
time_slices[current_time_slice][row['label']] = new_slice
end
end
out = File.new(output_file, 'a')
out.puts 'time|label|elapsed_average'
time_slices.values.each do |time_slice|
time_slice.values.each do |slice|
out.puts slice.aggregated_row
end
end
The slice class looks like this:
class Slice
attr_accessor :slice_timestamp, :slice_label, :lines, :sum, :count
def initialize(slice_timestamp, slice_label)
@slice_timestamp = slice_timestamp
@slice_label = slice_label
@count = 0
@sum = 0
end
def put_line(line)
@sum = @sum + line[1].to_i
@count = @count + 1
end
def average
@sum / @count
end
def aggregated_row
@slice_timestamp + '|' + @slice_label + '|' + average.to_s
end
end
I think that I chose a quite straightforward and non-optimized approach, but still - the same approach is much faster in Groovy. What can be the reason for that?
-
\$\begingroup\$ You don't post the groovy code nor the input file (or did I miss the links?) so we cannot compare. Anyway, I don't see anything out of the usual in your code (regarding efficiency, it could be indeed rewritten in a more idiomatic way). Ruby is slow, a very slow language, although 30x vs another dynamic language seems too much. \$\endgroup\$tokland– tokland2013年07月15日 20:11:01 +00:00Commented Jul 15, 2013 at 20:11
-
\$\begingroup\$ I put the Groovy code in this gist: gist.github.com/marvin9000/6006367 (it evolved a little bit in the meantime, but I think the basic idea is still the same). If you want to run it on some testdata, you can download it here: filehosting.org/file/details/435358/input.csv.tgz \$\endgroup\$Elbonian– Elbonian2013年07月16日 07:37:23 +00:00Commented Jul 16, 2013 at 7:37
-
\$\begingroup\$ One more addition: I ran both scripts on the sample data again and the difference was "only" about 3x (6 sec vs. 23 sec). It was more significant on larger files (400+ MB), which I don't want to upload here. \$\endgroup\$Elbonian– Elbonian2013年07月16日 07:43:31 +00:00Commented Jul 16, 2013 at 7:43
-
\$\begingroup\$ For the record - ruby version: ruby 1.9.3p286 (2012年10月12日 revision 37165) [x86_64-darwin11.4.2], Groovy Version: 2.1.3 JVM: 1.6.0_51 Vendor: Apple Inc. OS: Mac OS X \$\endgroup\$Elbonian– Elbonian2013年07月16日 07:47:25 +00:00Commented Jul 16, 2013 at 7:47
1 Answer 1
if time_slices[current_time_slice] == nil
time_slices[current_time_slice] = Hash.new
end
It's more idomatic to write
time_slices[current_time_slice] ||= {}
In general don't use Hash.new
if you can just write {}
. You can rid of that statement alltogether though if you declare the time_slices
variable with
time_slices = Hash.new { |h, k| h[k] = {} }
instead of time_slices = Hash.new
.
if time_slices[current_time_slice][row['label']]
time_slices[current_time_slice][row['label']].put_line(row)
else
new_slice = Slice.new(current_time_slice, row['label'])
new_slice.put_line(row)
time_slices[current_time_slice][row['label']] = new_slice
end
Also
time_slices[current_time_slice][row['label']] ||= Slice.new ...
is more idiomatic here. But there is more to say here:
- Don't forget that Ruby is an interpreted language in principle (although the byte compiler of Ruby 1.9 relativizes that), so don't repeat complex references like
time_slices[current_time_slice][row['label']]
. But that's not only an performance but even more so an readability issue. - Why does a
Slice
instance has to know thecurrent_time_slice
andlabel
? That's completely redundant, since that information is already in the Hash. Get rid of that.
Considering this I would write the loop like this:
time_slices = Hash.new { |h, k| h[k] = Hash.new { |h, k| h[k] = Slice.new } }
CSV.foreach(input_file, col_sep: '|', headers: :first_line) do |row|
current_time_slice = row['timeStamp'][0, precision]
time_slices[current_time_slice][row['label']].put_line(row)
end
Of course you then have to modify your output loop a little
time_slices.each do |slice_timestamp, time_slice|
time_slice.each do |label, slice|
out.puts [slice_timestamp, label, slice.average].join('|')
end
end
So you don't need the Slice#aggregated_row
method anymore. But that's a good idea anyway: e.g. why does Slice
has to know about your pipe separator? That's not a good separation of concerns.
Note that I avoid the +
operator here. That would create a new String instance with every +
call. Alternatively you can also try <<
, but be aware that this modifies the leftmost string (but oftentimes that doesn't matter).
In contrast there is no penalty in using +
in groovy normally, because modern JVMs are using StringBuilder under the hood for it when they see fit. But in Ruby you still should be aware of its implications.
attr_accessor :slice_timestamp, :slice_label, :lines, :sum, :count
Why are you declaring these accessors? They are never used.