1
\$\begingroup\$

I (ruby beginner) have a performance issue with the following ruby code. The idea is to aggregate a CSV into a smaller, better-to-handle file and calculating the average of some values. Basically the same code in groovy runs about 30 times faster. And I can't believe that Ruby is that slow! ;-)

Some background: The file consists of lines like this (it is an output file of a JMeter performance test):

timeStamp|elapsed|label|responseCode|responseMessage|threadName|dataType|success|failureMessage|bytes|Latency
2013年05月17日_16:30:11.261|4779|Session_Cookie|200|OK|Thread-Gruppe 1-1|text|true||21647|4739

All values of e.g. a certain minute are selected by looking at the first 16 characters of the timestamp:

2013年05月17日_16:30:11.261 => 2013年05月17日_16:30

I wanted to collect values in buckets/slices which are represented by the shortened timestamp and the label (third column). The value of the second column ("elapsed") is summed up.

require 'csv'
require_relative 'slice'
# Parameters: <precision> <input file> [<output file>]
precision = ARGV[0].to_i
input_file = ARGV[1]
output_file = ARGV[2]
time_slices = Hash.new
CSV.foreach(input_file, {:col_sep => '|', :headers => :first_line}) do |row|
 current_time_slice = row['timeStamp'][0, precision]
 if time_slices[current_time_slice] == nil
 time_slices[current_time_slice] = Hash.new
 end
 if time_slices[current_time_slice][row['label']]
 time_slices[current_time_slice][row['label']].put_line(row)
 else
 new_slice = Slice.new(current_time_slice, row['label'])
 new_slice.put_line(row)
 time_slices[current_time_slice][row['label']] = new_slice
 end
end
out = File.new(output_file, 'a')
out.puts 'time|label|elapsed_average'
time_slices.values.each do |time_slice|
 time_slice.values.each do |slice|
 out.puts slice.aggregated_row
 end
end

The slice class looks like this:

class Slice
 attr_accessor :slice_timestamp, :slice_label, :lines, :sum, :count
 def initialize(slice_timestamp, slice_label)
 @slice_timestamp = slice_timestamp
 @slice_label = slice_label
 @count = 0
 @sum = 0
 end
 def put_line(line)
 @sum = @sum + line[1].to_i
 @count = @count + 1
 end
 def average
 @sum / @count
 end
 def aggregated_row
 @slice_timestamp + '|' + @slice_label + '|' + average.to_s
 end
end

I think that I chose a quite straightforward and non-optimized approach, but still - the same approach is much faster in Groovy. What can be the reason for that?

asked Jul 15, 2013 at 9:13
\$\endgroup\$
4
  • \$\begingroup\$ You don't post the groovy code nor the input file (or did I miss the links?) so we cannot compare. Anyway, I don't see anything out of the usual in your code (regarding efficiency, it could be indeed rewritten in a more idiomatic way). Ruby is slow, a very slow language, although 30x vs another dynamic language seems too much. \$\endgroup\$ Commented Jul 15, 2013 at 20:11
  • \$\begingroup\$ I put the Groovy code in this gist: gist.github.com/marvin9000/6006367 (it evolved a little bit in the meantime, but I think the basic idea is still the same). If you want to run it on some testdata, you can download it here: filehosting.org/file/details/435358/input.csv.tgz \$\endgroup\$ Commented Jul 16, 2013 at 7:37
  • \$\begingroup\$ One more addition: I ran both scripts on the sample data again and the difference was "only" about 3x (6 sec vs. 23 sec). It was more significant on larger files (400+ MB), which I don't want to upload here. \$\endgroup\$ Commented Jul 16, 2013 at 7:43
  • \$\begingroup\$ For the record - ruby version: ruby 1.9.3p286 (2012年10月12日 revision 37165) [x86_64-darwin11.4.2], Groovy Version: 2.1.3 JVM: 1.6.0_51 Vendor: Apple Inc. OS: Mac OS X \$\endgroup\$ Commented Jul 16, 2013 at 7:47

1 Answer 1

2
\$\begingroup\$
if time_slices[current_time_slice] == nil
 time_slices[current_time_slice] = Hash.new
end

It's more idomatic to write

time_slices[current_time_slice] ||= {}

In general don't use Hash.new if you can just write {}. You can rid of that statement alltogether though if you declare the time_slices variable with

time_slices = Hash.new { |h, k| h[k] = {} }

instead of time_slices = Hash.new.

if time_slices[current_time_slice][row['label']]
 time_slices[current_time_slice][row['label']].put_line(row)
else
 new_slice = Slice.new(current_time_slice, row['label'])
 new_slice.put_line(row)
 time_slices[current_time_slice][row['label']] = new_slice
end

Also

time_slices[current_time_slice][row['label']] ||= Slice.new ...

is more idiomatic here. But there is more to say here:

  1. Don't forget that Ruby is an interpreted language in principle (although the byte compiler of Ruby 1.9 relativizes that), so don't repeat complex references like time_slices[current_time_slice][row['label']]. But that's not only an performance but even more so an readability issue.
  2. Why does a Slice instance has to know the current_time_slice and label? That's completely redundant, since that information is already in the Hash. Get rid of that.

Considering this I would write the loop like this:

time_slices = Hash.new { |h, k| h[k] = Hash.new { |h, k| h[k] = Slice.new } }
CSV.foreach(input_file, col_sep: '|', headers: :first_line) do |row|
 current_time_slice = row['timeStamp'][0, precision]
 time_slices[current_time_slice][row['label']].put_line(row)
end

Of course you then have to modify your output loop a little

time_slices.each do |slice_timestamp, time_slice|
 time_slice.each do |label, slice|
 out.puts [slice_timestamp, label, slice.average].join('|')
 end
end

So you don't need the Slice#aggregated_row method anymore. But that's a good idea anyway: e.g. why does Slice has to know about your pipe separator? That's not a good separation of concerns.

Note that I avoid the + operator here. That would create a new String instance with every + call. Alternatively you can also try <<, but be aware that this modifies the leftmost string (but oftentimes that doesn't matter).

In contrast there is no penalty in using + in groovy normally, because modern JVMs are using StringBuilder under the hood for it when they see fit. But in Ruby you still should be aware of its implications.

attr_accessor :slice_timestamp, :slice_label, :lines, :sum, :count

Why are you declaring these accessors? They are never used.

answered Nov 10, 2013 at 8:24
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.