I wrote a zscore algorithm in Ruby that runs fine, but is incredibly slow when I have 8000+ entries to score. Can anyone help me figure out a way to improve my code, please?
    module Enumerable
      def mean
        reduce(:+).to_f / length
      end

      def sample_variance
        sum = inject(0) { |acc, i| acc + (i - mean)**2 }
        1 / length.to_f * sum
      end

      def standard_deviation
        Math.sqrt(sample_variance)
      end

      def zscore
        if standard_deviation.zero?
          Array.new(length, 0)
        else
          collect { |v| (v - mean) / standard_deviation }
        end
      end
    end
The float is giving every score an accuracy of up to 17 decimal places. Would making it only 8 decimal places speed things up?
EDIT: Here is an updated version of the algorithm given the advice received in this thread.
    class Array
      def mean(len = length)
        reduce(:+).to_f / len
      end

      def sample_variance
        len = length
        m = mean(len)
        # Start from 0.0 so the first element's deviation is squared as well;
        # without an initial value, reduce uses the raw first element as the accumulator.
        sum = reduce(0.0) { |acc, i| acc + (i - m)**2 }
        sum / len
      end

      def standard_deviation
        Math.sqrt(sample_variance)
      end

      def zscore
        stdev = standard_deviation
        m = mean
        stdev.zero? ? Array.new(length, 0) : collect { |v| (v - m) / stdev }
      end
    end
1 Answer
The problem is here: `collect { |v| (v - mean) / standard_deviation }`. `standard_deviation` is constant but, being inside a block, it is called on each iteration. Set the value to a local variable before the block. As noted by Flambino, the same principle applies to `sample_variance` (which uses `mean` inside a block).
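To make the cost visible, here is a minimal sketch that counts how often the standard deviation gets computed. `CountingSeries` and its method names are hypothetical and only exist for this illustration; they are not part of the question's code.

```ruby
# Hypothetical wrapper that counts standard deviation computations.
class CountingSeries
  attr_reader :stdev_calls

  def initialize(values)
    @values = values
    @stdev_calls = 0
  end

  def mean
    @values.reduce(:+).to_f / @values.length
  end

  def standard_deviation
    @stdev_calls += 1
    m = mean
    Math.sqrt(@values.reduce(0.0) { |acc, v| acc + (v - m)**2 } / @values.length)
  end

  # Recomputes mean and standard deviation once per element.
  def zscore_slow
    @values.map { |v| (v - mean) / standard_deviation }
  end

  # Computes both invariants once, before entering the block.
  def zscore_fast
    m = mean
    sd = standard_deviation
    @values.map { |v| (v - m) / sd }
  end
end

series = CountingSeries.new([2, 4, 4, 4, 5, 5, 7, 9])

slow = series.zscore_slow
slow_calls = series.stdev_calls               # one call per element

fast = series.zscore_fast
fast_calls = series.stdev_calls - slow_calls  # a single call
```

Since each standard deviation call walks the whole array, the slow version does work proportional to the square of the array size, which would explain why it only starts to hurt at several thousand entries. Both versions return the same scores.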
In a functional language (where immutability is honored) the compiler would be able to do the right thing, but not in an imperative language plagued with side-effects like Ruby.
Some additional notes on your code:
- `module Enumerable`: But you call `.length`, which is not a method that an enumerable is required to implement. Consider adding these methods to `Array` (which includes `Enumerable`).
- You use `reduce` and then `inject`. I'd use just one of the aliases.
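A small sketch of the first point, assuming a hypothetical `Readings` class that includes `Enumerable` and implements nothing but `each`:

```ruby
# Hypothetical minimal collection: it only implements `each`.
class Readings
  include Enumerable

  def initialize(*values)
    @values = values
  end

  def each(&block)
    @values.each(&block)
  end
end

readings = Readings.new(2, 4, 6)

has_reduce = readings.respond_to?(:reduce)  # provided by Enumerable
has_length = readings.respond_to?(:length)  # not provided by Enumerable
item_count = readings.count                 # Enumerable#count works through each

array_is_enumerable = Array.include?(Enumerable)  # Array mixes in Enumerable
array_length = [2, 4, 6].length                   # Array defines length itself
```

So a `mean` defined on `Enumerable` that calls `length` would raise `NoMethodError` on an object like `readings`. Defining the methods on `Array` avoids that; using `count` instead of `length` would be another option if the methods really need to live on `Enumerable`.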

- I actually just noticed this before you wrote it. Setting `standard_deviation` to a variable at the beginning of the `zscore` method makes this run blazingly fast. Thanks. – DaniG2k, May 18, 2014 at 21:39
- @DaniG2k You can do the same local var trick with `mean` in your `zscore` and `sample_variance` methods. Not as big a boost as storing `standard_deviation`, but the principle's the same. – Flambino, May 18, 2014 at 21:42
- @tokland I'm confused: the `reduce` method should be called on an `Enumerable` but `length` should be called on an `Array`. Which is better to use? – DaniG2k, May 18, 2014 at 21:53
- @DaniG2k: `Array` includes `Enumerable`. – tokland, May 18, 2014 at 21:58
- [...] `sum / (length + 1.0)`.
- [...] `sum/(length + 1.0)` for sample variance? Thanks! Also, this is a population variance as I am running it on all records.
- [...] `sum.to_f / len` reads better than `1 / len.to_f * sum`.
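On the naming: dividing the sum of squared deviations by n gives the population variance, while the usual unbiased sample variance divides by n - 1 (Bessel's correction). The question's `sample_variance` divides by `length`, so it is the population version, which fits scoring all records. A minimal sketch of the difference, using hypothetical helper names:

```ruby
# Hypothetical helpers; `values` is assumed to be an Array of numbers.
def sum_of_squares(values)
  m = values.reduce(:+).to_f / values.length
  values.reduce(0.0) { |acc, v| acc + (v - m)**2 }
end

# Divide by n: the data is the entire population.
def population_variance(values)
  sum_of_squares(values) / values.length
end

# Divide by n - 1: the data is a sample (needs at least two values).
def sample_variance(values)
  sum_of_squares(values) / (values.length - 1)
end

data = [2, 4, 4, 4, 5, 5, 7, 9]
pop_var = population_variance(data)   # 32.0 / 8
samp_var = sample_variance(data)      # 32.0 / 7
```

For 8000+ entries the two values are nearly identical, so the choice mostly matters for small inputs and for naming the method accurately.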