I am attempting to translate some of my Perl code to Ruby, from http://rosettacode.org/wiki/Welch%27s_t-test
I am a very very beginner with Ruby, but I have verified that this is producing the correct answers. Is this "idiomatic" Ruby code? how is this with regard to best practices?
def calculate_Pvalue (array1, array2)
if array1.size <= 1
return 1.0
end
if array2.size <= 1
return 1.0
end
mean1 = array1.sum / array1.size
mean2 = array2.sum / array2.size
if mean1 == mean2
return 1.0
end
variance1 = 0.0
variance2 = 0.0
array1.each do |x|
variance1 += (mean1 - x)**2
end
array2.each do |x|
variance2 += (mean2 - x)**2
end
if variance1 == 0.0 && variance2 == 0.0
return 1.0
end
variance1 = variance1 / (array1.size - 1)
variance2 = variance2 / (array2.size - 1)
welch_t_statistic = (mean1 - mean2) / Math.sqrt(variance1 / array1.size + variance2/array2.size)
degrees_of_freedom = ((variance1/array1.size+variance2/array2.size)**2) / (
(variance1*variance1)/(array1.size*array1.size*(array1.size-1))+
(variance2*variance2)/(array2.size*array2.size*(array2.size-1)));
a = degrees_of_freedom / 2
value = degrees_of_freedom / (welch_t_statistic**2 + degrees_of_freedom)
beta = Math.lgamma(a)[0] + 0.57236494292470009 - Math.lgamma(a + 0.5)[0]
acu = 10**(-15)
if a <= 0
return value
end
if value < 0.0 || 1.0 < value
return value
end
if value == 0 or value == 1.0
return value
end
psq = a + 0.5
cx = 1.0 - value
if a < psq * value
xx = cx
cx = value
pp = 0.5
qq = a
indx = 1
else
xx = value
pp = a
qq = 0.5
indx = 0
end
term = 1.0;
ai = 1.0;
value = 1.0;
ns = (qq + cx * psq).to_i;
#Soper reduction formula
rx = xx / cx
temp = qq - ai
while 1
term = term * temp * rx / ( pp + ai );
value = value + term;
temp = term.abs;
if temp <= acu && temp <= acu * value
#p "pp = #{pp}"
#p "xx = #{xx}"
#p "qq = #{qq}"
#p "cx = #{cx}"
#p "beta = #{beta}"
value = value * Math.exp(pp * Math.log(xx) + (qq - 1.0) * Math.log(cx) - beta) / pp;
#p "value = #{value}"
value = 1.0 - value if indx
if indx == 0
value = 1.0 - value
end
break
end
ai = ai + 1.0
ns = ns - 1
#p "ai = #{ai}"
#p "ns = #{ns}"
if 0 <= ns
temp = qq - ai
rx = xx if ns == 0
else
temp = psq
psq = psq + 1.0
end
end
return value
end
d1 = [27.5,21.0,19.0,23.6,17.0,17.9,16.9,20.1,21.9,22.6,23.1,19.6,19.0,21.7,21.4]
d2 = [27.1,22.0,20.8,23.4,23.4,23.5,25.8,22.0,24.8,20.2,21.9,22.1,22.9,20.5,24.4]
d3 = [17.2,20.9,22.6,18.1,21.7,21.4,23.5,24.2,14.7,21.8]
d4 = [21.5,22.8,21.0,23.0,21.6,23.6,22.5,20.7,23.4,21.8,20.7,21.7,21.5,22.5,23.6,21.5,22.5,23.5,21.5,21.8]
d5 = [19.8,20.4,19.6,17.8,18.5,18.9,18.3,18.9,19.5,22.0]
d6 = [28.2,26.6,20.1,23.3,25.2,22.1,17.7,27.6,20.6,13.7,23.2,17.5,20.6,18.0,23.9,21.6,24.3,20.4,24.0,13.2]
d7 = [30.02,29.99,30.11,29.97,30.01,29.99]
d8 = [29.89,29.93,29.72,29.98,30.02,29.98]
x = [3.0,4.0,1.0,2.1]
y = [490.2,340.0,433.9]
s1 = [1.0/15,10.0/62.0]
s2 = [1.0/10,2/50.0]
v1 = [0.010268,0.000167,0.000167]
v2 = [0.159258,0.136278,0.122389]
z1 = [9/23.0,21/45.0,0/38.0]
z2 = [0/44.0,42/94.0,0/22.0]
CORRECT_ANSWERS = [0.021378001462867, 0.148841696605327, 0.0359722710297968,
0.090773324285671, 0.0107515611497845, 0.00339907162713746, 0.52726574965384, 0.545266866977794]
pvalue = calculate_Pvalue(d1,d2)
error = (pvalue - CORRECT_ANSWERS[0]).abs
printf("Test sets 1 p-value = %.14g\n",pvalue)
pvalue = calculate_Pvalue(d3,d4)
error += (pvalue - CORRECT_ANSWERS[1]).abs
printf("Test sets 2 p-value = %.14g\n",pvalue)
pvalue = calculate_Pvalue(d5,d6)
error += (pvalue - CORRECT_ANSWERS[2]).abs
printf("Test sets 3 p-value = %.14g\n",pvalue)
pvalue = calculate_Pvalue(d7,d8)
error += (pvalue - CORRECT_ANSWERS[3]).abs
printf("Test sets 4 p-value = %.14g\n",pvalue)
pvalue = calculate_Pvalue(x,y)
error += (pvalue - CORRECT_ANSWERS[4]).abs
printf("Test sets 5 p-value = %.14g\n",pvalue)
pvalue = calculate_Pvalue(v1,v2)
error += (pvalue - CORRECT_ANSWERS[5]).abs
printf("Test sets 6 p-value = %.14g\n",pvalue)
pvalue = calculate_Pvalue(s1,s2)
error += (pvalue - CORRECT_ANSWERS[6]).abs
printf("Test sets 7 p-value = %.14g\n",pvalue)
pvalue = calculate_Pvalue(z1,z2)
error += (pvalue - CORRECT_ANSWERS[7]).abs
printf("Test sets z p-value = %.14g\n",pvalue)
printf("the cumulative error is %g\n", error)
2 Answers 2
Warnings
You should always turn warnings on during development. Your code generates 6 warnings, all of which are easy to fix:
test.rb:1: warning: parentheses after method name is interpreted as an argument list, not a decomposed argument
test.rb:51: warning: mismatched indentations at 'else' with 'if' at 45
test.rb:56: warning: mismatched indentations at 'end' with 'if' at 45
test.rb:93: warning: mismatched indentations at 'end' with 'while' at 64
test.rb:64: warning: literal in condition
test.rb:95: warning: mismatched indentations at 'end' with 'def' at 1
Formatting
You should use your text editor's auto-format feature.
Linting
If you want to make sure that your code is idiomatic, you should use an automated linting tool that checks for violations of Community Coding Style Guidelines, common mistakes, complexity, etc. One popular tool for this is Rubocop. If I run Rubocop with the default settings on your code, I get a whopping 408 violations. Thankfully, 391 of those can be auto-corrected, leaving only 17, of which 13 are trivially correctable.
Consistency
Consistency is very important when writing code. Actually, it is very important when writing anything technical. If there are two different ways to write something, and you use two different ways to write it, people are going to assume that you wanted to express two different things, and thus they will waste time trying to figure out why you expressed the same thing two different ways.
It is important not only to be consistent with yourself within your code, but also with the community as a whole. There are Community Coding Style Guidelines in pretty much every community, and usually, there are also tools that automatically format your code according to those guidelines.
Here are a couple of examples:
Indentation
Amount: Sometimes, you indent by 1 column, sometimes by 2, sometimes by 4. There does not seem to be any logic whatsoever as to which amount of indentation you choose. Idiomatic would be 2. For example, the first three lines:
def calculate_Pvalue (array1, array2) if array1.size <= 1 return 1.0
When to indent: Sometimes, you indent the body of a loop or a conditional expression, sometimes you don't. Again, there does not seem to be any logic to it. Idiomatic would be to indent. Also, sometimes you indent just randomly in the middle of a sequence of expressions. Also in this case, I can't see any logic to it. Idiomatic would be to not indent.
Whitespace
- Around Operators: Sometimes, you put whitespace around operators, sometimes not. Idiomatic would be to use whitespace. Personally, I sometimes use different amounts of whitespace to emphasize precedence rules, e.g. I would write
a*b + c
, but I can't find such a logic in your case, for example here:variance1 / array1.size + variance2/array2.size
- Around Commas: Sometimes, you have no whitespace around commas, sometimes you have whitespace after a comma. Idiomatic would be to have 1 space after a comma.
- Within parentheses: Sometimes, you have whitespace after the opening and before the closing parenthesis, sometimes not. Idiomatic would be to have no whitespace.
- Around Operators: Sometimes, you put whitespace around operators, sometimes not. Idiomatic would be to use whitespace. Personally, I sometimes use different amounts of whitespace to emphasize precedence rules, e.g. I would write
Operators
Boolean keyword operators: Sometimes you use the operator symbols (
&&
/||
), sometimes you use the keywords (and
/or
). I cannot detect any logic as to when you use which. Idiomatic would be to use the symbol versions. Example:if value < 0.0 || 1.0 < value return value end if value == 0 or value == 1.0 return value end
Compound assignment: Sometimes, you use compound assignment (
a += b
) and sometimes the long form (a = a + b
). Idiomatic would be to use the compound form, unless it severely hinders readability. Examples:variance1 += (mean1 - x)**2
andpsq = psq + 1.0
.
Naming: sometimes, you spell p-value
pvalue
and sometimesPvalue
. You should be consistent and pick one. Idiomatic would bepvalue
orp_value
. Example:pvalue = calculate_Pvalue(d1,d2)
Naming
Snake Case: Names of methods, parameters, local variables, instance variables, class hierarchy variables, and global variables should be written in snake_case. Constants should be written in SCREAMING_SNAKE_CASE. The exception are constants that refer to modules or classes, and methods that are closely related to classes (E.g.
Kernel#Hash
,Kernel#Float
), which are written in PascalCase. So,calculate_Pvalue
should be eithercalculate_pvalue
orcalculate_p_value
.Intention-revealing names: Names should be intention-revealing. There are couple of names in your code that don't really help with understanding, e.g.
temp
andvalue
.
Layout
Whitespace: Your code is all mashed together with no room to breathe. You should break up logical blocks with whitespace to make the flow easier to follow.
Indentation: You already start your code indented by 1 space. Your code should start with no indentation. Then, blocks should be indented bxy 2 spaces.
No whitespace after method name: both when sending a message (calling a method) and when defining a method, there should be no whitespace between the name and the opening parenthesis of the argument list / parameter list:
def calculate_Pvalue (array1, array2)
Semantics
Useless conditional:
value = 1.0 - value if indx
There is no way that
indx
can ever become falsy in your code (it will always be either0
or1
, both of which are truthy), so this conditional is redundant and should just be:value = 1.0 - value
Redundant conditionals:
if value < 0.0 || 1.0 < value return value end if value == 0 or value == 1.0 return value end
You
return
thevalue
if it is less than zero and you alsoreturn
thevalue
if it is equal to zero. That is the same thing as returning the value if it is less than or equal to zero. Same for the other conditional. Replace withif value <= 0.0 || 1.0 <= value return value end
Use
Range#cover?
: Actually, the above conditional is even better written asunless (0.0..1.0).cover?(value) return value end
Use modifier-
if
for single-expression conditionals and guard clauses: preferreturn value unless (0.0..1.0).cover?(value)
Useless use of
Float
literals: In the above and many other places, you could equally well useInteger
literals. An explicitFloat
literal is only required when you specifically need to call aFloat
method, e.g. for division. So, just use1
and0
instead.Use
Numeric#zero?
predicate instead of testing for equality with0
: There are multiple places where you test for equality with0
. It would be more idiomatic to use theNumeric#zero?
predicate instead:return 1 if variance1.zero? && variance2.zero?
Avoid loops: In general, you should avoid loops as much as possible. You should prefer high-level iterators such as
inject
,map
,select
, etc. Only if that is not possible, useeach
. In case you need an infinite loop, useKernel#loop
. Don't usewhile true
and most definitely do not usewhile 1
. For example,variance1 = 0 variance2 = 0 array1.each do |x| variance1 += (mean1 - x)**2 end array2.each do |x| variance2 += (mean2 - x)**2 end
becomes
variance1 = array1.inject(0) {|acc, x| acc + (mean1 - x)**2 } variance2 = array2.inject(0) {|acc, x| acc + (mean2 - x)**2 }
Redundant
return
: In Ruby, the last expression evaluated inside a block, method, module, or class is the return value. So, instead ofreturn value
, you can just writevalue
.Magic number: What does
0.57236494292470009
mean? You should give it a name.*Use
Array#first
instead of[0]
: Replacebeta = Math.lgamma(a)[0] + 0.57236494292470009 - Math.lgamma(a + 0.5)[0]
with
beta = Math.lgamma(a).first + 0.57236494292470009 - Math.lgamma(a + 0.5).first
freeze
objects that are not meant to be modified, especially ones that are assigned to constants:CORRECT_ANSWERS = [0.021378001462867, 0.148841696605327, 0.0359722710297968, 0.090773324285671, 0.0107515611497845, 0.00339907162713746, 0.52726574965384, 0.545266866977794].freeze
Frozen string literals: In the same vein, you should make a habit of adding the magic comment
# frozen_string_literal: true
to all your files.
Commented out code: You should not comment out code. That's what a version control system is for.
Duplication: There is a fair amount of duplication in your code, e.g.
mean1 = array1.sum / array1.size mean2 = array2.sum / array2.size
or
variance1 = 0.0 variance2 = 0.0 array1.each do |x| variance1 += (mean1 - x)**2 end array2.each do |x| variance2 += (mean2 - x)**2 end
You could remove this duplication e.g. by extracting the logic into a method.
Complexity: The method is very complex, according to pretty much any metric (Cyclomatic Complexity, NPath Complexity, ABC, Perceived Complexity, ...). You should break it up into multiple methods.
Documentation: All your methods should have a documentation comment describing their purpose, usage, and pre- and post-conditions. Also, while in general, code should describe itself, in your case, you should document the algorithm and the names of the variables. Presumably, those names come from a well-known paper, in that case, you should cite that paper. (Note: in general, variable names like
x
,xx
,pp
, etc. should be avoided because they do not reveal the intent of the code, but here, I assume they actually are intention-revealing names because they are well-known to a reader versed in statistics. Still, you should document where those names come from.)
-
\$\begingroup\$ thanks, this code is a translation of C, which in turn is a translation of Fortran, so that's where a lot of the quirkiness comes from. The variable names were given by someone else, like you said, I'm not entirely sure what they represent \$\endgroup\$con– con2019年09月12日 16:49:05 +00:00Commented Sep 12, 2019 at 16:49
Some general guidelines:
- Try to keep methods short (+- 10 lines max). Each method should serve a somewhat distinct purpose.
- Use clear variable names
Prefer inline if-statements:
return 1.0 if array1.size <= 1
if array1.size <= 1 return 1.0 end
Another case:
return 1.0 if mean1 == mean2
if mean1 == mean2 return 1.0 end
A big one:
return value if a <= 0
return value if value < 0.0 || value > 1.0
return value if (value == 0) || (value == 1.0)
if a <= 0 return value end if value < 0.0 || 1.0 < value return value end if value == 0 or value == 1.0 return value end
You get the point.