Calculating p-value for Welch's t-test

Question 1

I am attempting to translate some of my Perl code to Ruby, from http://rosettacode.org/wiki/Welch%27s_t-test

I am a very very beginner with Ruby, but I have verified that this is producing the correct answers. Is this "idiomatic" Ruby code? how is this with regard to best practices?

 def calculate_Pvalue (array1, array2)
 if array1.size <= 1
 return 1.0
 end
 if array2.size <= 1
 return 1.0
 end
 mean1 = array1.sum / array1.size
 mean2 = array2.sum / array2.size
 if mean1 == mean2
 return 1.0
 end
 variance1 = 0.0
 variance2 = 0.0
 array1.each do |x|
 variance1 += (mean1 - x)**2
 end
 array2.each do |x|
 variance2 += (mean2 - x)**2
 end
 if variance1 == 0.0 && variance2 == 0.0
 return 1.0
 end
 variance1 = variance1 / (array1.size - 1)
 variance2 = variance2 / (array2.size - 1)
 welch_t_statistic = (mean1 - mean2) / Math.sqrt(variance1 / array1.size + variance2/array2.size)
 degrees_of_freedom = ((variance1/array1.size+variance2/array2.size)**2) / (
 (variance1*variance1)/(array1.size*array1.size*(array1.size-1))+
 (variance2*variance2)/(array2.size*array2.size*(array2.size-1)));
 a = degrees_of_freedom / 2
 value = degrees_of_freedom / (welch_t_statistic**2 + degrees_of_freedom)
 beta = Math.lgamma(a)[0] + 0.57236494292470009 - Math.lgamma(a + 0.5)[0]
 acu = 10**(-15)
 if a <= 0
 return value
 end
 if value < 0.0 || 1.0 < value
 return value
 end
 if value == 0 or value == 1.0
 return value
 end
 psq = a + 0.5
 cx = 1.0 - value
 if a < psq * value
 xx = cx
 cx = value
 pp = 0.5
 qq = a
 indx = 1
 else
 xx = value
 pp = a
 qq = 0.5
 indx = 0
 end
 term = 1.0;
 ai = 1.0;
 value = 1.0;
 ns = (qq + cx * psq).to_i;
 #Soper reduction formula
 rx = xx / cx
 temp = qq - ai
 while 1
 term = term * temp * rx / ( pp + ai );
 value = value + term;
 temp = term.abs;
 if temp <= acu && temp <= acu * value
 #p "pp = #{pp}"
 #p "xx = #{xx}"
 #p "qq = #{qq}"
 #p "cx = #{cx}"
 #p "beta = #{beta}"
 value = value * Math.exp(pp * Math.log(xx) + (qq - 1.0) * Math.log(cx) - beta) / pp;
 #p "value = #{value}"
 value = 1.0 - value if indx
 if indx == 0
 value = 1.0 - value
 end
 break
 end
 ai = ai + 1.0
 ns = ns - 1
 #p "ai = #{ai}"
 #p "ns = #{ns}"
 if 0 <= ns
 temp = qq - ai
 rx = xx if ns == 0
 else
 temp = psq
 psq = psq + 1.0
 end
 end
 return value
end
d1 = [27.5,21.0,19.0,23.6,17.0,17.9,16.9,20.1,21.9,22.6,23.1,19.6,19.0,21.7,21.4]
d2 = [27.1,22.0,20.8,23.4,23.4,23.5,25.8,22.0,24.8,20.2,21.9,22.1,22.9,20.5,24.4]
d3 = [17.2,20.9,22.6,18.1,21.7,21.4,23.5,24.2,14.7,21.8]
d4 = [21.5,22.8,21.0,23.0,21.6,23.6,22.5,20.7,23.4,21.8,20.7,21.7,21.5,22.5,23.6,21.5,22.5,23.5,21.5,21.8]
d5 = [19.8,20.4,19.6,17.8,18.5,18.9,18.3,18.9,19.5,22.0]
d6 = [28.2,26.6,20.1,23.3,25.2,22.1,17.7,27.6,20.6,13.7,23.2,17.5,20.6,18.0,23.9,21.6,24.3,20.4,24.0,13.2]
d7 = [30.02,29.99,30.11,29.97,30.01,29.99]
d8 = [29.89,29.93,29.72,29.98,30.02,29.98]
x = [3.0,4.0,1.0,2.1]
y = [490.2,340.0,433.9]
s1 = [1.0/15,10.0/62.0]
s2 = [1.0/10,2/50.0]
v1 = [0.010268,0.000167,0.000167]
v2 = [0.159258,0.136278,0.122389]
z1 = [9/23.0,21/45.0,0/38.0]
z2 = [0/44.0,42/94.0,0/22.0]
CORRECT_ANSWERS = [0.021378001462867, 0.148841696605327, 0.0359722710297968,
0.090773324285671, 0.0107515611497845, 0.00339907162713746, 0.52726574965384, 0.545266866977794]
pvalue = calculate_Pvalue(d1,d2)
error = (pvalue - CORRECT_ANSWERS[0]).abs
printf("Test sets 1 p-value = %.14g\n",pvalue)
pvalue = calculate_Pvalue(d3,d4)
error += (pvalue - CORRECT_ANSWERS[1]).abs
printf("Test sets 2 p-value = %.14g\n",pvalue)
pvalue = calculate_Pvalue(d5,d6)
error += (pvalue - CORRECT_ANSWERS[2]).abs
printf("Test sets 3 p-value = %.14g\n",pvalue)
pvalue = calculate_Pvalue(d7,d8)
error += (pvalue - CORRECT_ANSWERS[3]).abs
printf("Test sets 4 p-value = %.14g\n",pvalue)
pvalue = calculate_Pvalue(x,y)
error += (pvalue - CORRECT_ANSWERS[4]).abs
printf("Test sets 5 p-value = %.14g\n",pvalue)
pvalue = calculate_Pvalue(v1,v2)
error += (pvalue - CORRECT_ANSWERS[5]).abs
printf("Test sets 6 p-value = %.14g\n",pvalue)
pvalue = calculate_Pvalue(s1,s2)
error += (pvalue - CORRECT_ANSWERS[6]).abs
printf("Test sets 7 p-value = %.14g\n",pvalue)
pvalue = calculate_Pvalue(z1,z2)
error += (pvalue - CORRECT_ANSWERS[7]).abs
printf("Test sets z p-value = %.14g\n",pvalue)
printf("the cumulative error is %g\n", error)

Question 2

Warnings

You should always turn warnings on during development. Your code generates 6 warnings, all of which are easy to fix:

test.rb:1: warning: parentheses after method name is interpreted as an argument list, not a decomposed argument
test.rb:51: warning: mismatched indentations at 'else' with 'if' at 45
test.rb:56: warning: mismatched indentations at 'end' with 'if' at 45
test.rb:93: warning: mismatched indentations at 'end' with 'while' at 64
test.rb:64: warning: literal in condition
test.rb:95: warning: mismatched indentations at 'end' with 'def' at 1

Formatting

You should use your text editor's auto-format feature.

Linting

If you want to make sure that your code is idiomatic, you should use an automated linting tool that checks for violations of Community Coding Style Guidelines, common mistakes, complexity, etc. One popular tool for this is Rubocop. If I run Rubocop with the default settings on your code, I get a whopping 408 violations. Thankfully, 391 of those can be auto-corrected, leaving only 17, of which 13 are trivially correctable.

Consistency

Consistency is very important when writing code. Actually, it is very important when writing anything technical. If there are two different ways to write something, and you use two different ways to write it, people are going to assume that you wanted to express two different things, and thus they will waste time trying to figure out why you expressed the same thing two different ways.

It is important not only to be consistent with yourself within your code, but also with the community as a whole. There are Community Coding Style Guidelines in pretty much every community, and usually, there are also tools that automatically format your code according to those guidelines.

Here are a couple of examples:

Indentation
- Amount: Sometimes, you indent by 1 column, sometimes by 2, sometimes by 4. There does not seem to be any logic whatsoever as to which amount of indentation you choose. Idiomatic would be 2. For example, the first three lines:
```
def calculate_Pvalue (array1, array2)
 if array1.size <= 1
 return 1.0
```
- When to indent: Sometimes, you indent the body of a loop or a conditional expression, sometimes you don't. Again, there does not seem to be any logic to it. Idiomatic would be to indent. Also, sometimes you indent just randomly in the middle of a sequence of expressions. Also in this case, I can't see any logic to it. Idiomatic would be to not indent.
Whitespace
- Around Operators: Sometimes, you put whitespace around operators, sometimes not. Idiomatic would be to use whitespace. Personally, I sometimes use different amounts of whitespace to emphasize precedence rules, e.g. I would write a*b + c, but I can't find such a logic in your case, for example here: variance1 / array1.size + variance2/array2.size
- Around Commas: Sometimes, you have no whitespace around commas, sometimes you have whitespace after a comma. Idiomatic would be to have 1 space after a comma.
- Within parentheses: Sometimes, you have whitespace after the opening and before the closing parenthesis, sometimes not. Idiomatic would be to have no whitespace.
Operators
- Boolean keyword operators: Sometimes you use the operator symbols (&& / ||), sometimes you use the keywords (and / or). I cannot detect any logic as to when you use which. Idiomatic would be to use the symbol versions. Example:
```
if value < 0.0 || 1.0 < value
 return value
end
if value == 0 or value == 1.0
 return value
end
```
- Compound assignment: Sometimes, you use compound assignment (a += b) and sometimes the long form (a = a + b). Idiomatic would be to use the compound form, unless it severely hinders readability. Examples: variance1 += (mean1 - x)**2 and psq = psq + 1.0.
Naming: sometimes, you spell p-value pvalue and sometimes Pvalue. You should be consistent and pick one. Idiomatic would be pvalue or p_value. Example:
```
pvalue = calculate_Pvalue(d1,d2)
```

Naming

Snake Case: Names of methods, parameters, local variables, instance variables, class hierarchy variables, and global variables should be written in snake_case. Constants should be written in SCREAMING_SNAKE_CASE. The exception are constants that refer to modules or classes, and methods that are closely related to classes (E.g. Kernel#Hash, Kernel#Float), which are written in PascalCase. So, calculate_Pvalue should be either calculate_pvalue or calculate_p_value.
Intention-revealing names: Names should be intention-revealing. There are couple of names in your code that don't really help with understanding, e.g. temp and value.

Layout

Whitespace: Your code is all mashed together with no room to breathe. You should break up logical blocks with whitespace to make the flow easier to follow.
Indentation: You already start your code indented by 1 space. Your code should start with no indentation. Then, blocks should be indented bxy 2 spaces.
No whitespace after method name: both when sending a message (calling a method) and when defining a method, there should be no whitespace between the name and the opening parenthesis of the argument list / parameter list:
```
def calculate_Pvalue (array1, array2)
```

Semantics

Useless conditional:
```
value = 1.0 - value if indx
```
There is no way that indx can ever become falsy in your code (it will always be either 0 or 1, both of which are truthy), so this conditional is redundant and should just be:
```
value = 1.0 - value
```
Redundant conditionals:
```
if value < 0.0 || 1.0 < value
 return value
end
if value == 0 or value == 1.0
 return value
end
```
You return the value if it is less than zero and you also return the value if it is equal to zero. That is the same thing as returning the value if it is less than or equal to zero. Same for the other conditional. Replace with
```
if value <= 0.0 || 1.0 <= value
 return value
end
```
Use Range#cover?: Actually, the above conditional is even better written as
```
unless (0.0..1.0).cover?(value)
 return value
end
```
Use modifier-if for single-expression conditionals and guard clauses: prefer
```
return value unless (0.0..1.0).cover?(value)
```
Useless use of Float literals: In the above and many other places, you could equally well use Integer literals. An explicit Float literal is only required when you specifically need to call a Float method, e.g. for division. So, just use 1 and 0 instead.
Use Numeric#zero? predicate instead of testing for equality with 0: There are multiple places where you test for equality with 0. It would be more idiomatic to use the Numeric#zero? predicate instead:
```
return 1 if variance1.zero? && variance2.zero?
```
Avoid loops: In general, you should avoid loops as much as possible. You should prefer high-level iterators such as inject, map, select, etc. Only if that is not possible, use each. In case you need an infinite loop, use Kernel#loop. Don't use while true and most definitely do not use while 1. For example,
```
variance1 = 0
variance2 = 0
array1.each do |x|
 variance1 += (mean1 - x)**2
end
array2.each do |x|
 variance2 += (mean2 - x)**2
end
```
becomes
```
variance1 = array1.inject(0) {|acc, x| acc + (mean1 - x)**2 }
variance2 = array2.inject(0) {|acc, x| acc + (mean2 - x)**2 }
```
Redundant return: In Ruby, the last expression evaluated inside a block, method, module, or class is the return value. So, instead of return value, you can just write value.
Magic number: What does 0.57236494292470009 mean? You should give it a name.

*Use Array#first instead of [0]: Replace

beta = Math.lgamma(a)[0] + 0.57236494292470009 - Math.lgamma(a + 0.5)[0]

with

beta = Math.lgamma(a).first + 0.57236494292470009 - Math.lgamma(a + 0.5).first

freeze objects that are not meant to be modified, especially ones that are assigned to constants:

CORRECT_ANSWERS = [0.021378001462867, 0.148841696605327, 0.0359722710297968,
 0.090773324285671, 0.0107515611497845, 0.00339907162713746, 
 0.52726574965384, 0.545266866977794].freeze

Frozen string literals: In the same vein, you should make a habit of adding the magic comment
```
# frozen_string_literal: true
```
to all your files.
Commented out code: You should not comment out code. That's what a version control system is for.

Duplication: There is a fair amount of duplication in your code, e.g.

mean1 = array1.sum / array1.size
mean2 = array2.sum / array2.size

or

variance1 = 0.0
variance2 = 0.0
array1.each do |x|
 variance1 += (mean1 - x)**2
end
array2.each do |x|
 variance2 += (mean2 - x)**2
end

You could remove this duplication e.g. by extracting the logic into a method.

Complexity: The method is very complex, according to pretty much any metric (Cyclomatic Complexity, NPath Complexity, ABC, Perceived Complexity, ...). You should break it up into multiple methods.
Documentation: All your methods should have a documentation comment describing their purpose, usage, and pre- and post-conditions. Also, while in general, code should describe itself, in your case, you should document the algorithm and the names of the variables. Presumably, those names come from a well-known paper, in that case, you should cite that paper. (Note: in general, variable names like x, xx, pp, etc. should be avoided because they do not reveal the intent of the code, but here, I assume they actually are intention-revealing names because they are well-known to a reader versed in statistics. Still, you should document where those names come from.)

Question 3

thanks, this code is a translation of C, which in turn is a translation of Fortran, so that's where a lot of the quirkiness comes from. The variable names were given by someone else, like you said, I'm not entirely sure what they represent

Question 4

Some general guidelines:

Try to keep methods short (+- 10 lines max). Each method should serve a somewhat distinct purpose.
Use clear variable names

Prefer inline if-statements:

return 1.0 if array1.size <= 1

if array1.size <= 1
 return 1.0
end

Another case:

return 1.0 if mean1 == mean2

if mean1 == mean2
 return 1.0
end

A big one:

 return value if a <= 0
 return value if value < 0.0 || value > 1.0
 return value if (value == 0) || (value == 1.0)

if a <= 0
 return value
end
if value < 0.0 || 1.0 < value
 return value
end
if value == 0 or value == 1.0
 return value
end

You get the point.

Jörg W Mittag Jörg W Mittag 2,74417 silver badges17 bronze badges · Accepted Answer · 2019-09-06 19:49:40Z

Warnings

You should always turn warnings on during development. Your code generates 6 warnings, all of which are easy to fix:

test.rb:1: warning: parentheses after method name is interpreted as an argument list, not a decomposed argument
test.rb:51: warning: mismatched indentations at 'else' with 'if' at 45
test.rb:56: warning: mismatched indentations at 'end' with 'if' at 45
test.rb:93: warning: mismatched indentations at 'end' with 'while' at 64
test.rb:64: warning: literal in condition
test.rb:95: warning: mismatched indentations at 'end' with 'def' at 1

Formatting

You should use your text editor's auto-format feature.

Linting

If you want to make sure that your code is idiomatic, you should use an automated linting tool that checks for violations of Community Coding Style Guidelines, common mistakes, complexity, etc. One popular tool for this is Rubocop. If I run Rubocop with the default settings on your code, I get a whopping 408 violations. Thankfully, 391 of those can be auto-corrected, leaving only 17, of which 13 are trivially correctable.

Consistency

Consistency is very important when writing code. Actually, it is very important when writing anything technical. If there are two different ways to write something, and you use two different ways to write it, people are going to assume that you wanted to express two different things, and thus they will waste time trying to figure out why you expressed the same thing two different ways.

It is important not only to be consistent with yourself within your code, but also with the community as a whole. There are Community Coding Style Guidelines in pretty much every community, and usually, there are also tools that automatically format your code according to those guidelines.

Here are a couple of examples:

Indentation
- Amount: Sometimes, you indent by 1 column, sometimes by 2, sometimes by 4. There does not seem to be any logic whatsoever as to which amount of indentation you choose. Idiomatic would be 2. For example, the first three lines:
```
def calculate_Pvalue (array1, array2)
 if array1.size <= 1
 return 1.0
```
- When to indent: Sometimes, you indent the body of a loop or a conditional expression, sometimes you don't. Again, there does not seem to be any logic to it. Idiomatic would be to indent. Also, sometimes you indent just randomly in the middle of a sequence of expressions. Also in this case, I can't see any logic to it. Idiomatic would be to not indent.
Whitespace
- Around Operators: Sometimes, you put whitespace around operators, sometimes not. Idiomatic would be to use whitespace. Personally, I sometimes use different amounts of whitespace to emphasize precedence rules, e.g. I would write a*b + c, but I can't find such a logic in your case, for example here: variance1 / array1.size + variance2/array2.size
- Around Commas: Sometimes, you have no whitespace around commas, sometimes you have whitespace after a comma. Idiomatic would be to have 1 space after a comma.
- Within parentheses: Sometimes, you have whitespace after the opening and before the closing parenthesis, sometimes not. Idiomatic would be to have no whitespace.
Operators
- Boolean keyword operators: Sometimes you use the operator symbols (&& / ||), sometimes you use the keywords (and / or). I cannot detect any logic as to when you use which. Idiomatic would be to use the symbol versions. Example:
```
if value < 0.0 || 1.0 < value
 return value
end
if value == 0 or value == 1.0
 return value
end
```
- Compound assignment: Sometimes, you use compound assignment (a += b) and sometimes the long form (a = a + b). Idiomatic would be to use the compound form, unless it severely hinders readability. Examples: variance1 += (mean1 - x)**2 and psq = psq + 1.0.
Naming: sometimes, you spell p-value pvalue and sometimes Pvalue. You should be consistent and pick one. Idiomatic would be pvalue or p_value. Example:
```
pvalue = calculate_Pvalue(d1,d2)
```

Naming

Snake Case: Names of methods, parameters, local variables, instance variables, class hierarchy variables, and global variables should be written in snake_case. Constants should be written in SCREAMING_SNAKE_CASE. The exception are constants that refer to modules or classes, and methods that are closely related to classes (E.g. Kernel#Hash, Kernel#Float), which are written in PascalCase. So, calculate_Pvalue should be either calculate_pvalue or calculate_p_value.
Intention-revealing names: Names should be intention-revealing. There are couple of names in your code that don't really help with understanding, e.g. temp and value.

Layout

Whitespace: Your code is all mashed together with no room to breathe. You should break up logical blocks with whitespace to make the flow easier to follow.
Indentation: You already start your code indented by 1 space. Your code should start with no indentation. Then, blocks should be indented bxy 2 spaces.
No whitespace after method name: both when sending a message (calling a method) and when defining a method, there should be no whitespace between the name and the opening parenthesis of the argument list / parameter list:
```
def calculate_Pvalue (array1, array2)
```

Semantics

Useless conditional:
```
value = 1.0 - value if indx
```
There is no way that indx can ever become falsy in your code (it will always be either 0 or 1, both of which are truthy), so this conditional is redundant and should just be:
```
value = 1.0 - value
```
Redundant conditionals:
```
if value < 0.0 || 1.0 < value
 return value
end
if value == 0 or value == 1.0
 return value
end
```
You return the value if it is less than zero and you also return the value if it is equal to zero. That is the same thing as returning the value if it is less than or equal to zero. Same for the other conditional. Replace with
```
if value <= 0.0 || 1.0 <= value
 return value
end
```
Use Range#cover?: Actually, the above conditional is even better written as
```
unless (0.0..1.0).cover?(value)
 return value
end
```
Use modifier-if for single-expression conditionals and guard clauses: prefer
```
return value unless (0.0..1.0).cover?(value)
```
Useless use of Float literals: In the above and many other places, you could equally well use Integer literals. An explicit Float literal is only required when you specifically need to call a Float method, e.g. for division. So, just use 1 and 0 instead.
Use Numeric#zero? predicate instead of testing for equality with 0: There are multiple places where you test for equality with 0. It would be more idiomatic to use the Numeric#zero? predicate instead:
```
return 1 if variance1.zero? && variance2.zero?
```
Avoid loops: In general, you should avoid loops as much as possible. You should prefer high-level iterators such as inject, map, select, etc. Only if that is not possible, use each. In case you need an infinite loop, use Kernel#loop. Don't use while true and most definitely do not use while 1. For example,
```
variance1 = 0
variance2 = 0
array1.each do |x|
 variance1 += (mean1 - x)**2
end
array2.each do |x|
 variance2 += (mean2 - x)**2
end
```
becomes
```
variance1 = array1.inject(0) {|acc, x| acc + (mean1 - x)**2 }
variance2 = array2.inject(0) {|acc, x| acc + (mean2 - x)**2 }
```
Redundant return: In Ruby, the last expression evaluated inside a block, method, module, or class is the return value. So, instead of return value, you can just write value.
Magic number: What does 0.57236494292470009 mean? You should give it a name.

*Use Array#first instead of [0]: Replace

beta = Math.lgamma(a)[0] + 0.57236494292470009 - Math.lgamma(a + 0.5)[0]

with

beta = Math.lgamma(a).first + 0.57236494292470009 - Math.lgamma(a + 0.5).first

freeze objects that are not meant to be modified, especially ones that are assigned to constants:

CORRECT_ANSWERS = [0.021378001462867, 0.148841696605327, 0.0359722710297968,
 0.090773324285671, 0.0107515611497845, 0.00339907162713746, 
 0.52726574965384, 0.545266866977794].freeze

Frozen string literals: In the same vein, you should make a habit of adding the magic comment
```
# frozen_string_literal: true
```
to all your files.
Commented out code: You should not comment out code. That's what a version control system is for.

Duplication: There is a fair amount of duplication in your code, e.g.

mean1 = array1.sum / array1.size
mean2 = array2.sum / array2.size

or

variance1 = 0.0
variance2 = 0.0
array1.each do |x|
 variance1 += (mean1 - x)**2
end
array2.each do |x|
 variance2 += (mean2 - x)**2
end

You could remove this duplication e.g. by extracting the logic into a method.

Complexity: The method is very complex, according to pretty much any metric (Cyclomatic Complexity, NPath Complexity, ABC, Perceived Complexity, ...). You should break it up into multiple methods.
Documentation: All your methods should have a documentation comment describing their purpose, usage, and pre- and post-conditions. Also, while in general, code should describe itself, in your case, you should document the algorithm and the names of the variables. Presumably, those names come from a well-known paper, in that case, you should cite that paper. (Note: in general, variable names like x, xx, pp, etc. should be avoided because they do not reveal the intent of the code, but here, I assume they actually are intention-revealing names because they are well-known to a reader versed in statistics. Still, you should document where those names come from.)

thanks, this code is a translation of C, which in turn is a translation of Fortran, so that's where a lot of the quirkiness comes from. The variable names were given by someone else, like you said, I'm not entirely sure what they represent

Stack Exchange Network

Calculating p-value for Welch's t-test

2 Answers 2

Warnings

Formatting

Linting

Consistency

Naming

Layout

Semantics

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Calculating p-value for Welch's t-test

2 Answers 2

Warnings

Formatting

Linting

Consistency

Naming

Layout

Semantics

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions