I noticed there's some code on StackOverflow that is, to put it mildly, suboptimal. I'm posting here to see if we can provide a better solution.
Here's a proposal for fixing it.
The approach is to provide a small utility class with static methods, so it can be used just like Math.method(...).
The standard deviation implements this formula, which applies to a sample.
The median function is taken from this answer, which is based on the reasonable assumption that the array can be sorted in memory.
import java.util.Arrays;

public class Statistics {

    public static double getMean(double[] nums) {
        double total = 0;
        for (double value : nums) {
            total += value;
        }
        return total / nums.length;
    }

    // sample version
    public static double getVariance(double[] nums) {
        if (nums.length <= 1) {
            return 0;
        }
        double sum = 0;
        double mean = getMean(nums);
        for (double value : nums) {
            sum += Math.pow(value - mean, 2);
        }
        return sum / (nums.length - 1); // notice the -1
    }

    // sample version
    public static double getStdDev(double[] nums) {
        return Math.sqrt(getVariance(nums));
    }

    public static double getMedian(double[] nums) {
        // Sort a copy so the caller's array is not reordered as a side effect.
        double[] sorted = nums.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        if (sorted.length % 2 == 0) {
            return (sorted[mid] + sorted[mid - 1]) / 2;
        } else {
            return sorted[mid];
        }
    }
}
Is there something you would add?
Assuming this works well, it only works for double arrays.
It does not work for float or int arrays (Java will not implicitly convert a float[] or an int[] to a double[]), nor does it work for Container classes, although all it would require to get it to work would be a copy-paste and some minimal changes.
Could Java 8 lambdas help make the code more generic and reduce the need for copy-paste?
Feel free to comment.
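On the lambda question, one possible direction is to let the caller supply a function that turns each element into a double, so that the same code serves any collection. This is only a sketch: the class name GenericStatistics and its method signatures are made up for illustration, and returning NaN for empty input is an assumption, not something the original class does.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical helper, not part of the original Statistics class.
class GenericStatistics {

    // Mean of any collection, given a function that maps an element to a double.
    static <T> double mean(Collection<T> items, ToDoubleFunction<? super T> toDouble) {
        return items.stream().mapToDouble(toDouble).average().orElse(Double.NaN);
    }

    // Median of any collection; the stream sorts its own copy of the values.
    static <T> double median(Collection<T> items, ToDoubleFunction<? super T> toDouble) {
        double[] sorted = items.stream().mapToDouble(toDouble).sorted().toArray();
        if (sorted.length == 0) {
            return Double.NaN; // assumption: NaN signals "no data"
        }
        int mid = sorted.length / 2;
        return sorted.length % 2 == 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
    }

    public static void main(String[] args) {
        List<Integer> ints = Arrays.asList(3, 4, 5, 8);
        System.out.println(mean(ints, Integer::doubleValue));
        System.out.println(median(ints, Integer::doubleValue));

        // An int[] can be widened without copy-pasting the class.
        int[] raw = {1, 2, 3};
        System.out.println(Arrays.stream(raw).asDoubleStream().average().orElse(Double.NaN));
    }
}
```

The extractor argument is what removes the need for a copy per element type: Integer::doubleValue works for boxed integers, and a lambda such as p -> p.getWeight() (a hypothetical accessor) would work for arbitrary objects.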
1 Answer
First of all, I object to the get... naming convention. "Get" implies that you are retrieving something that already exists (usually, though not always, paired with "set"). You wouldn't call Math.getCos(theta), would you?
The problems stem from accepting arrays in the first place. The class would be much more useful as an accumulator, like this:
Statistics stats = new Statistics();
stats.datum(3);
stats.datum(4);
stats.datum(5);
System.out.println(stats.mean());
System.out.println(stats.stdDev());
A simple way to calculate the mean, variance, and standard deviation is to keep running totals \$\sum x_i^0\$ (i.e., the count), \$\sum x_i^1\$ (i.e., the sum), and \$\sum x_i^2\$ (i.e., the sum of the squares). However, see Algorithms for calculating variance for a discussion of the merits of this method compared to others.
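To see why those three totals are enough, it may help to expand the sum of squared deviations. The symbols n, S_1 and S_2 below are shorthand introduced here for the count, the sum, and the sum of the squares; they do not appear in the original text.

```latex
\bar{x} = \frac{S_1}{n}, \qquad
\sum_{i=1}^{n} (x_i - \bar{x})^2
  = \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2
  = S_2 - \frac{S_1^2}{n}
```

Dividing this quantity by n gives the population variance, and dividing by n - 1 gives the sample variance used in the question. The subtraction on the right can lose precision when the mean is large compared to the spread of the data, which is one of the trade-offs the linked discussion covers.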
You could write
import java.util.OptionalDouble;

public class Statistics {
    private int sum0;          // count of the values seen so far
    private double sum1, sum2; // sum of the values, sum of their squares

    public void datum(double x) {
        this.sum0++;
        this.sum1 += x;
        this.sum2 += x * x;
    }

    public int count() {
        return this.sum0;
    }

    public double sum() {
        return this.sum1;
    }

    public OptionalDouble mean() {
        if (this.count() == 0) {
            return OptionalDouble.empty();
        } else {
            return OptionalDouble.of(this.sum() / this.count());
        }
    }

    // Population variance: divides by count(), not by count() - 1 as the question's sample version does.
    public OptionalDouble variance() {
        if (this.count() == 0) {
            return OptionalDouble.empty();
        } else {
            return OptionalDouble.of(
                (this.sum2 - this.sum() * this.sum() / this.count())
                / this.count()
            );
        }
    }

    public OptionalDouble stdDev() {
        if (this.count() == 0) {
            return OptionalDouble.empty();
        } else {
            return OptionalDouble.of(Math.sqrt(this.variance().getAsDouble()));
        }
    }
}
The median is trickier to calculate, as you would have to keep a list of all of the values.
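As a rough illustration of what keeping a list of all of the values could look like, here is a minimal sketch. The class name MedianAccumulator is hypothetical, and sorting a fresh copy on every call is a simplicity choice rather than a recommendation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.OptionalDouble;

// Hypothetical accumulator that remembers every value so a median can be computed.
class MedianAccumulator {
    private final List<Double> values = new ArrayList<>();

    public void datum(double x) {
        values.add(x);
    }

    public OptionalDouble median() {
        if (values.isEmpty()) {
            return OptionalDouble.empty();
        }
        // Sort a copy so that insertion order is preserved in the stored list.
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int mid = sorted.size() / 2;
        if (sorted.size() % 2 == 0) {
            return OptionalDouble.of((sorted.get(mid - 1) + sorted.get(mid)) / 2);
        }
        return OptionalDouble.of(sorted.get(mid));
    }
}
```

Unlike the running totals, memory use here grows with the number of values, which is the cost the paragraph above alludes to.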
Note that DoubleStream already gives you count(), sum(), and average().
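For reference, a short sketch of those built-in calls. The class name StreamStatsDemo and the helper summarize are invented for the example; DoubleStream and DoubleSummaryStatistics are the standard java.util classes.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;

// Hypothetical demo class showing what the standard library already provides.
class StreamStatsDemo {

    // Collects count, sum, min, max and average in a single pass over the array.
    static DoubleSummaryStatistics summarize(double[] nums) {
        return Arrays.stream(nums).summaryStatistics();
    }

    public static void main(String[] args) {
        double[] nums = {3, 4, 5};

        // Each terminal operation consumes its stream, so a new stream is created per call.
        System.out.println(Arrays.stream(nums).count());
        System.out.println(Arrays.stream(nums).sum());
        System.out.println(Arrays.stream(nums).average()); // an OptionalDouble

        DoubleSummaryStatistics stats = summarize(nums);
        System.out.println(stats.getAverage());
        System.out.println(stats.getMax());
    }
}
```

Neither DoubleStream nor DoubleSummaryStatistics offers a variance or a median, so those would still need code of your own.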
- I agree with removing the get part. I'm not sure the accumulator approach would be useful. Data can be accumulated outside of the class, and it usually is stored by other means, as you pointed out, there already are classes for that, such as DoubleStream. When somebody wants to calculate the median, they probably want to do that in one shot. (Agostino, Apr 14, 2015 at 16:53)
- So call Arrays.stream(nums).forEach(stats::datum); to pass the data to the stats object. (200_success, Apr 14, 2015 at 17:06)
- I'm not so sure that's KISS. It looks very close to the C++ iterator approach to me. Which may or may not be a good thing. It sure takes more space, though. (Agostino, Apr 14, 2015 at 17:11)
- Here's a post about calculating variance online, accounting for numerical precision (which is something we could consider). It provides C++ code as well. I like how the class is called RunningStat. Still, no median though. (Agostino, Apr 15, 2015 at 17:00)
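For anyone curious what a numerically careful running variance could look like in Java, below is a sketch of Welford's online algorithm, one well-known method for this problem. It borrows the RunningStat name from the comment above but is not a port of the linked post's code; the method names and the choice to return an empty OptionalDouble for fewer than two values are assumptions.

```java
import java.util.OptionalDouble;

// Sketch of Welford's online algorithm; names are illustrative only.
class RunningStat {
    private long count;
    private double mean;
    private double m2; // running sum of squared deviations from the current mean

    public void datum(double x) {
        count++;
        double delta = x - mean;  // deviation from the old mean
        mean += delta / count;    // update the mean incrementally
        m2 += delta * (x - mean); // uses both the old and the new mean
    }

    public long count() {
        return count;
    }

    public OptionalDouble mean() {
        return count == 0 ? OptionalDouble.empty() : OptionalDouble.of(mean);
    }

    // Sample variance (divides by count - 1), matching the question's convention.
    public OptionalDouble sampleVariance() {
        return count < 2 ? OptionalDouble.empty() : OptionalDouble.of(m2 / (count - 1));
    }

    public OptionalDouble sampleStdDev() {
        return count < 2 ? OptionalDouble.empty() : OptionalDouble.of(Math.sqrt(m2 / (count - 1)));
    }
}
```

Because it never subtracts two large, nearly equal totals, this form avoids the cancellation that the sum-of-squares approach can suffer from. It still offers no median.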