(See the first and initial iteration.)
Now I make only one pass over the data in order to compute the standard deviation of the input Number
s, and that's how:
\begin{align} \sum_{i = 1}^n (x_i - \mu)^2 &= \sum_{i = 1}^n (x_i^2 - 2\mu x_i + \mu^2) \\ &= \sum_{i = 1}^n x_i^2 - 2\mu \sum_{i = 1}^n x_i + n \mu^2 \\ &= \sum_{i = 1}^n x_i^2 - 2 \Bigg( \frac{\sum_{i = 1}^n x_i}{n} \Bigg) \sum_{i = 1}^n x_i + n \Bigg( \frac{\sum_{i = 1}^n x_i}{n} \Bigg)^2 \\ &= \sum_{i = 1}^n x_i^2 - 2 \Bigg( \frac{\sum_{i = 1}^n x_i}{n} \Bigg) \sum_{i = 1}^n x_i + \frac{1}{n}\Bigg( \sum_{i = 1}^n x_i \Bigg)^2 \\ &= \sum_{i = 1}^n x_i^2 - \frac{2}{n} \Bigg( \sum_{i = 1}^n x_i \Bigg) \sum_{i = 1}^n x_i + \frac{1}{n}\Bigg( \sum_{i = 1}^n x_i \Bigg)^2 \\ &= \sum_{i = 1}^n x_i^2 - \frac{2}{n} \Bigg( \sum_{i = 1}^n x_i \Bigg)^2 + \frac{1}{n}\Bigg( \sum_{i = 1}^n x_i \Bigg)^2 \\ &= \sum_{i = 1}^n x_i^2 - \frac{1}{n} \Bigg( \sum_{i = 1}^n x_i \Bigg)^2. \end{align}
My code is as follows:
StandardDeviation.java:
package net.coderodde.util;
import java.util.Arrays;
import java.util.Collection;
import java.util.Objects;
import java.util.function.Consumer;
public final class StandardDeviation {
public static double computeStandardDeviation(final Number... array) {
Objects.requireNonNull(array, "The input number array is null.");
if (array.length == 0) {
return Double.NaN;
}
MyConsumer myConsumer = new MyConsumer();
Arrays.stream(array)
.map(Number::doubleValue)
.forEach(myConsumer);
int n = array.length;
double sum = myConsumer.getSum();
double sumOfSquares = myConsumer.getSumOfSquares();
double intermediate = sumOfSquares - sum * sum / n;
return Math.sqrt(intermediate / (n - 1));
}
public static double
computeStandardDeviation(final Collection<Number> collection) {
Objects.requireNonNull(collection,
"The input number collection is null");
return computeStandardDeviation(
collection.toArray(new Number[collection.size()]));
}
private StandardDeviation() {}
private static final class MyConsumer implements Consumer<Double> {
private double sum;
private double sumOfSquares;
@Override
public void accept(Double t) {
sum += t;
sumOfSquares += t * t;
}
double getSum() {
return sum;
}
double getSumOfSquares() {
return sumOfSquares;
}
}
public static void main(String[] args) {
// Mix 'em all!
double sd = computeStandardDeviation(Arrays.asList((byte) 1,
(short) 2,
3,
4L,
5.0f,
6.0));
System.out.println(sd);
}
}
Any critique much appreciated.
1 Answer 1
Boxing
Your current class MyConsumer
implements Consumer<Double>
. This means that every time an element is accepted, it will unboxed into a double
.
@Override
public void accept(Double t) {
sum += t; // <-- unboxed into double
sumOfSquares += t * t; // <-- here also
}
On large datasets, this can have a huge performance impact. Instead, it would be preferable to implement the primitive DoubleConsumer
interface, and thus change your accept
method into
@Override
public void accept(double t) {
// ...
}
This way, there won't be any overhead involving autoboxing.
This will also involve a change inside your main method. You will need to use mapToDouble
instead of map
.
Arrays.stream(array)
.mapToDouble(Number::doubleValue)
This is also a good change to do: currently, each Number
would be converted to a double
, only to be boxed into a Double
by map
, then unboxed
multiple times inside the consumer.
collect
instead of forEach
You aren't using properly the Stream API in your main pipeline:
MyConsumer myConsumer = new MyConsumer();
Arrays.stream(array)
.map(Number::doubleValue)
.forEach(myConsumer);
The problem with this code is that it mixed both the functional approach of the Stream API with the traditional iterative forEach
. This can generally break in multiple ways when you're using a parallel pipeline.
The Stream API was designed so that parallel capabilities can be introduced effectively. The case of forEach
is even mentioned in the documentation:
A small number of stream operations, such as
forEach()
andpeek()
, can operate only via side-effects; these should be used with care.
[...]
theforEach()
can simply be replaced with a reduction operation that is safer, more efficient, and more amenable to parallelization.
This is what collect
(and the more general reduce
) is for.
So let's make that parallel-friendly and take a look at collect(supplier, accumulator, combiner)
method.
- The supplier is a function returning the new result container.
- The accumulator is a function that takes the current result container and the current Stream element and incorporates it into the container.
- The combiner is a function that takes two result containers and merges them into one. This argument is only used in case of a parallel pipeline.
To adhere with that contract, we need to change MyConsumer
:
MyConsumer
isn't really a consumer anymore, it will be a class responsible for collecting the values into statistics having notably asum
andsumOfSquares
. Let's rename itStatCollector
(for lack of a better name). It does consumedouble
as input so it makes sense to let it implementDoubleConsumer
.- As seen by the above signature, it needs to handle a case of merging two containers, so we need to add a method
public void combine(StatCollector other)
that will be responsible for that.
This is what we end up with:
private static final class StatCollector implements DoubleConsumer {
private double sum;
private double sumOfSquares;
@Override
public void accept(double t) {
sum += t;
sumOfSquares += t * t;
}
public void combine(StatCollector other) {
sum += other.sum;
sumOfSquares += other.sumOfSquares;
}
double getSum() {
return sum;
}
double getSumOfSquares() {
return sumOfSquares;
}
}
used with
StatCollector stat =
Arrays.stream(array)
.mapToDouble(Number::doubleValue)
.collect(StatCollector::new, StatCollector::accept, StatCollector::combine);