Computing the standard deviation of a Number array in Java - follow-up

Question 1

Now I make only one pass over the data in order to compute the standard deviation of the input Numbers, and that's how:

\begin{align} \sum_{i = 1}^n (x_i - \mu)^2 &= \sum_{i = 1}^n (x_i^2 - 2\mu x_i + \mu^2) \\ &= \sum_{i = 1}^n x_i^2 - 2\mu \sum_{i = 1}^n x_i + n \mu^2 \\ &= \sum_{i = 1}^n x_i^2 - 2 \Bigg( \frac{\sum_{i = 1}^n x_i}{n} \Bigg) \sum_{i = 1}^n x_i + n \Bigg( \frac{\sum_{i = 1}^n x_i}{n} \Bigg)^2 \\ &= \sum_{i = 1}^n x_i^2 - 2 \Bigg( \frac{\sum_{i = 1}^n x_i}{n} \Bigg) \sum_{i = 1}^n x_i + \frac{1}{n}\Bigg( \sum_{i = 1}^n x_i \Bigg)^2 \\ &= \sum_{i = 1}^n x_i^2 - \frac{2}{n} \Bigg( \sum_{i = 1}^n x_i \Bigg) \sum_{i = 1}^n x_i + \frac{1}{n}\Bigg( \sum_{i = 1}^n x_i \Bigg)^2 \\ &= \sum_{i = 1}^n x_i^2 - \frac{2}{n} \Bigg( \sum_{i = 1}^n x_i \Bigg)^2 + \frac{1}{n}\Bigg( \sum_{i = 1}^n x_i \Bigg)^2 \\ &= \sum_{i = 1}^n x_i^2 - \frac{1}{n} \Bigg( \sum_{i = 1}^n x_i \Bigg)^2. \end{align}

My code is as follows:

StandardDeviation.java:

package net.coderodde.util;
import java.util.Arrays;
import java.util.Collection;
import java.util.Objects;
import java.util.function.Consumer;
public final class StandardDeviation {
 public static double computeStandardDeviation(final Number... array) {
 Objects.requireNonNull(array, "The input number array is null.");
 if (array.length == 0) {
 return Double.NaN;
 }
 MyConsumer myConsumer = new MyConsumer();
 Arrays.stream(array)
 .map(Number::doubleValue)
 .forEach(myConsumer);
 int n = array.length;
 double sum = myConsumer.getSum();
 double sumOfSquares = myConsumer.getSumOfSquares();
 double intermediate = sumOfSquares - sum * sum / n;
 return Math.sqrt(intermediate / (n - 1));
 }
 public static double
 computeStandardDeviation(final Collection<Number> collection) {
 Objects.requireNonNull(collection,
 "The input number collection is null");
 return computeStandardDeviation(
 collection.toArray(new Number[collection.size()]));
 }
 private StandardDeviation() {}
 private static final class MyConsumer implements Consumer<Double> {
 private double sum;
 private double sumOfSquares;
 @Override
 public void accept(Double t) {
 sum += t;
 sumOfSquares += t * t;
 }
 double getSum() {
 return sum;
 }
 double getSumOfSquares() {
 return sumOfSquares;
 }
 }
 public static void main(String[] args) {
 // Mix 'em all!
 double sd = computeStandardDeviation(Arrays.asList((byte) 1,
 (short) 2,
 3,
 4L,
 5.0f,
 6.0));
 System.out.println(sd);
 }
}

Any critique much appreciated.

Question 2

Boxing

Your current class MyConsumer implements Consumer<Double>. This means that every time an element is accepted, it will unboxed into a double.

@Override
public void accept(Double t) {
 sum += t; // <-- unboxed into double
 sumOfSquares += t * t; // <-- here also
}

On large datasets, this can have a huge performance impact. Instead, it would be preferable to implement the primitive DoubleConsumer interface, and thus change your accept method into

@Override
public void accept(double t) {
 // ...
}

This way, there won't be any overhead involving autoboxing.

This will also involve a change inside your main method. You will need to use mapToDouble instead of map.

Arrays.stream(array)
 .mapToDouble(Number::doubleValue)

This is also a good change to do: currently, each Number would be converted to a double, only to be boxed into a Double by map, then unboxed multiple times inside the consumer.

`collect` instead of `forEach`

You aren't using properly the Stream API in your main pipeline:

MyConsumer myConsumer = new MyConsumer();
Arrays.stream(array)
 .map(Number::doubleValue)
 .forEach(myConsumer);

The problem with this code is that it mixed both the functional approach of the Stream API with the traditional iterative forEach. This can generally break in multiple ways when you're using a parallel pipeline.

The Stream API was designed so that parallel capabilities can be introduced effectively. The case of forEach is even mentioned in the documentation:

A small number of stream operations, such as forEach() and peek(), can operate only via side-effects; these should be used with care.
[...]
the forEach() can simply be replaced with a reduction operation that is safer, more efficient, and more amenable to parallelization.

This is what collect (and the more general reduce) is for.

So let's make that parallel-friendly and take a look at collect(supplier, accumulator, combiner) method.

The supplier is a function returning the new result container.
The accumulator is a function that takes the current result container and the current Stream element and incorporates it into the container.
The combiner is a function that takes two result containers and merges them into one. This argument is only used in case of a parallel pipeline.

To adhere with that contract, we need to change MyConsumer:

MyConsumer isn't really a consumer anymore, it will be a class responsible for collecting the values into statistics having notably a sum and sumOfSquares. Let's rename it StatCollector (for lack of a better name). It does consume double as input so it makes sense to let it implement DoubleConsumer.
As seen by the above signature, it needs to handle a case of merging two containers, so we need to add a method public void combine(StatCollector other) that will be responsible for that.

This is what we end up with:

private static final class StatCollector implements DoubleConsumer {
 private double sum;
 private double sumOfSquares;
 @Override
 public void accept(double t) {
 sum += t;
 sumOfSquares += t * t;
 }
 
 public void combine(StatCollector other) {
 sum += other.sum;
 sumOfSquares += other.sumOfSquares;
 }
 double getSum() {
 return sum;
 }
 double getSumOfSquares() {
 return sumOfSquares;
 }
}

used with

StatCollector stat = 
 Arrays.stream(array)
 .mapToDouble(Number::doubleValue)
 .collect(StatCollector::new, StatCollector::accept, StatCollector::combine);

Tunaki Tunaki 9,3011 gold badge31 silver badges46 bronze badges · Answer 1 · 2016-04-12 19:03:19Z

Boxing

Your current class MyConsumer implements Consumer<Double>. This means that every time an element is accepted, it will unboxed into a double.

@Override
public void accept(Double t) {
 sum += t; // <-- unboxed into double
 sumOfSquares += t * t; // <-- here also
}

On large datasets, this can have a huge performance impact. Instead, it would be preferable to implement the primitive DoubleConsumer interface, and thus change your accept method into

@Override
public void accept(double t) {
 // ...
}

This way, there won't be any overhead involving autoboxing.

This will also involve a change inside your main method. You will need to use mapToDouble instead of map.

Arrays.stream(array)
 .mapToDouble(Number::doubleValue)

This is also a good change to do: currently, each Number would be converted to a double, only to be boxed into a Double by map, then unboxed multiple times inside the consumer.

`collect` instead of `forEach`

You aren't using properly the Stream API in your main pipeline:

MyConsumer myConsumer = new MyConsumer();
Arrays.stream(array)
 .map(Number::doubleValue)
 .forEach(myConsumer);

The problem with this code is that it mixed both the functional approach of the Stream API with the traditional iterative forEach. This can generally break in multiple ways when you're using a parallel pipeline.

The Stream API was designed so that parallel capabilities can be introduced effectively. The case of forEach is even mentioned in the documentation:

A small number of stream operations, such as forEach() and peek(), can operate only via side-effects; these should be used with care.
[...]
the forEach() can simply be replaced with a reduction operation that is safer, more efficient, and more amenable to parallelization.

This is what collect (and the more general reduce) is for.

So let's make that parallel-friendly and take a look at collect(supplier, accumulator, combiner) method.

The supplier is a function returning the new result container.
The accumulator is a function that takes the current result container and the current Stream element and incorporates it into the container.
The combiner is a function that takes two result containers and merges them into one. This argument is only used in case of a parallel pipeline.

To adhere with that contract, we need to change MyConsumer:

MyConsumer isn't really a consumer anymore, it will be a class responsible for collecting the values into statistics having notably a sum and sumOfSquares. Let's rename it StatCollector (for lack of a better name). It does consume double as input so it makes sense to let it implement DoubleConsumer.
As seen by the above signature, it needs to handle a case of merging two containers, so we need to add a method public void combine(StatCollector other) that will be responsible for that.

This is what we end up with:

private static final class StatCollector implements DoubleConsumer {
 private double sum;
 private double sumOfSquares;
 @Override
 public void accept(double t) {
 sum += t;
 sumOfSquares += t * t;
 }
 
 public void combine(StatCollector other) {
 sum += other.sum;
 sumOfSquares += other.sumOfSquares;
 }
 double getSum() {
 return sum;
 }
 double getSumOfSquares() {
 return sumOfSquares;
 }
}

used with

StatCollector stat = 
 Arrays.stream(array)
 .mapToDouble(Number::doubleValue)
 .collect(StatCollector::new, StatCollector::accept, StatCollector::combine);

Stack Exchange Network

Computing the standard deviation of a Number array in Java - follow-up

1 Answer 1

Boxing

`collect` instead of `forEach`

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Computing the standard deviation of a Number array in Java - follow-up

1 Answer 1

Boxing

collect instead of forEach

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions

`collect` instead of `forEach`