I have this data structure to which the user can add number or remove them, both in constant time. Also, there is a \$O(1)\$ operation for querying the standard deviation of the current contents of the structure:
StandardDeviationNumberSet.java
package net.coderodde.stat;
import java.util.HashMap;
import java.util.Map;
public class StandardDeviationNumberSet {
private double sum;
private double squareSum;
private int numberOfElements;
private final Map<Double, Integer> countMap = new HashMap<>();
public void add(Number number) {
double numberValue = number.doubleValue();
Integer count = countMap.get(numberValue);
if (count != null) {
countMap.put(numberValue, count + 1);
} else {
countMap.put(numberValue, 1);
}
sum += numberValue;
squareSum += numberValue * numberValue;
numberOfElements++;
}
public double remove(Number number) {
double numberValue = number.doubleValue();
Integer count = countMap.get(numberValue);
if (count != null) {
if (count > 1) {
countMap.put(numberValue, count - 1);
} else {
countMap.remove(numberValue);
}
sum -= numberValue;
squareSum -= numberValue * numberValue;
numberOfElements--;
return numberValue;
} else {
return Double.NaN;
}
}
public double getStandardDeviation() {
checkSetHasAtleastTwoElements();
double step1 = squareSum - sum * sum / numberOfElements;
double step2 = step1 / (numberOfElements - 1);
return Math.sqrt(step2);
}
private void checkSetHasAtleastTwoElements() {
if (numberOfElements < 2) {
throw new IllegalStateException(
"Computing the standard deviation requires at least two " +
"elements.");
}
}
public static void main(String[] args) {
StandardDeviationNumberSet set = new StandardDeviationNumberSet();
set.add(1);
set.add(1);
set.add(3);
System.out.println(set.getStandardDeviation());
set.remove(1);
System.out.println(set.getStandardDeviation());
}
}
As always, any critique is much appreciated.
2 Answers 2
Overall I don't see any major problems with the code. There are some things I would change though:
Write unit tests.
Whenever I deal with mathematical classes, I think that it is obligatory to provide unit tests to verify correct computation and computational precision.
Add JavaDoc
Please add a JavaDoc to clarify whether it computes sample or population statistics.
Add comments
The use numberValue
instead of number
in the look ups to countMap
has a significant but subtle difference that is easy to miss if you're not an expert. I would add comments to clarify this (you are correctly avoiding Integer(3)
and Double(3)
getting different counts).
Use map.getOrDefault
Since Java 8, which we have to assume is wide spread by this point, you can replace:
Integer count = countMap.get(numberValue);
if (count != null) {
countMap.put(numberValue, count + 1);
} else {
countMap.put(numberValue, 1);
}
with
countMap.put(numberValue, countMap.getOrDefault(numberValue, 0) + 1);
Consider handling NaN
and Inf
inputs
Currently your code will silently break if any of the inputs contain NaN
or Infinity
. I would consider testing for these inputs and throwing if they are encountered.
Return type of remove
is weird
As far as I can tell remove
will return the input value or NaN
if the input value was not removed. To me this feels very strange, I would much rather it'd return true if the statistics changed and false otherwise.
Change API to use double
instead of Number
Currently your methods like public void add(Number number)
first unbox the number into a double
then rebox it into a Double
. So calling add
with a Double
causes unnecessary boxing and unboxing.
Further more calling add
with any of the primitive types (int
,long
,float
etc) causes first a boxing to Number
, then an unboxing to double
then a re-boxing to Double
(to use in countMap.get
). This is a lot of unnecessary boxing and unboxing and will also generate a compiler warning on many systems.
So I would change:
public void add(Number number) {
double numberValue = number.doubleValue();
Integer count = countMap.get(numberValue);
if (count != null) {
countMap.put(numberValue, count + 1);
} else {
countMap.put(numberValue, 1);
}
sum += numberValue;
squareSum += numberValue * numberValue;
numberOfElements++;
}
to:
public void add(double number) {
// JIT will use CSE to consolidate the auto boxings of number.
countMap.put(number, countMap.getOrDefault(number, 0) + 1);
sum += number;
squareSum += number* number;
numberOfElements++;
}
public void add(Number number){
add(number.doubleValue());
}
and similarily for remove
.
I would prefer to use the primitive double
in the API because more often than not the user will do some computations to end up with the input value and will thus have a primitive double
that they want to pass in. Other primitive types will automatically be type promoted to double. In the odd case that the user actually has a Number
of some sort, we add an overload to handle it.
Add methods for querying the count and mean as well
You have all the data to get the mean value and sample count as well and these are commonly used in conjunction with the standard deviation so to me it makes sense to provide these methods as well as they are trivial.
Edit/Addendum
I would consider leaving the validation of removed values to the user. By having the countMap
you add \$O\left(n\right)\$ memory requirement even if the user doesn't need the remove ability. And in many cases where you do need a remove (like a sliding window) the removal is guaranteed to be valid by the definition of the window.
I only have a very few nitpicks:
It's not technically \$O(1)\,ドル it's amortized constant time (\$O(1)\$), as you distribute cost of the calculation over the additions / removals. Not really an issue, but maybe you should make that distinction in your documentation.
Use
countMap.put(number, ...)
instead ofcountMap.put(numberValue, ...)
, as it preserves the caller's object identities that way. This would entail changing the type parameterization ofcountMap
to beMap<Number, Integer>
, though.Considering comments (especially by @EmilyL), you should leave the implementation as-is, but you should put in a comment that you are only interested in the numeric value, not the actual
Number
object.Consider marking your formal parameters as
final
, as you do not mutate them.Make
remove()
's return typevoid
, as the user has no need to get the same number which was passed as an argument returned. Rather, throw an exception, preferablyjava.util.NoSuchElementException
if the relevantnumber
could not be found incountMap
.An alternative to (4) is to make
remove()
's return typeboolean
, and returntrue
on successful removal ofnumber
from the set, orfalse
if the removal was not successful.Instead of throwing an
IllegalStateException
when asked to compute the standard deviation when the set has less than 2 elements, returnDouble.NaN
for such sets, as in the 0- & 1-element sample sizes, the standard deviation is mathematically undefined. That does not warrant an exception, as there is a value which indicates exactly undefined:NaN
. Here I assume that the standard deviation is calculated for a sample, not a population, because in that case the standard deviation for 0- & 1-element sets would be 0 (although computation of population statistics is not common in practical usage).
-
\$\begingroup\$ Using
countMap.put(number
instead ofnumberValue
is incorrect from a mathematical point of view becauseassertEquals(new Integer(3), new Double(3.0))
will fail. And thus integer 3 and double 3.0 would appear as different counts although they are the same value. OPs code is correct. Although it should be commented because it is far from obvious why it is done that way. \$\endgroup\$Emily L.– Emily L.2017年03月27日 08:00:59 +00:00Commented Mar 27, 2017 at 8:00 -
\$\begingroup\$ @Emily L. HashMap grows as well when the load factor is exceeded. It's still constant time amortized. \$\endgroup\$coderodde– coderodde2017年03月27日 10:31:48 +00:00Commented Mar 27, 2017 at 10:31
-
\$\begingroup\$ @coderodde yeah, my bad. Forgot to consider the put operation. \$\endgroup\$Emily L.– Emily L.2017年03月27日 10:49:39 +00:00Commented Mar 27, 2017 at 10:49
-
\$\begingroup\$ @EmilyL. Right you are. I clarified. However, keep your comment around as otherwise my answer loses context. \$\endgroup\$Tamoghna Chowdhury– Tamoghna Chowdhury2017年03月27日 11:20:33 +00:00Commented Mar 27, 2017 at 11:20