I find that I often want to calculate summary statistics from collections of data and decided to create a template to do that. I'm interested in a general review, but I also have some specific questions.
- Is the use of the lambda "too clever?"
- Is
std::accumulate
OK? Specifically, I'm thinking of the potential for overflow for large collections of small data types. - Is the template missing anything?
- Should I be using something such as
std::enable_if
to assure that theIter
type is actually an iterator?
The code to be reviewed is in stats.h
but stats.cpp
shows how it can be used.
stats.h
#ifndef STATS_H
#define STATS_H
#include <cmath>
#include <numeric>
#include <iterator>
template <typename T, typename Iter>
struct Statistics {
Statistics(Iter start, Iter end, bool sample=true) :
n(std::distance(start, end)),
sum(std::accumulate(start, end, 0)),
mean(sum / n),
variance([&start, &end, sample, this]()->T {
T sumsq = 0;
for (auto it=start; it != end; ++it) {
sumsq += (*it-mean)*(*it-mean);
}
return sumsq/(n- (sample ? 1 : 0));
}()
),
sdev(sqrt(variance))
{
}
unsigned n;
T sum;
T mean;
T variance;
T sdev;
};
#endif //STATS_H
stats.cpp
#include <iostream>
#include <vector>
#include <list>
#include "stats.h"
template <typename T>
std::ostream &print(std::ostream &out, const T &stats) {
out << "n = " << stats.n << '\n';
out << "sum = " << stats.sum << '\n';
out << "mean = " << stats.mean << '\n';
out << "var = " << stats.variance << '\n';
out << "sdev = " << stats.sdev << '\n';
return out;
}
int main()
{
std::vector<int> sample{9, 2, 5, 4, 12, 7};
Statistics<float, decltype(sample.cbegin())> stats(sample.cbegin(), sample.cend());
print(std::cout, stats);
std::list<float> sample2{2.0, 4.0, 5.0, 7.0, 9.0, 12.0};
Statistics<double, decltype(sample2.cbegin())> stats2(sample2.cbegin(), sample2.cend());
print(std::cout, stats2);
}
Sample output
n = 6
sum = 39
mean = 6.5
var = 13.1
sdev = 3.61939
n = 6
sum = 39
mean = 6.5
var = 13.1
sdev = 3.61939
-
\$\begingroup\$ The lambda isn't too bad. I would argue the too clever part is actually the fact that your initialization depends on the declaration order of the variables. If some maintainer comes along and for some reason (OCD or other) rearranges the order of the variable declarations, your initialization code becomes incorrect. If you performed all of those calculations in the constructor, your code would be more robust to such changes. I can't imagine there would be any performance hit at all in a release/optimized compilation. \$\endgroup\$jliv902– jliv9022015年01月26日 19:41:10 +00:00Commented Jan 26, 2015 at 19:41
-
1\$\begingroup\$ Just as a note, there is an accumulator library in boost that does this and a lot, lot more. \$\endgroup\$Yuushi– Yuushi2015年01月29日 15:36:30 +00:00Commented Jan 29, 2015 at 15:36
3 Answers 3
Use of T
for accumulator and result
I question the decision to use a template parameter T
as the result and accumulator type. Especially in the calculation of the variance, using T
as accumulator and mean value will cause significant errors in the calculated standard deviation and variance. The mean value may be off by at most \$\lim_{x\rightarrow 1}x\$. I cannot think of any situation where if you are interested in statistics, you would want to use anything else than double
.
Clever lambda
While the lambda trick is easy to understand, I question the necessity of it when the same code could have been implemented in the constructor body and with less typing. This would also allow to use double for the mean internally and only truncate to T
after assigning, this would bound the error on the variance and standard deviation. Edit: at least if the result is within the domain of T
.
Use of std::accumulate
For small types, say short
the risk of overflow is large. For double
you run the risk of truncation errors when there are many samples. For accuracy see my answer here: Function that normalizes a vector of double. The same applies here and depends on the level of accuracy that you require, seeing as it is statistics, I wouldn't be too worried.
What about min/max value?
One stat that I'm missing is the maximum and minimum value.
Performance
What if I don't need the standard deviation but only the mean? You might want to implement the sum, mean and stdev as separate classes so the user can choose if she only needs the mean for example.
Naming
I find the name Statistics
a bit too generic. The name may suggest statistics about anything (like for example memory usage for the day) but here it means the sample count, sum, mean, variance and standard deviation.
Documentation
The variable sample
wasn't immediately obvious that it was used to distinguish between sample and population statistics. A comment for the constructor explaining the arguments might be in order.
Checking for iterator type
The compiler will emit an error if the type used for iterator isn't suitable, so strictly speaking this isn't really necessary if it complicates the code. Also I'm not completely sure how you would test that a fulfills the iterator concept in a useful way. I mean you can test for the existence of std::iterator_trait<Iter>::iterator_category
but that can have false negatives.
But I never turn down a nice error message telling me what I'm doing wrong instead of a horrendous template substitution failure wall of text that MSVC emits... :)
-
\$\begingroup\$ Thanks for the review. Regarding the use of
T
as a template parameter, I actually had written it asdouble
and then made a last minute change. Based on your feedback, I think I'll change it back. \$\endgroup\$Edward– Edward2015年01月26日 18:41:41 +00:00Commented Jan 26, 2015 at 18:41 -
\$\begingroup\$ Have you a suggested alternative name for
Statistics
? I couldn't think of a better one. \$\endgroup\$Edward– Edward2015年01月26日 18:42:27 +00:00Commented Jan 26, 2015 at 18:42 -
\$\begingroup\$ Finding a name is difficult, maybe
MathematicalStatistics
or have it wrapped in a namespacemath
would resolve the ambiguity. MaybeDescriptiveStatistics
orSummaryStatistics
? \$\endgroup\$Emily L.– Emily L.2015年01月26日 18:49:28 +00:00Commented Jan 26, 2015 at 18:49
I'd use std::accumulate
for variance
as well, along the lines of (F
being a template type for variance
):
variance(std::accumulate(start, end, 0, [](F x, T y) -> F { return x + y*y; })/(n - 1))
Also, care must be taken for sample == true
and n == 1
.
-
\$\begingroup\$ Is there an advantage to using
std::accumulate
and a lambda as contrasted with just a lambda? \$\endgroup\$Edward– Edward2015年01月26日 19:54:45 +00:00Commented Jan 26, 2015 at 19:54 -
\$\begingroup\$ I strictly adhere to "no raw loops" mantra. The loop represents an algorithm, and deserves a name; if one already exists in a standard library, use it. \$\endgroup\$vnp– vnp2015年01月26日 20:35:29 +00:00Commented Jan 26, 2015 at 20:35
You could compute all these statistics in one pass, instead of two in your implementation. This could have big performance repercussions if your input data is large.
-
\$\begingroup\$ Could you elaborate a little more in your answer please? \$\endgroup\$Malachi– Malachi2015年07月23日 14:56:07 +00:00Commented Jul 23, 2015 at 14:56
Explore related questions
See similar questions with these tags.