Meta-programming template for collection statistics

Question 1

I find that I often want to calculate summary statistics from collections of data and decided to create a template to do that. I'm interested in a general review, but I also have some specific questions.

Is the use of the lambda "too clever?"
Is std::accumulate OK? Specifically, I'm thinking of the potential for overflow for large collections of small data types.
Is the template missing anything?
Should I be using something such as std::enable_if to assure that the Iter type is actually an iterator?

The code to be reviewed is in stats.h but stats.cpp shows how it can be used.

stats.h

#ifndef STATS_H
#define STATS_H
#include <cmath>
#include <numeric>
#include <iterator>
template <typename T, typename Iter>
struct Statistics {
 Statistics(Iter start, Iter end, bool sample=true) :
 n(std::distance(start, end)),
 sum(std::accumulate(start, end, 0)),
 mean(sum / n),
 variance([&start, &end, sample, this]()->T { 
 T sumsq = 0;
 for (auto it=start; it != end; ++it) {
 sumsq += (*it-mean)*(*it-mean);
 }
 return sumsq/(n- (sample ? 1 : 0));
 }()
 ),
 sdev(sqrt(variance))
 {
 }
 unsigned n;
 T sum;
 T mean;
 T variance;
 T sdev;
};
#endif //STATS_H

stats.cpp

#include <iostream>
#include <vector>
#include <list>
#include "stats.h"
template <typename T>
std::ostream &print(std::ostream &out, const T &stats) {
 out << "n = " << stats.n << '\n';
 out << "sum = " << stats.sum << '\n';
 out << "mean = " << stats.mean << '\n';
 out << "var = " << stats.variance << '\n';
 out << "sdev = " << stats.sdev << '\n';
 return out;
}
int main()
{
 std::vector<int> sample{9, 2, 5, 4, 12, 7};
 Statistics<float, decltype(sample.cbegin())> stats(sample.cbegin(), sample.cend());
 print(std::cout, stats);
 std::list<float> sample2{2.0, 4.0, 5.0, 7.0, 9.0, 12.0};
 Statistics<double, decltype(sample2.cbegin())> stats2(sample2.cbegin(), sample2.cend());
 print(std::cout, stats2);
}

Sample output

n = 6
sum = 39
mean = 6.5
var = 13.1
sdev = 3.61939
n = 6
sum = 39
mean = 6.5
var = 13.1
sdev = 3.61939

Question 2

The lambda isn't too bad. I would argue the too clever part is actually the fact that your initialization depends on the declaration order of the variables. If some maintainer comes along and for some reason (OCD or other) rearranges the order of the variable declarations, your initialization code becomes incorrect. If you performed all of those calculations in the constructor, your code would be more robust to such changes. I can't imagine there would be any performance hit at all in a release/optimized compilation.

Question 3

Just as a note, there is an accumulator library in boost that does this and a lot, lot more.

Question 4

Use of `T` for accumulator and result

I question the decision to use a template parameter T as the result and accumulator type. Especially in the calculation of the variance, using T as accumulator and mean value will cause significant errors in the calculated standard deviation and variance. The mean value may be off by at most \$\lim_{x\rightarrow 1}x\$. I cannot think of any situation where if you are interested in statistics, you would want to use anything else than double.

Clever lambda

While the lambda trick is easy to understand, I question the necessity of it when the same code could have been implemented in the constructor body and with less typing. This would also allow to use double for the mean internally and only truncate to T after assigning, this would bound the error on the variance and standard deviation. Edit: at least if the result is within the domain of T.

Use of `std::accumulate`

For small types, say short the risk of overflow is large. For double you run the risk of truncation errors when there are many samples. For accuracy see my answer here: Function that normalizes a vector of double. The same applies here and depends on the level of accuracy that you require, seeing as it is statistics, I wouldn't be too worried.

What about min/max value?

One stat that I'm missing is the maximum and minimum value.

Performance

What if I don't need the standard deviation but only the mean? You might want to implement the sum, mean and stdev as separate classes so the user can choose if she only needs the mean for example.

Naming

I find the name Statistics a bit too generic. The name may suggest statistics about anything (like for example memory usage for the day) but here it means the sample count, sum, mean, variance and standard deviation.

Documentation

The variable sample wasn't immediately obvious that it was used to distinguish between sample and population statistics. A comment for the constructor explaining the arguments might be in order.

Checking for iterator type

The compiler will emit an error if the type used for iterator isn't suitable, so strictly speaking this isn't really necessary if it complicates the code. Also I'm not completely sure how you would test that a fulfills the iterator concept in a useful way. I mean you can test for the existence of std::iterator_trait<Iter>::iterator_category but that can have false negatives.

But I never turn down a nice error message telling me what I'm doing wrong instead of a horrendous template substitution failure wall of text that MSVC emits... :)

Question 5

Thanks for the review. Regarding the use of T as a template parameter, I actually had written it as double and then made a last minute change. Based on your feedback, I think I'll change it back.

Question 6

Have you a suggested alternative name for Statistics? I couldn't think of a better one.

Question 7

Finding a name is difficult, maybe MathematicalStatistics or have it wrapped in a namespace math would resolve the ambiguity. Maybe DescriptiveStatistics or SummaryStatistics?

Question 8

I'd use std::accumulate for variance as well, along the lines of (F being a template type for variance):

variance(std::accumulate(start, end, 0, [](F x, T y) -> F { return x + y*y; })/(n - 1))

Also, care must be taken for sample == true and n == 1.

Question 9

Is there an advantage to using std::accumulate and a lambda as contrasted with just a lambda?

Question 10

I strictly adhere to "no raw loops" mantra. The loop represents an algorithm, and deserves a name; if one already exists in a standard library, use it.

Question 11

You could compute all these statistics in one pass, instead of two in your implementation. This could have big performance repercussions if your input data is large.

Question 12

Could you elaborate a little more in your answer please?

Emily L. Emily L.Emily L. 16.7k1 gold badge39 silver badges89 bronze badges · Accepted Answer · 2015-01-26 17:51:53Z

Use of `T` for accumulator and result

I question the decision to use a template parameter T as the result and accumulator type. Especially in the calculation of the variance, using T as accumulator and mean value will cause significant errors in the calculated standard deviation and variance. The mean value may be off by at most \$\lim_{x\rightarrow 1}x\$. I cannot think of any situation where if you are interested in statistics, you would want to use anything else than double.

Clever lambda

While the lambda trick is easy to understand, I question the necessity of it when the same code could have been implemented in the constructor body and with less typing. This would also allow to use double for the mean internally and only truncate to T after assigning, this would bound the error on the variance and standard deviation. Edit: at least if the result is within the domain of T.

Use of `std::accumulate`

For small types, say short the risk of overflow is large. For double you run the risk of truncation errors when there are many samples. For accuracy see my answer here: Function that normalizes a vector of double. The same applies here and depends on the level of accuracy that you require, seeing as it is statistics, I wouldn't be too worried.

What about min/max value?

One stat that I'm missing is the maximum and minimum value.

Performance

What if I don't need the standard deviation but only the mean? You might want to implement the sum, mean and stdev as separate classes so the user can choose if she only needs the mean for example.

Naming

I find the name Statistics a bit too generic. The name may suggest statistics about anything (like for example memory usage for the day) but here it means the sample count, sum, mean, variance and standard deviation.

Documentation

The variable sample wasn't immediately obvious that it was used to distinguish between sample and population statistics. A comment for the constructor explaining the arguments might be in order.

Checking for iterator type

The compiler will emit an error if the type used for iterator isn't suitable, so strictly speaking this isn't really necessary if it complicates the code. Also I'm not completely sure how you would test that a fulfills the iterator concept in a useful way. I mean you can test for the existence of std::iterator_trait<Iter>::iterator_category but that can have false negatives.

But I never turn down a nice error message telling me what I'm doing wrong instead of a horrendous template substitution failure wall of text that MSVC emits... :)

Thanks for the review. Regarding the use of T as a template parameter, I actually had written it as double and then made a last minute change. Based on your feedback, I think I'll change it back.
Have you a suggested alternative name for Statistics? I couldn't think of a better one.
Finding a name is difficult, maybe MathematicalStatistics or have it wrapped in a namespace math would resolve the ambiguity. Maybe DescriptiveStatistics or SummaryStatistics?

Stack Exchange Network

Meta-programming template for collection statistics

stats.h

stats.cpp

Sample output

3 Answers 3

Use of `T` for accumulator and result

Clever lambda

Use of `std::accumulate`

What about min/max value?

Performance

Naming

Documentation

Checking for iterator type

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Meta-programming template for collection statistics

stats.h

stats.cpp

Sample output

3 Answers 3

Use of T for accumulator and result

Clever lambda

Use of std::accumulate

What about min/max value?

Performance

Naming

Documentation

Checking for iterator type

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions

Use of `T` for accumulator and result

Use of `std::accumulate`