As part of my training, I implemented a dataset class in C++ that allows me to create synthetic datasets to get more familiar with features that C++ offers such as object oriented programming.
This implementation uses a simple method to create two intertwined spirals as shown in the following image.
The output looks as follows:
x y label
-3.85616 -0.987058 0
1.16 4.6764 0
2.66279 2.13692 1
-0.643551 3.81989 0
-0.252046 -0.888411 0
Please be as hard as possible with this implementation and give me constructive feedback.
I would appreciate advice especially in the following areas:
- Code style (readability, naming conventions, etc...)
- Class design
- Efficiency (how to avoid unnecessary complexity)
- Reinventing the wheel (does the STL offer functionality that I should use?)
main.cpp
#include <iostream>
#include <vector>
#include "data.hpp"
int main() {
unsigned int n_points = 3000;
Data data (n_points);
std::vector<Point> d;
d = data.get();
for (std::vector<Point>::iterator it = d.begin(); it != d.end(); ++it) {
std::cout << it->x << " " << it->y << " " << it->label << std::endl;
}
return 0;
}
data.hpp
#ifndef DATA_H
#define DATA_H
#include <vector>
#include <cmath>
#include "util.hpp"
class Data {
public:
Data (unsigned int);
unsigned int n_samples;
unsigned int n_dims = 2;
unsigned int n_classes = 2;
std::vector<Point> get();
private:
Random random;
std::vector<Point> data;
std::vector<Point> generate_data();
};
Data::Data (unsigned int _n_samples) {
n_samples = _n_samples;
}
std::vector<Point> Data::generate_data() {
std::vector<Point> data(n_samples);
std::vector<Point>::iterator it;
double noise = 1.0;
double spread = 4.0;
for (it = data.begin(); it != data.end(); ++it) {
double r = 3.14 * spread * random.rand();
double r1 = -cos(r) * r + noise * random.rand();
double r2 = sin(r) * r + noise * random.rand();
if (random.rand() < 0.5) {
it->x = r1;
it->y = r2;
it->label = 1;
} else {
it->x = -r1;
it->y = -r2;
it->label = 0;
}
}
return data;
}
std::vector<Point> Data::get() {
std::vector<Point> data(n_samples);
std::vector<Point>::iterator it;
data = generate_data();
return data;
}
#endif /* DATA_H */
util.hpp
#ifndef UTIL_H
#define UTIL_H
#include <cstdlib>
#include <ctime>
struct Point {
double x;
double y;
unsigned int label;
};
class Random {
public:
Random ();
double rand();
};
Random::Random () {
std::srand(static_cast<unsigned int>(std::time(nullptr)));
}
double Random::rand() {
return static_cast<double> (std::rand()) / static_cast<double> (RAND_MAX + 1U);
}
#endif /* UTIL_H */
5 Answers 5
Here are some things that may help you improve your program.
Separate interface from implementation
The interface goes into a header file and the implementation (that is, everything that actually emits bytes including all functions and data) should be in a separate .cpp
file. The reason is that you might have multiple source files including the .h
file but only one instance of the corresponding .cpp
file. In other words, split your existing data.hpp
and util.hpp
files into a .h
file and a .cpp
file.
Understand the use of namespaces
Your code is correctly using <cmath>
instead of <math.h>
. The difference between the two forms is that the former defines things within the std::
namespace versus into the global namespace. So this implies that instead of using cos
and sin
, the code should be using std::cos
and std::sin
Use const
where practical
There are a number of named constants in the program such as unsigned int n_dims = 2;
. It's very good practice to name these, but even better is to declare them const
or better still constexpr
. The reason is that it allows the compiler to perform better optimizations without sacrificing readability.
Don't use std::endl
if you don't really need it
The difference betweeen std::endl
and '\n'
is that '\n'
just emits a newline character, while std::endl
actually flushes the stream. This can be time-consuming in a program with a lot of I/O and is rarely actually needed. It's best to only use std::endl
when you have some good reason to flush the stream and it's not very often needed for simple programs such as this one. Avoiding the habit of using std::endl
when '\n'
will do will pay dividends in the future as you write more complex programs with more I/O and where performance needs to be maximized.
Use "range for
" and simplify your code
The code currently includes these lines:
for (std::vector<Point>::iterator it = d.begin(); it != d.end(); ++it) {
std::cout << it->x << " " << it->y << " " << it->label << std::endl;
}
Since C++11, however, we have a range for
which simplifies it to this:
for (const auto& point : d) {
std::cout << point.x << " " << point.y << " " << point.label << '\n';
}
Write a stream inserter for your class
We can further refine the for
loop above by creating a custom inserter for the Point
class:
std::ostream& operator<<(std::ostream& out, const Point& point) {
return out << point.x << " " << point.y << " " << point.label;
}
Now our loop in main is even simpler:
for (const auto& point : d) {
std::cout << point << '\n';
}
Use standard library algorithms
We can use std::copy
to copy from the vector to std::cout
, replacing the loop with a single statement:
std::copy(d.begin(), d.end(), std::ostream_iterator<Point>(std::cout, "\n"));
Consider using a better random number generator
If you are using a compiler that supports at least C++11, consider using a better random number generator. In particular, instead of rand
, you might want to look at std::uniform_real_distribution
and friends in the <random>
header. Also, instead of doing this:
if (random.rand() < 0.5) {
it->x = r1;
it->y = r2;
it->label = 1;
} else {
it->x = -r1;
it->y = -r2;
it->label = 0;
}
I would suggest using a std::bernoulli_distribution
.
Rethink the classes
Rather than having the vaguely named Data
class creating Point
s, I'd suggest moving that functionality into the Point
class. Since this is actually creating a Point
, it makes sense to me to create a custom constructor with this functionality:
Point::Point() {
static std::random_device rd;
static std::mt19937 gen(rd());
static std::uniform_real_distribution<> spread(0.0, 4.0*3.14);
static std::uniform_real_distribution<> noise(0.0, 1.0);
static std::bernoulli_distribution type(0.5);
static constexpr double multiplier[]{-1.0, +1.0};
label = type(gen);
double r = spread(gen);
x = multiplier[label] * (-std::cos(r) * r + noise(gen));
y = multiplier[label] * (+std::sin(r) * r + noise(gen));
}
Minimize memory use
It's not critical for this program, but rather than storing everything in a vector
only to print it and then destroy the vector, a more memory efficient approach would be to generate each point as needed and emit it directly. Using the Point
constructor and stream inserter as defined above, this becomes trivial.
Result
Here's the entire program as rewritten with all of these ideas:
#include <random>
#include <iostream>
#include <iterator>
#include <cmath>
struct Point {
Point();
double x;
double y;
unsigned int label;
};
Point::Point() {
static std::random_device rd;
static std::mt19937 gen(rd());
static std::uniform_real_distribution<> spread(0.0, 4.0*3.14);
static std::uniform_real_distribution<> noise(0.0, 1.0);
static std::bernoulli_distribution type(0.5);
static constexpr double multiplier[]{-1.0, +1.0};
label = type(gen);
double r = spread(gen);
x = multiplier[label] * (-std::cos(r) * r + noise(gen));
y = multiplier[label] * (+std::sin(r) * r + noise(gen));
}
std::ostream& operator<<(std::ostream& out, const Point& point) {
return out << point.x << " " << point.y << " " << point.label;
}
int main() {
for (auto i{3000}; i; --i) {
std::cout << Point() << '\n';
}
}
-
\$\begingroup\$ Rather than
for (auto i{3000}; i; --i) ...
, you could end withstd::generate_n(std::ostream_iterator<Point>(std::cout, "\n"), 3000, []{ return Point(); })'
\$\endgroup\$Caleth– Caleth2021年01月02日 23:01:11 +00:00Commented Jan 2, 2021 at 23:01 -
\$\begingroup\$ @Caleth: True, and I considered it but opted for the simpler version shown. \$\endgroup\$Edward– Edward2021年01月02日 23:04:16 +00:00Commented Jan 2, 2021 at 23:04
-
\$\begingroup\$ You are seeding the pseudorandom number generator with a single
unsigned int
from the random device. The state size ofstd::mt19937
is 19937 bits, so it is unlikely (although technically permitted by the standard) that a singleunsigned int
will be enough: it will work but will only allow access to a tiny subset of the possible initial states. It would be better to use astd::seed_seq
instead. \$\endgroup\$user235580– user2355802021年01月03日 18:17:33 +00:00Commented Jan 3, 2021 at 18:17 -
\$\begingroup\$ @antispinwards: you're right that the current seed is not very good. However, I'm not sure how to do better, even with
std::seed_seq
. See this for example. Ideas on how to better address this would be most welcome! \$\endgroup\$Edward– Edward2021年01月03日 18:22:47 +00:00Commented Jan 3, 2021 at 18:22 -
\$\begingroup\$ @Edward - yes,
std::seed_seq
is less than ideal (barring the C++ folks having a fit of sanity and bugfixing it in a way that doing the correct thing with sufficient data is given priority over doing something with insufficient data), but that doesn't mean we should just throw up our hands and seed the generator in a way that can provide at most 2^32 possibilities (for common definitions ofunsigned int
) out of 2^19937. 2^32 isn't a particularly large number these days for this kind of thing. \$\endgroup\$user235580– user2355802021年01月04日 09:19:04 +00:00Commented Jan 4, 2021 at 9:19
This is a nice, readable piece of code. I would just highlight a few things you might want to consider.
Unnecessary Class
Let's take a close look at Random
class. It holds no state, just a member function rand
. In addition to having just one member function, it is only used in one place, in member function generate_data
. Classes are meant to be reusable and Random
does not serve that purpose. A common solution is to create a free function that randomly generate numbers and returns the output.
Prefer free-functions to member-functions
.
Readability Review
It is nice to spell out the iterator name fully but most times, auto
keyword looks more readable. This means this code
for (std::vector<Point>::iterator it = d.begin(); it != d.end(); ++it) {
std::cout << it->x << " " << it->y << " " << it->label << std::endl;
}
can be rewritten as
for(const auto& point : points)
std::cout << point << '\n'
NOTE: You have to define std::cout
that takes your class as it's second argument.
Use good namings
There is much going on in generate_data
and can get someone easily confused. avoid one letter variable as much as possible. Anyone reading your code without proper context on what r
and x
means would easily get confused. Your names should carry intent and they should explain what's going at first glance.
Nitpicks
- This piece of code looks ugly
std::vector<Point> Data::generate_data() {
std::vector<Point> data(n_samples);
....
}
which data are we really assigning to member-variable
data or local-variable
data? In this case, we are creating a local-variable data. Avoid shadowing your member-variables.
- Prefer member-initialization to assignment. This piece of code
Data::Data (unsigned int _n_samples) { // initialized in member initialization list
n_samples = _n_samples;
}
creates and default initializes n_samples
and also make the assignment. This could have been simply initialized at creation time. A proper way of doing that is as follow
Data::Data (unsigned int _n_samples) :
n_samples{_n_samples}
- Your naming convention are inconsistent, sometimes you prefix with
_
other times you don't. Choose a style and stick with it.
Perspective
The answers that were already provided are great for solving the particular problem you have. I will provide feedback based on my current workflow. I implement a tool for those who would like to make quick C++ prototypes without much regard to engineering, which I then fix and make more of a complete solution.
Underlying problem
What you want to do is a language to describe relation of random variables. The rest is about loops and choosing distributions, sources of random variables, etc. Embedded languages in C++ are written either using macros or expression templates. Since we don't need any source level modifications, lets stick with the latter.
The moving parts
There are already many libraries for random number generation. Some do not like standard C++ facilities for that, but might have desirable properties not provided by standard facilities (for example curand has GPU support, which is very useful in case you need to generate a lot of data). We need a way to latch onto existing libraries. The most feasible way in my opinion is to provide CRTP classes that will add the layer we need:
Providing distribution support
Strict typing
Container filling (
std::ranges
, iterators)
Rough sketch of it would look like this:
template <typename RandomSource, typename Distribution>
class independent_distributed_variable {
Distribution distribution;
public:
template <typename ... Args>
random_variable(Args&& ... args):
distribution(std::forward<Args>(args)...)
{}
auto operator()() {
auto& source = static_cast<RandomSource&>(*this);
return distribution(source);
}
template <typename OutputIterator>
void operator()(OutputIterator first, OutputIterator) {
// use `std::generate` and delegate to singular version
// or use batch filling
}
};
One might still have to write some wrapper classes (in case interfaces are not compatible). This is not the most important problem though.
The random variable description language
This is the most "C++" part of the problem. Lets nail down the usage first. The random variables in your case:
\$R \sim N(0.5, 0.15)\$
\$label\ \sim B(1, 0.5)\$
\$X \sim (-\cos (4 * \pi * R) + R) * (-1)^{label}\$
\$Y \sim (\sin (4 * \pi * R) + R) * (-1)^{label}\$
(Sorry if I butchered the terminology and notation, just wanted to write it down in a more readable way)
The closest approximation in C++ would be (given the library namespace shino
):
auto r = shino::normal_distribution(0.5, 0.15);
auto label = shino::bernoulli_distribution(0.5);
auto x = (-shino::cos(4 * shino::pi * r) + r) * shino::power(-1, label);
auto y = (shino::sin(4 * shino::pi * r) + r) * shino::power(-1, label);
Then people could just fill their own struct with a loop or std::generate
if they want. I will give motivations later on leaving this part open ended.
Implementation guidelines
The naive implementation is pretty easy. Lets consider sine expression.
template <typename AlphaVariable>
class sine_expression {
AlphaVariable alpha;
public:
sine_expression(AlphaVariable alpha):
alpha(alpha)
{}
auto operator()() {
return std::sin(alpha()); // evaluate the dependency first
}
};
template <typename AlphaVariable>
sine_expression<AlphaVariable> sin(AlphaVariable alpha) {
return {alpha};
}
There is one problem that springs out: the implementer will need to differentiate between temporary expressions and expressions saved into a variable (sin(x + y)
vs just sin(x)
, the former will have temporary plus_expression
, the latter will have lvalue of whatever type is x
). The problem is definitely solvable, albeit requiring more code.
Importance of user provided data structures
In computer vision field, most trained models I encountered want tightly packed image pixels (e.g. 3 channel image of float channels will have pixels at addresses multiple of 12, not 16). By default compiler will pad the struct of 3 floats to 16 bytes, which will break the inference. As such there is sometimes a need to fill weird structs and write at unusual memory offsets.
-
\$\begingroup\$ There is also an edge case:
r
is "rolled" each time per usage, whilelabel
is not. I will try to fix that tomorrow and amend my answer. \$\endgroup\$Incomputable– Incomputable2021年01月02日 19:39:58 +00:00Commented Jan 2, 2021 at 19:39 -
\$\begingroup\$ I like this approach. I had also wondered about using C++20 range adapters. \$\endgroup\$Edward– Edward2021年01月03日 00:03:02 +00:00Commented Jan 3, 2021 at 0:03
-
1\$\begingroup\$ what's "shino"? \$\endgroup\$Noone AtAll– Noone AtAll2021年01月03日 23:10:46 +00:00Commented Jan 3, 2021 at 23:10
-
\$\begingroup\$ @NooneAtAll, it is the character in my avatar. I usually either call my library namespaces with strongly related words or with the nickname I like. \$\endgroup\$Incomputable– Incomputable2021年01月05日 14:50:26 +00:00Commented Jan 5, 2021 at 14:50
Design:
Data
is not a descriptive name, and also misrepresents what the class does. It's really a generator function that's triggered by calling get()
.
As such, it would be better to replace it with a function, called something like generate_spirals()
.
We should use the C++ standard library random
header for random number generation. This gives us various random number generators, e.g. std::mt19937
, that we could use instead of the custom Random
class.
Lastly, I'd suggest doing the labelling and inversion as separate stages. So we instead generate two spirals using the same code, and invert one of them. That way we don't need the label in the point class, we can just store two separate vectors of (x, y)
coordinates and add the label when printing them.
So something like:
#include <random>
#include <vector>
#include <cmath>
struct Point {
double x;
double y;
};
std::vector<Point> generate_spiral(unsigned int samples, double noise, double spread, std::mt19937& rng) {
auto points = std::vector<Point>();
points.reserve(samples);
auto dist = std::uniform_real_distribution<double>(0.0, 1.0);
for (auto n = 0u; n != samples; ++n) {
double r = 3.14 * spread * dist(rng);
double x = -std::cos(r) * r + noise * dist(rng);
double y = std::sin(r) * r + noise * dist(rng);
points.push_back({ x, y });
}
return points;
}
std::vector<Point> invert_points(std::vector<Point> points) {
for (auto& p : points) {
p.x = -p.x;
p.y = -p.y;
}
return points;
}
#include <iostream>
int main() {
auto rng = std::mt19937(std::random_device()());
auto samples = 200u;
auto noise = 1.0;
auto spread = 4.0;
auto s1 = generate_spiral(samples, noise, spread, rng);
auto s2 = invert_points(generate_spiral(samples, noise, spread, rng));
for (auto const& p : s1)
std::cout << p.x << " " << p.y << " " << "1" << "\n";
for (auto const& p : s2)
std::cout << p.x << " " << p.y << " " << "0" << "\n";
}
Note the range-based for loops using auto
make the output code much less verbose.
Note that << std::endl
flushes the output as well as adding a new line. Since we don't need the flushing behavior, we should use << "\n"
instead.
This is more an amendment to Edward's answer, but it's a bit too long to do in comments.
In order to seed the pseudorandom number generator (PRNG), you should provide sufficient data to initialise the PRNG state.
The common (and I would argue obvious) approach of
std::random_device rd;
std::mt19937 gen(rd());
would be nice if it worked but this is C++ where the obvious way of doing things is often wrong. The reason this is insufficient is that operator()
on std::random_device
returns an unsigned int
, which is permitted to be as small as 16 bits (but will probably be 32 bits on common systems). Meanwhile, the std::mt19937
PRNG has an internal state of 19937 bits, which suggests the above approach provides much too little data.
The approach the standard library takes is to use std::seed_seq
, which takes a sequence of numbers, combining the lower 32 bits of each. In order to provide this, we can use a vector:
std::vector<std::uint_least32_t> seed_data;
We also need to get std::uint_least32_t
out of the random device that supplies unsigned int
(which might be only 16 bits), so we use a uniform distribution:
std::uniform_int_distribution<std::uint_least32_t> rd_dist;
As an aside: this will provide a distribution over the whole range of std::uint_least32_t
: if you're worried about "wasting entropy" on a hypothetical system where this type is wider than 32 bits you could explicitly specify the range and hope that the provided implementation takes advantage of this to pull less data from the std::random_device
:
std::uniform_int_distribution<std::uint_least32_t> rd_dist{ 0, UINT32_C(0xffffffff) };
The above is almost certainly unnecessary though.
The amount of data we need to fill this vector can be determined from properties on the std::mt19937
type. Unfortunately C++ does not directly provide a property that gets this directly, but it can be computed as follows:
constexpr std::size_t seed_size = std::mt19937::state_size * (std::mt19937::word_size / 32);
If you decide to replace std::mt19937
with std::mt19937_64
or another instantiation of the std::mersenne_twister_engine
template, just replace the std::mt19937
in the above line and it should still work.
So we then fill the vector:
seed_data.reserve(seed_size);
std::generate_n(std::back_inserter(seed_data), seed_size, [&]() { return rd_dist(rd); });
Then create the seed sequence and create the generator:
std::seed_seq seed(seed_data.cbegin(), seed_data.cend());
std::mt19937 gen{ seed };
Putting that all together:
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <random>
#include <vector>
/* ... */
std::random_device rd;
std::uniform_int_distribution<std::uint_least32_t> rd_dist;
std::vector<std::uint_least32_t> seed_data;
constexpr std::size_t seed_size = std::mt19937::state_size * (std::mt19937::word_size / 32);
seed_data.reserve(seed_size);
std::generate_n(std::back_inserter(seed_data), seed_size, [&]() { return rd_dist(rd); });
std::seed_seq(seed_data.cbegin(), seed_data.cend());
std::mt19937 gen{ seed_seq };
I'm sure most people would agree that's a perfectly succinct and obvious way to initialise a PRNG. 🙃
As noted by Edward in the comments to their answer, this is still not ideal due to issues in the way std::seed_seq
combines its values, and the fact that in their wisdom the C++ folks decided that it was perfectly ok to have std::random_device
be deterministic (rather than, say, being a compile error on systems that don't provide a sufficient source of entropy, which would alert the programmers to the issue early on rather than being a potential source of runtime bugs).
Nevertheless, the general principle you should follow is to provide sufficient entropy to match the state of the PRNG, rather than just a single unsigned int
.
-
1\$\begingroup\$ This is a valuable contribution. I am going to see about basing a question on this in the next few days. Thanks very much for taking the time to write it up! \$\endgroup\$Edward– Edward2021年01月04日 18:17:56 +00:00Commented Jan 4, 2021 at 18:17