[NOTE] This question can be considered deprecated in favor of version 0.32.
This is a code revision of a previous post and works well.
The purpose of this code is to produce a universe of points, randomly generated around predetermined centroids, provided by the user either as a vector of vectors or from a file. The final product is an output of sample points produced around the centroids, to be used for fake data analysis in another program. The objective here is brevity and speed. This most recent version is about 100 lines between the .h and .cpp.
Changes in This Version:
- Reporting is completely gone. This returned the .h and .cpp back down to ~100 lines of code.
- I/O is now passed by ostream only (instead of string), allowing for console output and other uses.
- cluster_set has been simplified - members have been decreased significantly.
- cluster_set now maintains a function pointer toward a distribution which can be modified by the user.
- Variables have been renamed for clarity.
- consts have been removed for the time being; ran into errors, particularly when consting any of the parameters except unsigned int.
- Got rid of a default constructor to force-intake centroids on any construction; this makes object instancing more fluid (there is no reason to have a centroid-empty cluster_set).
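As a small illustration of the ostream-only I/O change, here is a minimal sketch. The names write_point and point_to_string are hypothetical and not part of clustergen; the point is only that a function taking std::ostream & can write to std::cout, to a std::ofstream, or to an in-memory std::ostringstream without any changes.

```cpp
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of clustergen): writes one point to any ostream,
// separating the dimensions with the given delimiter.
void write_point(std::ostream& out, const std::vector<double>& point, char delimiter) {
    for (std::size_t i = 0; i < point.size(); ++i) {
        if (i != 0) { out << delimiter; }
        out << point[i];
    }
}

// The same function can target an in-memory buffer, which is convenient for tests.
std::string point_to_string(const std::vector<double>& point, char delimiter) {
    std::ostringstream buffer;
    write_point(buffer, point, delimiter);
    return buffer.str();
}
```

Calling write_point(std::cout, point, ',') prints to the console, while passing an open std::ofstream writes the same text to a file.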
My goal is two-fold: speed and conciseness. Speed takes precedence over conciseness, but thankfully they tend to go hand in hand.
Future Implementations:
I am very happy with the state of this program so far (nearly every major issue has been addressed)...
- I've been successfully using a time-based randomized seed, however it's been suggested (a number of times) that I switch to std::random_device{}(). However, this seed is not producing random results - I get the same "random" set of points every time I run the program. Perhaps I'm implementing incorrectly?
- The way this is currently coded, the default distribution must exist outside of the cluster_set member functions. The distribution function pointer (a member of cluster_set) then points to this global function. I run into scoping issues with the function pointer if I try to include the default function as a member variable, and I feel this would bloat the code unnecessarily to try to force this to work. One thing I like about the current state is the "smallness" of the code. It just feels a little "dangling" to have a global function in a .h.
- The distribution currently works on a dimension-by-dimension basis: you pass in a dimension's value, you get back a random number near that value. In v0.2 it was observed that it may be valuable to code this on a point-by-point basis (you pass in an entire point, you get back an entire randomized dimensional set near that point). When I attempted this, it bloated the code - it is also something not needed for my purposes.
- Error handling.
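On the seeding question, one possible approach (a sketch, not a tested fix for any particular toolchain) is to mix std::random_device output with a clock reading through std::seed_seq. If random_device turns out to be deterministic on a given platform, the clock bits still vary between runs. The names make_seeded_engine and sample_near are hypothetical, and std::mt19937 is swapped in for std::default_random_engine purely for illustration.

```cpp
#include <chrono>
#include <cstdint>
#include <random>

// Hypothetical helper: builds an engine seeded from two sources, so that a
// deterministic std::random_device does not make every run identical.
std::mt19937 make_seeded_engine() {
    std::random_device device;
    const auto ticks =
        std::chrono::high_resolution_clock::now().time_since_epoch().count();
    std::seed_seq seed{
        static_cast<std::uint32_t>(device()),
        static_cast<std::uint32_t>(device()),
        static_cast<std::uint32_t>(ticks),
        static_cast<std::uint32_t>(static_cast<std::uint64_t>(ticks) >> 32)
    };
    return std::mt19937(seed);
}

// Same shape as default_distribution: returns a normally distributed value
// centered on the given dimension, with a standard deviation of 1.
double sample_near(double center) {
    static std::mt19937 engine = make_seeded_engine();  // seeded once, on first call
    std::normal_distribution<double> distr(center, 1.0);
    return distr(engine);
}
```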
Notes:
- I understand this-> is a matter of personal preference in most cases. The reason I'm partial to it is I often code when I'm tired and this-> reminds me that I'm looking at a member variable, not at something else (like a function parameter).
- I am very happy with this update - shout out to Justin for your pivotal suggestions on v0.2 and v0.3. A thank you to the rest of the community as well!
clustergen.h
#ifndef CLUSTERGEN_H
#define CLUSTERGEN_H

#include <fstream>
#include <vector>

double default_distribution(double &); // The default distribution function (Normal)

class cluster_set {
    std::vector<std::vector<double>> centroids; // Centroids around which to evenly generate all points
    double (*distribution)(double &); // Changeable pointer to a distribution function
    void import_centroids(std::vector<std::vector<double>> &); // Import centroids from vector
public:
    cluster_set(std::ifstream &, char); // Import centroids from file with specified delimiter
    cluster_set(std::vector<std::vector<double>> &); // Import centroids from vector on construction
    void clustergen(unsigned int, std::ostream &, char);
    void set_distribution(double (*new_distribution)(double &)) { this->distribution = new_distribution; }
};

#endif //CLUSTERGEN_H
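Regarding the "dangling" global default distribution: one possible alternative, sketched below under my own assumptions, is to make the default a private static member function and store the distribution in a std::function. A static member function converts to an ordinary function pointer, so it should avoid the scoping problems that non-static members cause. The class name cluster_set_sketch and the sample member are hypothetical and exist only to keep the example short; the parameter is also taken by value here rather than by reference.

```cpp
#include <functional>
#include <random>
#include <utility>

// Hypothetical, trimmed-down stand-in for cluster_set; only the distribution
// handling is shown.
class cluster_set_sketch {
public:
    using distribution_fn = std::function<double(double)>;

    // The default lives inside the class instead of at namespace scope.
    cluster_set_sketch() : distribution(&cluster_set_sketch::default_distribution) {}

    // Accepts free functions, lambdas, or any other callable with this signature.
    void set_distribution(distribution_fn new_distribution) {
        distribution = std::move(new_distribution);
    }

    double sample(double dimension) const { return distribution(dimension); }

private:
    // Normal distribution centered on the dimension, standard deviation 1.
    static double default_distribution(double dimension) {
        static std::mt19937 gen(std::random_device{}());
        std::normal_distribution<double> distr(dimension, 1.0);
        return distr(gen);
    }

    distribution_fn distribution;
};
```

The trade-off is a little indirection overhead from std::function compared with a raw function pointer, which may or may not matter given the stated priority on speed.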
clustergen.cpp
#include "clustergen.h"
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <random>
#include <sstream>
#include <string>

double default_distribution(double & dimension) {
    static std::default_random_engine gen(std::chrono::system_clock::now().time_since_epoch().count()); // Random seed
    // static std::default_random_engine gen(std::random_device{}()); // Not randomizing??
    std::normal_distribution<double> distr(dimension, 1);
    return distr(gen);
}

// Import centroids from file with specified delimiter into a temporary vector - calls import_centroids()
cluster_set::cluster_set(std::ifstream & input_file, char delimiter) {
    this->distribution = default_distribution;
    std::string line;
    std::vector<std::vector<double>> temp_centroid_vector;
    while (std::getline(input_file, line)) {
        while ((line.length() == 0) && !(input_file.eof())) {
            std::getline(input_file, line); // Skips blank lines in file
        }
        std::string parameter;
        std::stringstream ss(line);
        std::vector<double> temp_point;
        if ((line.length() != 0)) {
            while (std::getline(ss, parameter, delimiter)) {
                temp_point.push_back(atof(parameter.c_str()));
            }
            temp_centroid_vector.push_back(temp_point);
        }
        this->import_centroids(temp_centroid_vector);
    }
}

// Import centroids from vector on construction
cluster_set::cluster_set(std::vector<std::vector<double>> & centroid_vector) {
    this->distribution = default_distribution;
    this->import_centroids(centroid_vector);
}

// Primary centroid import function
void cluster_set::import_centroids(std::vector<std::vector<double>> & centroid_vector) {
    for (auto centroid_vector_iter = centroid_vector.begin(); centroid_vector_iter != centroid_vector.end(); ++centroid_vector_iter) {
        if (this->centroids.empty()) {
            this->centroids.push_back(*centroid_vector_iter);
        } else if (centroid_vector_iter->size() == this->centroids.front().size()) { // Assures dimensional integrity
            this->centroids.push_back(*centroid_vector_iter);
        }
    }
}

// Primary cluster generator - aborts if no centroids have been imported.
void cluster_set::clustergen(unsigned int k, std::ostream & output, char delimiter) {
    if (this->centroids.empty()) {
        output << "ERROR: No centroids have been imported. Aborting operation.";
        return;
    }
    if (k < this->centroids.size()) { k = this->centroids.size(); }
    const unsigned int n = k / this->centroids.size(); // Evenly distributes points across centroids
    unsigned rem = k % this->centroids.size(); // Evenly distributes points across centroids
    for (auto centroid_iter = this->centroids.begin(); centroid_iter != this->centroids.end(); ++centroid_iter) {
        unsigned int subset = n + (rem ? 1 : 0); // Evenly distributes points across centroids
        while (subset) {
            std::vector<double> temp_point;
            for (auto dimension_iter = centroid_iter->begin(); dimension_iter != centroid_iter->end(); ++dimension_iter) {
                temp_point.push_back(distribution(*dimension_iter));
            }
            for (auto temp_point_iter = temp_point.begin(); temp_point_iter != temp_point.end(); ++temp_point_iter) {
                if (temp_point_iter != temp_point.begin()) { output << delimiter; }
                output << (*temp_point_iter);
            }
            if (subset - 1) { output << "\n"; }
            --subset;
            if (rem) { --rem; } // Evenly distributes points across centroids
        }
        auto centroid_iter_peek = centroid_iter;
        ++centroid_iter_peek;
        if (centroid_iter_peek != centroids.end()) { output << "\n"; }
    }
}
main.h
(Contains a few examples)
#include "clustergen.h"
#include <chrono>
#include <iostream>
#include <random>

double new_distribution(double & dimension) {
    static std::default_random_engine gen(std::chrono::system_clock::now().time_since_epoch().count()); // Random seed
    // static std::default_random_engine gen(std::random_device{}()); // Not randomizing??
    std::uniform_int_distribution<int> distr(-5, 5);
    return dimension + 10 * distr(gen);
}

int main() {
    std::vector<std::vector<double>> v = {{-100, -100},
                                          {100, 100},
                                          {1000, 1000}};
    std::vector<std::vector<double>> v2 = {{-100, -100},
                                           {1},
                                           {100, 100},
                                           {1, 2, 3},
                                           {1000, 1000}}; // Dimensional mismatch
    std::ostream &output_console = std::cout;
    std::ofstream output_file;

    cluster_set my_clusters(v); // Vector constructor
    my_clusters.clustergen(11, output_console, ','); // Generate 11 random points to the console (',' delimited)

    std::ofstream out2;
    out2.open("clustergen_out_2.dat");
    cluster_set my_clusters2(v2); // Vector constructor with invalid dimensional points (omitted)
    my_clusters2.set_distribution(new_distribution); // Setting a user-defined distribution
    my_clusters2.clustergen(11, out2, ','); // Generate 11 random points to "clustergen_out_2.dat" (',' delimited)

    std::ifstream v3;
    v3.open("clustergen_in.dat");
    std::ofstream out3;
    out3.open("clustergen_out_3.dat");
    cluster_set my_clusters3(v3,'$'); // File const. with user-spec. delimiter - blank lines and invalid dimensions omitted
    my_clusters3.clustergen(13, out3, '@'); // Generate 13 random points to "clustergen_out_3.dat" ('@' delimited)
}
clustergen_in.dat
(Used in the main.h
examples above)
210$220
230$240
250$260$270
280$290
200
1 Answer
By community recommendation, I will wait 48 hours before posting a version update. Two significant errors have been discovered and will be corrected in the next version:
Erratic Point Generation
The call to this->import_centroids() in cluster_set::cluster_set(std::ifstream & input_file, char delimiter) needs to be one level up, outside of the while loop. This was resulting in strange behavior, ultimately not generating the correct number of points in the output file. The correct code is:
cluster_set::cluster_set(std::ifstream & input_file, char delimiter) {
    this->distribution = default_distribution;
    std::string line;
    std::vector<std::vector<double>> temp_centroid_vector;
    while (std::getline(input_file, line)) {
        while ((line.length() == 0) && !(input_file.eof())) {
            std::getline(input_file, line); // Skips blank lines in file
        }
        std::string parameter;
        std::stringstream ss(line);
        std::vector<double> temp_point;
        if ((line.length() != 0)) {
            while (std::getline(ss, parameter, delimiter)) {
                temp_point.push_back(atof(parameter.c_str()));
            }
            temp_centroid_vector.push_back(temp_point);
        }
    }
    this->import_centroids(temp_centroid_vector); // THIS HAS BEEN CORRECTED
}
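The parsing step can also be sketched in isolation, which makes the "collect every row first, import once afterwards" structure easier to test. This is a minimal sketch and not the code above: parse_centroids is a hypothetical free function, it reads from any std::istream rather than a std::ifstream, and it uses std::stod in place of atof so that non-numeric fields are skipped instead of silently becoming 0.0 (that skipping policy is my assumption).

```cpp
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical free function: reads delimiter-separated rows from a stream,
// skipping blank lines. Fields that std::stod cannot parse are skipped.
std::vector<std::vector<double>> parse_centroids(std::istream& input, char delimiter) {
    std::vector<std::vector<double>> rows;
    std::string line;
    while (std::getline(input, line)) {
        if (line.empty()) { continue; }  // blank line: nothing to parse
        std::vector<double> point;
        std::stringstream fields(line);
        std::string field;
        while (std::getline(fields, field, delimiter)) {
            try {
                point.push_back(std::stod(field));
            } catch (const std::exception&) {
                // Not a number (or out of range): skip this field.
            }
        }
        if (!point.empty()) { rows.push_back(point); }
    }
    return rows;  // the caller imports the complete set once, after the loop
}
```

With this shape, the constructor would only need to call import_centroids on the returned vector a single time.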
Incorrect Remainder Computation
rem was being decremented within the while(subset) loop, resulting in an erratically persistent remainder. Moving it one level up, outside the loop, corrected this:
void cluster_set::clustergen(unsigned int k, std::ostream & output, char delimiter) {
    if (this->centroids.empty()) {
        output << "ERROR: No centroids have been imported. Aborting operation.";
        return;
    }
    if (k < this->centroids.size()) { k = this->centroids.size(); }
    const unsigned int n = k / this->centroids.size(); // Evenly distributes points across centroids
    unsigned int rem = k % this->centroids.size(); // Evenly distributes points across centroids
    for (auto centroid_iter = this->centroids.begin(); centroid_iter != this->centroids.end(); ++centroid_iter) {
        unsigned int subset = n + (rem ? 1 : 0); // Evenly distributes points across centroids
        while (subset) {
            std::vector<double> temp_point;
            for (auto dimension_iter = centroid_iter->begin(); dimension_iter != centroid_iter->end(); ++dimension_iter) {
                temp_point.push_back(distribution(*dimension_iter));
            }
            for (auto temp_point_iter = temp_point.begin(); temp_point_iter != temp_point.end(); ++temp_point_iter) {
                if (temp_point_iter != temp_point.begin()) { output << delimiter; }
                output << (*temp_point_iter);
            }
            if (subset - 1) { output << "\n"; }
            --subset;
        }
        if (rem) { --rem; } // THIS HAS BEEN CORRECTED
        auto centroid_iter_peek = centroid_iter;
        ++centroid_iter_peek;
        if (centroid_iter_peek != centroids.end()) { output << "\n"; }
    }
}
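The intended split can be checked separately from the output logic. The sketch below uses a hypothetical helper, points_per_centroid, that computes how many points each centroid should receive: every centroid gets k / count points, and the first k % count centroids get one extra. It mirrors the clamp in clustergen that raises k to at least one point per centroid.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of points assigned to each centroid.
std::vector<unsigned int> points_per_centroid(unsigned int k, std::size_t centroid_count) {
    assert(centroid_count > 0);  // clustergen aborts earlier when there are no centroids
    const unsigned int count = static_cast<unsigned int>(centroid_count);
    if (k < count) { k = count; }        // at least one point per centroid
    const unsigned int n = k / count;    // base share for every centroid
    const unsigned int rem = k % count;  // leftover points, one each to the first rem centroids
    std::vector<unsigned int> result(count, n);
    for (unsigned int i = 0; i < rem; ++i) { ++result[i]; }
    return result;
}
```

For k = 11 and three centroids this yields 4, 4, 3, which sums to 11. Decrementing rem once per centroid, as in the corrected loop, produces the same split.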
Comment: If std::random_device is completely not random (I think it returns a single int), there isn't much you can do, sadly (using only standard C++). It's definitely a flaw in MinGW's implementation, even though it is standard compliant. You may be able to use some sort of boost::random_device, though.