[NOTE] This question can be depreciated in favor of version 0.31.
This is a code revision of a previous post and works well.
The purpose of this code is to produce a universe of points, randomly generated around predetermined centroids, provided from the user as either a vector of vectors or from a file. The final product is a file of sample points produced around the centroids, to be used for fake data analysis in another program. The objective here is brevity and speed. This most recent version is still under 200 lines, which I find acceptable but would prefer not to get much longer unless there is significant utility-gain reasoning.
Changes in This Version:
- Placed
#include
s in more appropriate locations (within the appropriate.h
or.cpp
and within the include guard) - Modified strongly away from procedural to OO. While this nearly doubled the code (from ~100 to ~200 lines), this simplified user parameter input. Overall, I feel the utility gain was worth it for the line-number increase.
- Point generation has been forced into a Normal distribution (for now - please see "Future Implementations" below)
- I went through a major debate whether to instance all points in a vector then output to file, or simply output to file on the fly. Eventually the later option won out for two reasons: it severely reduces memory requirements for mega-sets and it makes for less code in the overall program.
- The move to OO changed nearly every aspect of the code; this is very different than the previous versions, but maintains similar output.
My goal is two-fold: speed and conciseness. Speed takes precedence over conciseness, but thankfully they tend to go hand in hand.
Future Implementations:
- Concisification is an ongoing process.
- Reporting is intrusive throughout the program - I'm almost considering getting rid of the report file entirely to reduce code. Creating a new function for reporting will not reduce code as far as I can tell. Would love some thoughts on this...
- User-supplied distribution function (it is currently forced to Normal). This is a priority for me - I attempted to instance a function pointer in the
cluster_set
object along with a default function, which could then be tweaked by the user to whatever statistical distribution they desire. This failed miserably, so I've forced the distribution to Normality for the time being.
Shout out to Edward and Justin - your input on v0.2 was invaluable!
clustergen.h
#ifndef CLUSTERGEN_H
#define CLUSTERGEN_H
#include <fstream>
#include <vector>
#include <string>
class cluster_set {
std::vector<std::vector<double>> centroids; // Centroids around which to evenly generate all points
std::string file_out_str; // The output file name - only settable on construction
std::string file_rpt_str; // The report file name - only settable on construction
std::ofstream file_out; // Instanced on clustergen() to prevent multiple outputs to the same file
std::ofstream file_rpt; // Appends throughout program
char delim_in; // Defaults to CSV - only used in file import
char delim_out; // Defaults to CSV
public:
cluster_set(); // Default settings
cluster_set(const std::string &, const std::string &); // User specified output and report file
void import_centroids(const std::string &); // File import - uses vector import
void import_centroids(const std::vector<std::vector<double>> &); // Vector import - primary import algo
void clustergen(const unsigned int); // Primary algorithm - generates [unsigned int] number of points
void set_delim_in(const char d) { this->delim_in = d; }; // User specified delimiter for input
void set_delim_out(const char d) { this->delim_out = d; }; // User specified delimiter for output
};
#endif //CLUSTERGEN_H
clustergen.cpp
#include "clustergen.h"
#include <sstream>
#include <iostream>
#include <chrono>
#include <random>
// Default everything
cluster_set::cluster_set(void) {
this->file_out_str = "clustergen_out.dat";
this->file_rpt_str = "clustergen_rpt.dat";
this->delim_in = ',';
this->delim_out = ',';
this->file_rpt.open(this->file_rpt_str);
this->file_rpt << "CLUSTERGEN STATUS REPORT FOLLOWS";
this->file_rpt.close();
}
// User specified output and report file
cluster_set::cluster_set(const std::string &fout, const std::string &rout) :
file_out_str(fout),
file_rpt_str(rout) {
this->delim_in = ',';
this->delim_out = ',';
this->file_rpt.open(this->file_rpt_str);
this->file_rpt << "CLUSTERGEN STATUS REPORT FOLLOWS";
this->file_rpt.close();
}
// Reads in a file to a vector then passes to the primary centroid import function
void cluster_set::import_centroids(const std::string &fin) {
std::ifstream file_in(fin);
this->file_rpt.open(this->file_rpt_str, std::ios_base::app);
this->file_rpt << "\n" << "\n" << "Importing file to centroid vector: \"" << fin << "\" ...";
std::string line;
std::vector<std::vector<double>> temp_centroid_vec;
while (std::getline(file_in, line)) {
while ((line.length() == 0) && !(file_in.eof())) {
std::getline(file_in, line);
} // Skips blank lines in file
std::string param;
std::stringstream ss(line);
std::vector<double> temp_point;
if ((line.length() != 0)) {
while (std::getline(ss, param, this->delim_in)) {
temp_point.push_back(atof(param.c_str()));
}
temp_centroid_vec.push_back(temp_point);
}
}
this->file_rpt.close();
this->import_centroids(temp_centroid_vec);
}
// Primary centroid import function
void cluster_set::import_centroids(const std::vector<std::vector<double>> ¢roid_vec) {
this->file_rpt.open(this->file_rpt_str, std::ios_base::app);
this->file_rpt << "\n" << "\n" << "Importing centroid vector to universe:";
if (!(this->centroids.empty())) {
this->file_rpt << "\n" << " WARNING: Universe already initialized - flushing all clusters ...";
this->centroids.clear();
}
for (auto cent_vec_iter = centroid_vec.begin(); cent_vec_iter != centroid_vec.end(); ++cent_vec_iter) {
if (this->centroids.empty()) {
this->centroids.push_back(*cent_vec_iter);
} else if (cent_vec_iter->size() == this->centroids.front().size()) {
this->centroids.push_back(*cent_vec_iter);
}
}
this->file_rpt << "\n" << " " << this->centroids.size() << " Centroid imports SUCCESSFUL.";
this->file_rpt << "\n" << " " << centroid_vec.size() - this->centroids.size() << " Centroid imports FAILED.";
this->file_rpt.close();
}
// Primary cluster generator - aborts if no centroids exists.
void cluster_set::clustergen(const unsigned int k) {
static std::default_random_engine gen(
std::chrono::system_clock::now().time_since_epoch().count()); // Random seed
this->file_rpt.open(this->file_rpt_str, std::ios_base::app);
this->file_out.open(this->file_out_str);
this->file_rpt << "\n" << "\n" << "Generating Points:";
if (this->centroids.empty()) {
this->file_rpt << "\n" << "ERROR: No centroids have been imported. Aborting operation.";
return;
}
const unsigned int n = k / this->centroids.size();
unsigned rem = k % this->centroids.size();
unsigned int centroid_ct = 0;
for (auto centroid_iter = this->centroids.begin(); centroid_iter != this->centroids.end(); ++centroid_iter) {
unsigned int subset = n + (rem ? 1 : 0);
this->file_rpt << "\n" << " Generating " << subset << " points around centroid " << ++centroid_ct
<< " ...";
while (subset) {
std::vector<double> temp_point;
for (auto dimension_iter = centroid_iter->begin();
dimension_iter != centroid_iter->end(); ++dimension_iter) {
// Would love to put something here to allow users to specify their own distribution...
std::normal_distribution<double> distr(*dimension_iter, 10);
temp_point.push_back(distr(gen));
}
for (auto temp_point_iter = temp_point.begin();
temp_point_iter != temp_point.end(); ++temp_point_iter) {
if (temp_point_iter != temp_point.begin()) { this->file_out << this->delim_out; }
this->file_out << (*temp_point_iter);
}
if (subset - 1) { this->file_out << "\n"; };
--subset;
if (rem) { --rem; }
}
this->file_rpt << " Done.";
auto centroid_iter_peek = centroid_iter;
++centroid_iter_peek;
if (centroid_iter_peek != centroids.end()) { this->file_out << "\n"; };
}
this->file_rpt << "\n\n " << k << " total points generated around " << centroid_ct << " centroids.";
this->file_out.close();
this->file_rpt.close();
}
main.h
(Contains a few examples)
#include "clustergen.h"
int main() {
std::vector<std::vector<double>> v = {{11, 22},
{33, 44},
{55, 66, 77},
{88, 99},
{100}};
std::vector<std::vector<double>> v2 = {{110, 120},
{130, 140},
{150, 160, 170},
{180, 190},
{200}};
cluster_set my_clusters;
// Vector import with default out file, reporting file, and delimiters.
my_clusters.import_centroids(v);
my_clusters.clustergen(10);
cluster_set my_clusters1("1_clustergen_out.dat", "1_clustergen_rpt.dat");
// Vector import with user specified out file and reporting file, and default delimiters.
my_clusters1.import_centroids(v2);
my_clusters1.clustergen(11);
cluster_set my_clusters2("2_clustergen_out.dat", "2_clustergen_rpt.dat");
// Vector import with user specified out file, reporting file, and delimiters.
my_clusters2.set_delim_in(' ');
my_clusters2.set_delim_out('@');
my_clusters2.import_centroids(v);
my_clusters2.clustergen(12);
cluster_set my_clusters3("3_clustergen_out.dat", "3_clustergen_rpt.dat");
// File import with default out file, reporting file, and delimiters.
my_clusters3.set_delim_in(' ');
my_clusters3.import_centroids("clustergen_in.dat");
my_clusters3.clustergen(13);
cluster_set my_clusters4("4_clustergen_out.dat", "4_clustergen_rpt.dat");
// File import with user specified out file and reporting file, and default delimiters.
my_clusters4.set_delim_in('#');
my_clusters4.import_centroids("clustergen_in2.dat");
my_clusters4.clustergen(14);
cluster_set my_clusters5("5_clustergen_out.dat", "5_clustergen_rpt.dat");
// File import with user specified out file, reporting file, and delimiters.
my_clusters5.set_delim_in('#');
my_clusters5.set_delim_out('@');
my_clusters5.import_centroids("clustergen_in2.dat");
my_clusters5.clustergen(15);
}
clustergen_in.dat
(Used in the main.h
examples above)
210 220
230 240
250 260 270
280 290
200
clustergen_in2.dat
(Used in the main.h
examples above)
210#220
230#240
250#260#270
280#290
200
-
\$\begingroup\$ Fair enough - thank you for the advice. This is still a little new to me... \$\endgroup\$Miller– Miller2017年08月16日 21:09:32 +00:00Commented Aug 16, 2017 at 21:09
-
\$\begingroup\$ Please note, this question can be depreciated in favor of version 0.31. \$\endgroup\$Miller– Miller2017年08月17日 17:44:36 +00:00Commented Aug 17, 2017 at 17:44
1 Answer 1
#include <fstream> #include <vector> #include <string>
It's really nice to see that you are only #include
ing what you need. You could sort the includes alphabetically, but that really doesn't matter.
class cluster_set { ... std::string file_out_str; // The output file name - only settable on construction std::string file_rpt_str; // The report file name - only settable on construction std::ofstream file_out; // Instanced on clustergen() to prevent multiple outputs to the same file std::ofstream file_rpt; // Appends throughout program
file_out_str
and file_rpt_str
are only used to initialize your ofstream
s. They don't need to be member variables.
The names file_out
and file_rpt
are not the best. First, what does "rpt" mean? Repeat? Report? Ramco-Gershenson Properties Trust (that's what comes up from a Google search of "rpt")? Second, it's hard to use file_out
in an English sentence:
Now, I want to take file_out and write this data to it.
If you change it around, it flows more naturally:
Now, I want to take this outputFile and write this data to it.
cluster_set(const std::string &, const std::string &); // User specified output and report file
The comment here isn't that useful if you instead name the arguments:
cluster_set(const std::string & outputFileName, const std::string & reportFileName);
And you can use default arguments to remove the need for a default constructor:
cluster_set(const std::string & outputFileName = "clustergen_out.dat",
const std::string & reportFileName = "clustergen_rpt.dat");
Also, it's a lot more general to take a std::ostream&
than it is to take a string filename. With a std::ostream&
, the user of your utility could use std::cout
, for instance, and it makes testing easier.
That would mean that you'd have to change file_out
and file_rpt
to be std::ostream&
s, but having references as data members can have a couple problems, so you might want to use a std::reference_wrapper<std::ostream>
or a std::ostream*
.
void import_centroids(const std::string &); // File import - uses vector import void import_centroids(const std::vector<std::vector<double>> &); // Vector import - primary import algo
These functions imply that there is some setup needed to actually use a cluster_set
. However, it is a good idea to have all setup be finished at the end of your constructor. These functions should be removed in favor of an extra constructor parameter of const std::vector<std::vector<double>>&
; to import from file, you can have a free function:
std::vector<std::vector<double>> read_centroids(std::istream& inputFile);
// or perhaps:
std::vector<std::vector<double>> read_centroids(const std::string & inputFileName) {
// with both functions, this would be an implementation of it;
// just to show how easy it is to use a `std::istream` or `std::ostream` parameter
std::ifstream file{inputFileName};
return read_centroids(file);
}
void clustergen(const unsigned int); // Primary algorithm - generates [unsigned int] number of points
Minor nitpick: in your header files, you usually don't want to take parameters by const
and by value. Just make the parameter unsigned int
, then when you actually implement cluster_set::clustergen
, you can mark the parameter const
.
You might consider moving the file_out
and file_rpt
member variables to be parameters of this function. It depends on how you want your cluster_set
to behave. By having file_out
and file_rpt
as member variables, you have an object with these properties:
- Can write data to
file_out
andfile_rpt
, given ourunsigned int
parameter.
By having file_out
and file_rpt
as parameters to this function, you have an object with these properties:
- Models a bunch of cluster information
- Can write the cluster information to a file, given the file information
cluster_set::cluster_set(void) {
In C++, don't mark functions taking 0 arguments with a void
in the parameter pack.
this->file_out_str = "clustergen_out.dat"; this->file_rpt_str = "clustergen_rpt.dat"; this->delim_in = ','; this->delim_out = ','; this->file_rpt.open(this->file_rpt_str); this->file_rpt << "CLUSTERGEN STATUS REPORT FOLLOWS"; this->file_rpt.close();
You should use the member initializer list. You don't need all the this->
, but some people do prefer having it.
cluster_set::cluster_set()
: file_out_str{ "clustergen_out.dat" }
, file_rpt_str{ "culstergen_rpt.dat" }
, delim_in{ ',' } // note that you may get a warning about this not being the same order as in the class; just change the order here to fix that
, delim_out{ ',' }
, file_rpt{ file_rpt_str } // this is where the order matters; `file_rpt` has to be after `file_rpt_str` in the class declaration for this to work
{
file_rpt << "CLUSTERGEN STATUS REPORT FOLLOWS";
file_rpt.close();
}
Also, it's strange to close file_rpt
already, since we are going to be using it later.
Also, this constructor can be implemented in terms of the other:
cluster_set::cluster_set()
: cluster_set{"clustergen_out.dat", "clustergen_rpt.dat"}
{}
// Primary cluster generator - aborts if no centroids exists. void cluster_set::clustergen(const unsigned int k) { static std::default_random_engine gen( std::chrono::system_clock::now().time_since_epoch().count()); // Random seed
It's nice to see that you make the random engine static
. It is, however, not the best to initialize it with the time. You should probably initialize it with a std::random_device
, which uses /dev/urandom
or the equivalent. You may also want to use std::mt19937
.
It might look like this:
void cluster_set::clustergen(const unsigned int k) {
static std::default_random_engine gen{std::random_device{}()};
That may very well not be enough entropy to fully initialize the random engine, though (it will not for std::mt19937
), but to actually seed the engine is annoyingly difficult in C++ as of now. At any rate, this is better than using the current time.
If you wanted to let users specify their own distribution, you can make your function into a template function and take a function parameter to make the distribution:
template <typename DistributionCreator>
void clustergen(const unsigned int k, DistributionCreator createDistribution) {
static std::default_random_engine gen(
std::chrono::system_clock::now().time_since_epoch().count());
// ...
for (auto dimension_iter = centroid_iter->begin();
dimension_iter != centroid_iter->end(); ++dimension_iter) {
auto distr = createDistribution(*dimension_iter);
temp_point.push_back(distr(gen));
}
// ...
}
Although this would then have to be defined in the header file, and you'd probably want to move the random engine to be in a different function, so that there is only one instance of it:
private:
static std::default_random_engine & gen();
template <typename DistributionCreator>
void clustergen(const unsigned int k, DistributionCreator createDistribution) {
// ...
for (auto dimension_iter = centroid_iter->begin();
dimension_iter != centroid_iter->end(); ++dimension_iter) {
auto distr = createDistribution(*dimension_iter);
temp_point.push_back(distr(gen()));
}
// ...
}
Alternatively, you could create something like a DistributionProvider
interface and pass in a reference to that.
-
\$\begingroup\$ Thank you again, Justin! I'll work on this over the next few days - a quick note on the
std::string
vs.std::ostream
argument: I was attempting to concisify the setup inmain()
. All the user has to do currently is pass a string (the filename) - it also appears to make opening, closing, and appending a file a little easier as this happens many times throughout the program. But if I'm hearing you correctly, passing an ostream is more standard? \$\endgroup\$Miller– Miller2017年08月16日 21:05:50 +00:00Commented Aug 16, 2017 at 21:05 -
1\$\begingroup\$ @Miller Passing a
std::ostream&
is more general. It makes for a better interface. \$\endgroup\$Justin– Justin2017年08月16日 21:08:15 +00:00Commented Aug 16, 2017 at 21:08 -
\$\begingroup\$ One more quick note, @Justin. I'm taking your "minor nitpick" above into account while recoding, but it doesn't let me compile. Am I understanding correctly: you were saying delete the
const
s in the .h file, but leave them only in the .cpp? \$\endgroup\$Miller– Miller2017年08月17日 00:20:22 +00:00Commented Aug 17, 2017 at 0:20 -
2\$\begingroup\$ @Miller I'm saying that if you have some type
T
, and you have a parameterconst T parameter
because you don't modify it in the body of the function, just remove theconst
in the header file, and have it there in thecpp
file. Theconst
is an implementation detail. I was referring to that one specific parameter. For example: coliru.stacked-crooked.com/a/8d828cd0dc709715 \$\endgroup\$Justin– Justin2017年08月17日 00:31:48 +00:00Commented Aug 17, 2017 at 0:31 -
\$\begingroup\$ Gotcha - I was having trouble implementing this same thing on the vectors and ostreams, but it works with the
unsigned int
. I'm vastly reworking the code (again) with your suggestions and to get rid of reporting. \$\endgroup\$Miller– Miller2017年08月17日 00:36:58 +00:00Commented Aug 17, 2017 at 0:36
Explore related questions
See similar questions with these tags.