Python has amazing sci-kit learn library but I am building some projects on C++ with involves some machine learning algorithms. I found machine learning libraries in C++ involves more dependencies so I have decided to implement a library without dependencies It would be nice if you could review the code.
I have implemented least squares regression below.
lsr.h
// SWAMI KARUPPASWAMI THUNNAI
#pragma once
#include <iostream>
#include <string>
#include <sstream>
#include <fstream>
#include <vector>
/*
Implementation of simple linear regression or Least Squares Regression
Written By: Visweswaran N @ 2019年08月26日
*/
class simple_linear_regression
{
private:
// ~~~~~~~~~~~~~ PRIVATE VARIABLES ~~~~~~~~~~~~
unsigned int N=0.0;
// Independent and dependent variable
std::vector<double> X;
std::vector<double> y;
// XX and Xy
std::vector<double> XX;
std::vector<double> Xy;
// To print into console
bool verbose;
// Summation
double sigma_X = 0.0;
double sigma_y = 0.0;
double sigma_XX = 0.0;
double sigma_Xy = 0.0;
// Slope
double m = 0.0;
// Bias (or) Intercept
double b = 0.0;
// ~~~~~~~~~~ Methods ~~~~~~~~~~~~~
void print(std::string message);
void calculate_N();
void x_square();
void x_cross_y();
void calculate_sigma();
void calculate_slope();
void calculate_bias();
public:
simple_linear_regression(std::string model_name);
simple_linear_regression(std::vector<double> X, std::vector<double> y, bool verbose = false);
void train();
double predict(double _X);
void save_model(std::string file_name);
};
lsr.cpp
// SWAMI KARUPPASWAMI THUNNAI
#include "lsr.h"
void simple_linear_regression::print(std::string message)
{
if (this->verbose) std::cout << message << "\n";
}
void simple_linear_regression::calculate_N()
{
if (this->X.size() == this->y.size())
{
this->N = this->X.size();
}
else
{
throw "SIZE MISMATCH";
}
}
// Calculates the X square and saves it in XX vector
void simple_linear_regression::x_square()
{
print("Calculating X*X");
for (double i : this->X)
{
this->XX.push_back(i*i);
}
}
// Calculates X*y and stores it in Xy vector
void simple_linear_regression::x_cross_y()
{
print("Calculating X*y");
for (unsigned int index = 0; index < N; index++)
{
this->Xy.push_back(this->X[index] * this->y[index]);
}
}
// Calculate all the summation
void simple_linear_regression::calculate_sigma()
{
// Calculate summation of X
print("Calculating sigma for X");
for (double i : this->X)
{
this->sigma_X += i;
}
// Calculate the summation of Y
print("Calculating sigma for Y");
for (double i : this->y)
{
this->sigma_y += i;
}
// Calculate the summation of XX
print("Calculating sigma for X*X");
for (double i : this->XX)
{
this->sigma_XX += i;
}
// Calculate the summation of Xy
print("Calculating sigma for X*y");
for (double i : this->Xy)
{
this->sigma_Xy += i;
}
}
// Calculate our slope - m
void simple_linear_regression::calculate_slope()
{
print("Calculating slope");
this->m = ((this->N * this->sigma_Xy) - (this->sigma_X * this->sigma_y)) / ((this->N * this->sigma_XX) - (this->sigma_X * this->sigma_X));
}
// Calculate our intercept (or) bias
void simple_linear_regression::calculate_bias()
{
print("Calculating Intercept (or) Bias");
this->b = (sigma_y - (this->m * this->sigma_X)) / this->N;
}
// Constructor to load existing model
simple_linear_regression::simple_linear_regression(std::string model_name)
{
std::ifstream file;
file.open(model_name);
if (!file.is_open()) throw "Model cannot be opened!";
std::stringstream stream;
stream << file.rdbuf();
std::string values = stream.str();
unsigned int index = values.find(",");
std::string _m = "";
std::string _b = "";
for (unsigned int i = 0; i < index; i++)
{
_m += values[i];
}
for (unsigned int i = index+1; i < values.size(); i++)
{
_b += values[i];
}
this->m = std::stod(_m);
this->b = std::stod(_b);
}
// constructor for start training new model
simple_linear_regression::simple_linear_regression(std::vector<double> X, std::vector<double> y, bool verbose)
{
this->X = X;
this->y = y;
this->verbose = verbose;
}
// method to train our model, where all methods merges.
void simple_linear_regression::train()
{
print("Training");
this->calculate_N();
this->x_square();
this->x_cross_y();
this->calculate_sigma();
this->calculate_slope();
this->calculate_bias();
print("Model has been trained :)");
}
// Method used for predicting future values
double simple_linear_regression::predict(double _X)
{
return (this->m * _X) + this->b;
}
// A method used to save model so that we do not need to retrain it
void simple_linear_regression::save_model(std::string file_name)
{
std::ofstream file;
file.open(file_name);
if (!file.is_open()) throw "Model cannot be saved because file cannot be opened to write.Check for permissions and make sure the file is not open!";
file << this->m << "," << this->b;
file.close();
}
Training and saving the model
// SWAMI KARUPPASWAMI THUNNAI
#include "lsr.h"
int main()
{
// X, y, print_debug messages
simple_linear_regression slr({2, 3, 5, 7, 9}, {4, 5, 7, 10, 15}, true);
slr.train();
std::cout << slr.predict(8);
slr.save_model("model.txt");
}
Loading the saved model:
// SWAMI KARUPPASWAMI THUNNAI
#include "lsr.h"
int main()
{
// X, y, print_debug messages
simple_linear_regression slr("model.txt");
std::cout << slr.predict(8);
}
Example output:
Training Calculating X*X Calculating X*y Calculating sigma for X
Calculating sigma for Y Calculating sigma for X*X Calculating sigma
for X*y Calculating slope Calculating Intercept (or) Bias Model has
been trained :)
1.82317
4 Answers 4
Comments
Please be sure your comments add something meaningful. For example:
private:
// ~~~~~~~~~~~~~ PRIVATE VARIABLES ~~~~~~~~~~~~
unsigned int N=0.0;
In this case, the private:
already makes it pretty clear that what follows is private. The comment isn't adding anything that wasn't already pretty obvious.
At least in the typical case, the code already answers "what" kinds of questions (e.g., public: int n = 0;
tells us that N
is an int that's publicly accessible.
A comment should usually answer questions about "why?" Why does the code do one thing, when another might seem more obvious. Why did you make N
public, even though public variables are often rather frowned upon? (and so on).
Type matching
Looking at the previous code snippet, we also note that we have N
defined as an int
, but being initialized with the value 0.0
, which is a double
. Unless you have a reason to do otherwise, it's generally preferable to have the types match, so you'd use just int N = 0;
.
As an aside, I'd be less concerned with mismatches that are really just different sizes of what are otherwise similar types. For example, something like unsigned long long foo = 0;
is reasonable event though 0
if of type int
rather than unsigned long long
. But initializing an integer type with a floating point value tends to suggest some confusion about whether that variable should really be able to hold a fractional value or not.
Interface Design
One general piece of guidance about interface design is that you should strive for an interface that's easy to use, and hard to misuse.
So, at least as I see things, the public interface of a linear regression model consists primarily of only two functions: construct the model from some input, and get a prediction from a constructed model. As it stands right now, your class requires that the user explicitly call train
before they can use a model. And unfortunately, it looks like they can try to make a prediction without having called train
, which (of course) won't work.
At a more detailed level, rather than starting with:
std::vector<double> X;
std::vector<double> y;
...and then assuring those are the same size in simple_linear_regression::calculate_N()
, I'd prefer to start with a simple struct:
struct variable {
double x; // independent
double y; // dependent
};
Then have a single std::vector<variable>
for the rest of the code to work with. This way, it's simply impossible to create mis-matched number of independent/dependent variables, so there's no need for the rest of the code to check that the counts are equal.
Avoid Boolean parameters
With code like:
simple_linear_regression slr({2, 3, 5, 7, 9}, {4, 5, 7, 10, 15}, true);
It's not immediately obvious what true
vs. false
as the last parameter means. I'd prefer something like:
simple_linear_regression slr({2, 3, 5, 7, 9}, {4, 5, 7, 10, 15}, DEBUG);
or:
simple_linear_regression slr({2, 3, 5, 7, 9}, {4, 5, 7, 10, 15}, NODEBUG);
Of course, taking the previous point into account, this would be more like:
simple_linear_regression slr({{2,4},{3,5}, {5, 7}, {7, 10}, {9, 15}}, NODEBUG};
Naming
I'm not really excited about using the name train
for computing a linear regression. train
is typically used with things like neural networks to mean something that's just similar enough that the name could easily mislead somebody into believing your code is something rather different than it really is (e.g., something that trains a neural network to simulate linear regression).
As previously noted, however, I'd prefer to eliminate training as a separate step in any case.
Prefer initialization over assignment
Consider your constructor:
simple_linear_regression::simple_linear_regression(std::vector<double> X, std::vector<double> y, bool verbose)
{
this->X = X;
this->y = y;
this->verbose = verbose;
}
Rather than assigning values inside the body of the ctor, I'd prefer to use a member initializer list:
simple_linear_regression::simple_linear_regression(std::vector<double> X,
std::vector<double> y,
bool verbose)
: X(X)
, Y(Y)
, verbose(verbose)
{}
C++ is not Java. All these
this->
can be safely omitted.Most of the methods of
lsr.cpp
are better expressed via STL numeric algorithms. For example,calculate_sigma
is 4 calls tostd::accumulate
, whilex_square
andx_cross_y
are calls tostd::inner_product
.From a numerical point of view, following the formulas naively may lead to significant computational errors. The intermediate results, e.g.
sigma_XX
andsigma_X * sigma_X
could be huge. Thus the result of(this->N * this->sigma_XX) - (this->sigma_X * this->sigma_X)
is already compromised, and using it as denominator further amplifies the (relative) error.
In any case, the computational result is only as good as its error margin. You must compute it as well.
The entire library is just that, a library. I don't see a compelling reason to wrap it in a class. Indeed, the data members are transient; they better be local variables.
If you insist on a class,
calculate_N
should be called in a constructor.It feels strange that the saved model (which is just slope and intercept) is completely disconnected from the data on which it's been trained.
Nitpick:
lsr.[ch]
should beslr.[ch]
.
-
\$\begingroup\$ I agree with your point about transient member variables. Much of the design looks very much like an end run around the usual admonitions about global variables. \$\endgroup\$Chris– Chris2025年08月09日 20:45:28 +00:00Commented Aug 9 at 20:45
How about the numerical type other than double
?
Your simple_linear_regression
already fixed the data type for regression is double
. However, how about float
or long double
cases? You can make your simple_linear_regression
class into a class template which make numerical type into template parameter like:
<template NumericalType>
class simple_linear_regression
Moreover, if you can use C++20 or above, you can make the simple_linear_regression
class be a constraint class template like:
<template NumericalType>
requires (std::floating_point<NumericalType> || std::integral<NumericalType>)
class simple_linear_regression
Please use subclasses of std::exception
when throwing. Although C++ permits
throw "SIZE MISMATCH";
this leads to code that's difficult to use (especially if all exception paths throw string literal types - it becomes hard to handle particular subsets).