Simple Linear Regression in C++

Question 1

Python has amazing sci-kit learn library but I am building some projects on C++ with involves some machine learning algorithms. I found machine learning libraries in C++ involves more dependencies so I have decided to implement a library without dependencies It would be nice if you could review the code.

I have implemented least squares regression below.

lsr.h

// SWAMI KARUPPASWAMI THUNNAI
#pragma once
#include <iostream>
#include <string>
#include <sstream>
#include <fstream>
#include <vector>
/*
Implementation of simple linear regression or Least Squares Regression
Written By: Visweswaran N @ 2019年08月26日
*/
class simple_linear_regression
{
private:
 // ~~~~~~~~~~~~~ PRIVATE VARIABLES ~~~~~~~~~~~~
 unsigned int N=0.0;
 // Independent and dependent variable
 std::vector<double> X;
 std::vector<double> y;
 // XX and Xy
 std::vector<double> XX;
 std::vector<double> Xy;
 // To print into console
 bool verbose;
 // Summation
 double sigma_X = 0.0;
 double sigma_y = 0.0;
 double sigma_XX = 0.0;
 double sigma_Xy = 0.0;
 // Slope
 double m = 0.0;
 // Bias (or) Intercept
 double b = 0.0;
 // ~~~~~~~~~~ Methods ~~~~~~~~~~~~~
 void print(std::string message);
 void calculate_N();
 void x_square();
 void x_cross_y();
 void calculate_sigma();
 void calculate_slope();
 void calculate_bias();
public:
 simple_linear_regression(std::string model_name);
 simple_linear_regression(std::vector<double> X, std::vector<double> y, bool verbose = false);
 void train();
 double predict(double _X);
 void save_model(std::string file_name);
};

lsr.cpp

// SWAMI KARUPPASWAMI THUNNAI
#include "lsr.h"
void simple_linear_regression::print(std::string message)
{
 if (this->verbose) std::cout << message << "\n";
}
void simple_linear_regression::calculate_N()
{
 if (this->X.size() == this->y.size())
 {
 this->N = this->X.size();
 }
 else
 {
 throw "SIZE MISMATCH";
 }
}
// Calculates the X square and saves it in XX vector
void simple_linear_regression::x_square()
{
 print("Calculating X*X");
 for (double i : this->X)
 {
 this->XX.push_back(i*i);
 }
}
// Calculates X*y and stores it in Xy vector
void simple_linear_regression::x_cross_y()
{
 print("Calculating X*y");
 for (unsigned int index = 0; index < N; index++)
 {
 this->Xy.push_back(this->X[index] * this->y[index]);
 }
}
// Calculate all the summation
void simple_linear_regression::calculate_sigma()
{
 // Calculate summation of X
 print("Calculating sigma for X");
 for (double i : this->X)
 {
 this->sigma_X += i;
 }
 // Calculate the summation of Y
 print("Calculating sigma for Y");
 for (double i : this->y)
 {
 this->sigma_y += i;
 }
 // Calculate the summation of XX
 print("Calculating sigma for X*X");
 for (double i : this->XX)
 {
 this->sigma_XX += i;
 }
 // Calculate the summation of Xy
 print("Calculating sigma for X*y");
 for (double i : this->Xy)
 {
 this->sigma_Xy += i;
 }
}
// Calculate our slope - m
void simple_linear_regression::calculate_slope()
{
 print("Calculating slope");
 this->m = ((this->N * this->sigma_Xy) - (this->sigma_X * this->sigma_y)) / ((this->N * this->sigma_XX) - (this->sigma_X * this->sigma_X)); 
}
// Calculate our intercept (or) bias
void simple_linear_regression::calculate_bias()
{
 print("Calculating Intercept (or) Bias");
 this->b = (sigma_y - (this->m * this->sigma_X)) / this->N;
}
// Constructor to load existing model
simple_linear_regression::simple_linear_regression(std::string model_name)
{
 std::ifstream file;
 file.open(model_name);
 if (!file.is_open()) throw "Model cannot be opened!";
 std::stringstream stream;
 stream << file.rdbuf();
 std::string values = stream.str();
 unsigned int index = values.find(",");
 std::string _m = "";
 std::string _b = "";
 for (unsigned int i = 0; i < index; i++)
 {
 _m += values[i];
 }
 for (unsigned int i = index+1; i < values.size(); i++)
 {
 _b += values[i];
 }
 this->m = std::stod(_m);
 this->b = std::stod(_b);
}
// constructor for start training new model
simple_linear_regression::simple_linear_regression(std::vector<double> X, std::vector<double> y, bool verbose)
{
 this->X = X;
 this->y = y;
 this->verbose = verbose;
}
// method to train our model, where all methods merges. 
void simple_linear_regression::train()
{
 print("Training");
 this->calculate_N();
 this->x_square();
 this->x_cross_y();
 this->calculate_sigma();
 this->calculate_slope();
 this->calculate_bias();
 print("Model has been trained :)");
}
// Method used for predicting future values
double simple_linear_regression::predict(double _X)
{
 return (this->m * _X) + this->b;
}
// A method used to save model so that we do not need to retrain it
void simple_linear_regression::save_model(std::string file_name)
{
 std::ofstream file;
 file.open(file_name);
 if (!file.is_open()) throw "Model cannot be saved because file cannot be opened to write.Check for permissions and make sure the file is not open!";
 file << this->m << "," << this->b;
 file.close();
}

Training and saving the model

// SWAMI KARUPPASWAMI THUNNAI
#include "lsr.h"
int main()
{
 // X, y, print_debug messages
 simple_linear_regression slr({2, 3, 5, 7, 9}, {4, 5, 7, 10, 15}, true);
 slr.train();
 std::cout << slr.predict(8);
 slr.save_model("model.txt");
}

Loading the saved model:

// SWAMI KARUPPASWAMI THUNNAI
#include "lsr.h"
int main()
{
 // X, y, print_debug messages
 simple_linear_regression slr("model.txt");
 std::cout << slr.predict(8);
}

Example output:

Training Calculating X*X Calculating X*y Calculating sigma for X
Calculating sigma for Y Calculating sigma for X*X Calculating sigma
for X*y Calculating slope Calculating Intercept (or) Bias Model has
been trained :)
1.82317

Github for easy code download

Question 2

Comments

Please be sure your comments add something meaningful. For example:

private:
 // ~~~~~~~~~~~~~ PRIVATE VARIABLES ~~~~~~~~~~~~
 unsigned int N=0.0;

In this case, the private: already makes it pretty clear that what follows is private. The comment isn't adding anything that wasn't already pretty obvious.

At least in the typical case, the code already answers "what" kinds of questions (e.g., public: int n = 0; tells us that N is an int that's publicly accessible.

A comment should usually answer questions about "why?" Why does the code do one thing, when another might seem more obvious. Why did you make N public, even though public variables are often rather frowned upon? (and so on).

Type matching

Looking at the previous code snippet, we also note that we have N defined as an int, but being initialized with the value 0.0, which is a double. Unless you have a reason to do otherwise, it's generally preferable to have the types match, so you'd use just int N = 0;.

As an aside, I'd be less concerned with mismatches that are really just different sizes of what are otherwise similar types. For example, something like unsigned long long foo = 0; is reasonable event though 0 if of type int rather than unsigned long long. But initializing an integer type with a floating point value tends to suggest some confusion about whether that variable should really be able to hold a fractional value or not.

Interface Design

One general piece of guidance about interface design is that you should strive for an interface that's easy to use, and hard to misuse.

So, at least as I see things, the public interface of a linear regression model consists primarily of only two functions: construct the model from some input, and get a prediction from a constructed model. As it stands right now, your class requires that the user explicitly call train before they can use a model. And unfortunately, it looks like they can try to make a prediction without having called train, which (of course) won't work.

At a more detailed level, rather than starting with:

std::vector<double> X;
std::vector<double> y;

...and then assuring those are the same size in simple_linear_regression::calculate_N(), I'd prefer to start with a simple struct:

struct variable { 
 double x; // independent
 double y; // dependent
};

Then have a single std::vector<variable> for the rest of the code to work with. This way, it's simply impossible to create mis-matched number of independent/dependent variables, so there's no need for the rest of the code to check that the counts are equal.

Avoid Boolean parameters

With code like:

simple_linear_regression slr({2, 3, 5, 7, 9}, {4, 5, 7, 10, 15}, true);

It's not immediately obvious what true vs. false as the last parameter means. I'd prefer something like:

simple_linear_regression slr({2, 3, 5, 7, 9}, {4, 5, 7, 10, 15}, DEBUG);

or:

simple_linear_regression slr({2, 3, 5, 7, 9}, {4, 5, 7, 10, 15}, NODEBUG);

Of course, taking the previous point into account, this would be more like:

simple_linear_regression slr({{2,4},{3,5}, {5, 7}, {7, 10}, {9, 15}}, NODEBUG};

Naming

I'm not really excited about using the name train for computing a linear regression. train is typically used with things like neural networks to mean something that's just similar enough that the name could easily mislead somebody into believing your code is something rather different than it really is (e.g., something that trains a neural network to simulate linear regression).

As previously noted, however, I'd prefer to eliminate training as a separate step in any case.

Prefer initialization over assignment

Consider your constructor:

simple_linear_regression::simple_linear_regression(std::vector<double> X, std::vector<double> y, bool verbose)
{
 this->X = X;
 this->y = y;
 this->verbose = verbose;
}

Rather than assigning values inside the body of the ctor, I'd prefer to use a member initializer list:

simple_linear_regression::simple_linear_regression(std::vector<double> X,
 std::vector<double> y, 
 bool verbose)
 : X(X)
 , Y(Y)
 , verbose(verbose)
{}

Question 3

C++ is not Java. All these this-> can be safely omitted.
Most of the methods of lsr.cpp are better expressed via STL numeric algorithms. For example, calculate_sigma is 4 calls to std::accumulate, while x_square and x_cross_y are calls to std::inner_product.
From a numerical point of view, following the formulas naively may lead to significant computational errors. The intermediate results, e.g. sigma_XX and sigma_X * sigma_X could be huge. Thus the result of
```
 (this->N * this->sigma_XX) - (this->sigma_X * this->sigma_X)
```
is already compromised, and using it as denominator further amplifies the (relative) error.
In any case, the computational result is only as good as its error margin. You must compute it as well.
The entire library is just that, a library. I don't see a compelling reason to wrap it in a class. Indeed, the data members are transient; they better be local variables.

If you insist on a class, calculate_N should be called in a constructor.
It feels strange that the saved model (which is just slope and intercept) is completely disconnected from the data on which it's been trained.
Nitpick: lsr.[ch] should be slr.[ch].

Question 4

I agree with your point about transient member variables. Much of the design looks very much like an end run around the usual admonitions about global variables.

Question 5

How about the numerical type other than `double`?

Your simple_linear_regression already fixed the data type for regression is double. However, how about float or long double cases? You can make your simple_linear_regression class into a class template which make numerical type into template parameter like:

<template NumericalType>
class simple_linear_regression

Moreover, if you can use C++20 or above, you can make the simple_linear_regression class be a constraint class template like:

<template NumericalType>
requires (std::floating_point<NumericalType> || std::integral<NumericalType>)
class simple_linear_regression

Question 6

Please use subclasses of std::exception when throwing. Although C++ permits

 throw "SIZE MISMATCH";

this leads to code that's difficult to use (especially if all exception paths throw string literal types - it becomes hard to handle particular subsets).

Jerry Coffin Jerry Coffin 34.1k4 gold badges77 silver badges144 bronze badges · Accepted Answer · 2019-08-26 18:36:12Z

Comments

Please be sure your comments add something meaningful. For example:

private:
 // ~~~~~~~~~~~~~ PRIVATE VARIABLES ~~~~~~~~~~~~
 unsigned int N=0.0;

In this case, the private: already makes it pretty clear that what follows is private. The comment isn't adding anything that wasn't already pretty obvious.

At least in the typical case, the code already answers "what" kinds of questions (e.g., public: int n = 0; tells us that N is an int that's publicly accessible.

A comment should usually answer questions about "why?" Why does the code do one thing, when another might seem more obvious. Why did you make N public, even though public variables are often rather frowned upon? (and so on).

Type matching

Looking at the previous code snippet, we also note that we have N defined as an int, but being initialized with the value 0.0, which is a double. Unless you have a reason to do otherwise, it's generally preferable to have the types match, so you'd use just int N = 0;.

As an aside, I'd be less concerned with mismatches that are really just different sizes of what are otherwise similar types. For example, something like unsigned long long foo = 0; is reasonable event though 0 if of type int rather than unsigned long long. But initializing an integer type with a floating point value tends to suggest some confusion about whether that variable should really be able to hold a fractional value or not.

Interface Design

One general piece of guidance about interface design is that you should strive for an interface that's easy to use, and hard to misuse.

So, at least as I see things, the public interface of a linear regression model consists primarily of only two functions: construct the model from some input, and get a prediction from a constructed model. As it stands right now, your class requires that the user explicitly call train before they can use a model. And unfortunately, it looks like they can try to make a prediction without having called train, which (of course) won't work.

At a more detailed level, rather than starting with:

std::vector<double> X;
std::vector<double> y;

...and then assuring those are the same size in simple_linear_regression::calculate_N(), I'd prefer to start with a simple struct:

struct variable { 
 double x; // independent
 double y; // dependent
};

Then have a single std::vector<variable> for the rest of the code to work with. This way, it's simply impossible to create mis-matched number of independent/dependent variables, so there's no need for the rest of the code to check that the counts are equal.

Avoid Boolean parameters

With code like:

simple_linear_regression slr({2, 3, 5, 7, 9}, {4, 5, 7, 10, 15}, true);

It's not immediately obvious what true vs. false as the last parameter means. I'd prefer something like:

simple_linear_regression slr({2, 3, 5, 7, 9}, {4, 5, 7, 10, 15}, DEBUG);

or:

simple_linear_regression slr({2, 3, 5, 7, 9}, {4, 5, 7, 10, 15}, NODEBUG);

Of course, taking the previous point into account, this would be more like:

simple_linear_regression slr({{2,4},{3,5}, {5, 7}, {7, 10}, {9, 15}}, NODEBUG};

Naming

I'm not really excited about using the name train for computing a linear regression. train is typically used with things like neural networks to mean something that's just similar enough that the name could easily mislead somebody into believing your code is something rather different than it really is (e.g., something that trains a neural network to simulate linear regression).

As previously noted, however, I'd prefer to eliminate training as a separate step in any case.

Prefer initialization over assignment

Consider your constructor:

simple_linear_regression::simple_linear_regression(std::vector<double> X, std::vector<double> y, bool verbose)
{
 this->X = X;
 this->y = y;
 this->verbose = verbose;
}

Rather than assigning values inside the body of the ctor, I'd prefer to use a member initializer list:

simple_linear_regression::simple_linear_regression(std::vector<double> X,
 std::vector<double> y, 
 bool verbose)
 : X(X)
 , Y(Y)
 , verbose(verbose)
{}

Stack Exchange Network

Simple Linear Regression in C++

4 Answers 4

Comments

Type matching

Interface Design

Avoid Boolean parameters

Naming

Prefer initialization over assignment

How about the numerical type other than `double`?

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Simple Linear Regression in C++

4 Answers 4

Comments

Type matching

Interface Design

Avoid Boolean parameters

Naming

Prefer initialization over assignment

How about the numerical type other than double?

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

How about the numerical type other than `double`?