Linear Regression Class in Python

Question 1

I have recently been brushing up on my statistics and calculus and wanted to implement Linear Regression, the code will be for a calculus/statistics library I am working on (I know there are libraries for this but I am trying improve my both coding and math skills).

The following code works as intended. However, I feel like it has some structural flaws and I can't put my finger on what it is

class LinearReg:
 x = []
 y = []
 x_mean = 0
 y_mean = 0 
 b_zero = 0
 slope = 0
 def __init__(self,x,y):
 if len(x) != len(y):
 raise Error("Both axis must have the same number of values")
 self.x = x
 self.y = y
 self.x_mean = self.axis_mean(self.x)
 self.y_mean = self.axis_mean(self.y)
 def axis_mean(self,axis):
 return sum(axis) / len(axis)
 def sum_of_deviation_products(self):
 result = 0
 for i in range(len(self.y)):
 x_dev = (self.x[i] - self.x_mean)
 y_dev = (self.y[i] - self.y_mean)
 result += x_dev * y_dev
 return result
 def sum_of_x_deviation_squared(self):
 result = 0
 for i in range(len(self.x)):
 result += (self.x[i] - self.x_mean)**2
 return result
 def get_b_zero(self):
 return self.b_zero
 def get_slope(self):
 self.slope = self.sum_of_deviation_products() / self.sum_of_x_deviation_squared()
 return self.slope
 def fit_best_line(self):
 self.b_zero = self.y_mean - (self.get_slope() * self.x_mean)
 print("The best line equation is: " + 
 "%.2f" % self.slope
 + "x + " + 
 "%.2f" % self.b_zero)

My primary focus is now: structure and cleanliness,

How efficient would this play out in a large library?
How clean and readable is the code?
What kind of problems would I face if this class were to interact with other classes in the library?

Perhaps there is quicker way to do Linear Regression in Theory, but I'm not really interested in that right now.

Question 2

Toward optimized functionality and design

since all crucial attributes x, y, x_mean, y_mean are initialized on LinearReg instance creation (within __init__ method) - no need to duplicate and define them as class attributes (x = [] ... y = []), those should be removed as redundant
raise Error - Error is not Python exception class. Use Exception or ValueError
axis_mean(self, axis) function does not use self context and deserves to be just a @staticmethod
sum_of_deviation_products/sum_of_x_deviation_squared functions
Substitute algorithm: instead of going with external variable result + range(len(..)) + for loop - apply a convenient combination of sum + enumerate functions:
```
def sum_of_deviation_products(self):
 return sum((self.x[i] - self.x_mean) * (y - self.y_mean)
 for i, y in enumerate(self.y))
```
A good (or better) alternative would be sum + zip approach: sum((x - self.x_mean) * (y - self.y_mean) for x, y in zip(self.x, self.y))
get_b_zero method is redundant as it's just return a public attribute self.b_zero (which is accessed directly)
get_slope method
Instead of storing and reassigning self.slope attribute on each method invocation - as it's recalculated each time, it deserves to be a computed attribute using @property decorator:
```
@property
def slope(self):
 return self.sum_of_deviation_products() / self.sum_of_x_deviation_squared()
```
Now, it's simply accessed with self.slope.
fit_best_line method
Instead of unreadable string concatenation like ... + "%.2f" % self.slope + "x + " + "%.2f" % self.b_zero use flexible f-string formatting:

print(f"The best line equation is: {self.slope:.2f}x + {self.b_zero:.2f}").

The conciseness and flexibility are obvious.
Also, in case if b_zero attribute/property happen to be only actual in context of fit_best_line call - it can be eliminated and declared as just local variable.

The final optimized approach:

class LinearReg:
 def __init__(self, x, y):
 if len(x) != len(y):
 raise ValueError("Both axis must have the same number of values")
 self.x = x
 self.y = y
 self.x_mean = self.axis_mean(self.x)
 self.y_mean = self.axis_mean(self.y)
 self.b_zero = 0
 @staticmethod
 def axis_mean(axis):
 return sum(axis) / len(axis)
 def sum_of_deviation_products(self):
 return sum((self.x[i] - self.x_mean) * (y - self.y_mean)
 for i, y in enumerate(self.y))
 def sum_of_x_deviation_squared(self):
 return sum((x - self.x_mean) ** 2
 for x in self.x)
 @property
 def slope(self):
 return self.sum_of_deviation_products() / self.sum_of_x_deviation_squared()
 def fit_best_line(self):
 self.b_zero = self.y_mean - (self.slope * self.x_mean)
 print(f"The best line equation is: {self.slope:.2f}x + {self.b_zero:.2f}")

Question 3

Loop like a native, and avoid indexing. Using sum((x - self.x_mean) * (y - self.y_mean) for x, y in zip(self.x, self.y)) would be cleaner and faster. Ditto for sum((x - self.x_mean) ** 2 for x in self.x).

Question 4

What's the use of adding @property in this case? Also would it be reasonable put self.slope = self.sum_of_deviation_products() / self.sum_of_x_deviation_squared() in the constructor? Also same goes for axis_mean.

Question 5

@AJNeufeld, thanks for the hint, was writing a quick review at late night (could miss something)

Question 6

Consistency

Your class does not seem consistent with itself. If you create a LinearReg object:

x = [1, 2, 3, 4, 5]
y = [2, 3, 2.5, 4, 3.5]
lr = LinearReg(x, y)

The constructor computes the mean of the x and y lists, and stores these for later computations. These means are fixed; you cannot add additional data points, as the mean will not be recomputed.

If you call lr.get_b_zero(), you will return the value 0, since the b_zero member has not been computed.

Each time you call lr.get_slope(), it recomputes the slope and stores the result. If you have changed the x and y lists, the computed value will be different ... and incorrect since the mean values have not changed.

Each time you call fit_best_line(), it calls get_slope() which recomputes the slope (possibly incorrectly if the data has changed), and uses the result to compute (or recompute) and store the b_zero value. And as a side effect, it prints the line equation.

If you want to use this class to get the best fit straight line, and use the slope & intercept in other calculations, you must:

Create the LinearReg object
Call fit_best_line() to compute the b_zero
- Ignore the spurious, unnecessary print output
Call get_slope() (unnecessarily recomputes the slope)
Call get_b_zero().
Use the return values.

I'm certain you'd agree this is somewhat awkward. It would not work well in a large library.

Stop Writing Classes

See the YouTube video Stop Writing Classes for more detail.

You don't need a class for this; just a single function. This function would:

Compute the means
Compute the slope
Compute the b_zero
Return the above

A class is unnecessary, because everything can be done in this one simple function call. No state needs to be maintained.

from collections import namedtuple
LinearReg = namedtuple('LinearReg', 'x_mean, y_mean, slope, b_zero')
def linear_regression(x, y):
 if len(x) != len(y):
 raise ValueError("Both axis must have the same number of values")
 x_mean = sum(x) / len(x)
 y_mean = sum(y) / len(y)
 sum_x2_dev = sum((xi - x_mean) ** 2 for xi in x)
 sum_xy_dev = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
 slope = sum_xy_dev / sum_x2_dev
 b_zero = y_mean - slope * x_mean
 return LinearReg(x_mean, y_mean, slope, b_zero)

Usage:

>>> lr = linear_regression([1, 2, 3, 4, 5], [2, 3, 2.5, 4, 3.5])
>>> lr.slope
0.4
>>> lr.b_zero
1.7999999999999998

We can slightly beefed up the namedtuple, adding some convenience methods to it:

class LinearReg(namedtuple('LinearReg', 'x_mean, y_mean, slope, b_zero')):
 def __str__(self):
 return f"The best line equation is: {self.slope:.2f}x + {self.b_zero:.2f}"
 def __call__(self, x):
 """
 Interpolation/extrapolation: return y-value for a given x-value
 """
 return self.slope * x + self.b_zero

And then we'd additionally have:

>>> str(lr)
'The best line equation is: 0.40x + 1.80'
>>> lr(6)
4.2

A Use Case for a Class

If your LinearReg object allowed additional data points to be added, then you would have "state" which could be maintained in your object. And then it would make sense to recompute the slope and b_zero values when requested.

Question 7

I get your point on not writing a class for this purpose, however I feel like when creating a function for this purpose, it hurts the readablity (not saying my class is super readable but I feel like it's more clear to see what's going on).

Question 8

Some good points have been said, I'm not going over those again. Just wanted to say that your code is probably meant to be used by someone else. Even if it's not, your API should reflect what your code will be working on. I would not take 2 arrays for input, but an array of 2d coordinates. Then, if for your computation it's easier to split those coordinates in 2 arrays, do this inside your code.

My 2 cents.

score 5 · Answer 1 · 2020-02-04 21:48:18Z

Toward optimized functionality and design

since all crucial attributes x, y, x_mean, y_mean are initialized on LinearReg instance creation (within __init__ method) - no need to duplicate and define them as class attributes (x = [] ... y = []), those should be removed as redundant
raise Error - Error is not Python exception class. Use Exception or ValueError
axis_mean(self, axis) function does not use self context and deserves to be just a @staticmethod
sum_of_deviation_products/sum_of_x_deviation_squared functions
Substitute algorithm: instead of going with external variable result + range(len(..)) + for loop - apply a convenient combination of sum + enumerate functions:
```
def sum_of_deviation_products(self):
 return sum((self.x[i] - self.x_mean) * (y - self.y_mean)
 for i, y in enumerate(self.y))
```
A good (or better) alternative would be sum + zip approach: sum((x - self.x_mean) * (y - self.y_mean) for x, y in zip(self.x, self.y))
get_b_zero method is redundant as it's just return a public attribute self.b_zero (which is accessed directly)
get_slope method
Instead of storing and reassigning self.slope attribute on each method invocation - as it's recalculated each time, it deserves to be a computed attribute using @property decorator:
```
@property
def slope(self):
 return self.sum_of_deviation_products() / self.sum_of_x_deviation_squared()
```
Now, it's simply accessed with self.slope.
fit_best_line method
Instead of unreadable string concatenation like ... + "%.2f" % self.slope + "x + " + "%.2f" % self.b_zero use flexible f-string formatting:

print(f"The best line equation is: {self.slope:.2f}x + {self.b_zero:.2f}").

The conciseness and flexibility are obvious.
Also, in case if b_zero attribute/property happen to be only actual in context of fit_best_line call - it can be eliminated and declared as just local variable.

The final optimized approach:

class LinearReg:
 def __init__(self, x, y):
 if len(x) != len(y):
 raise ValueError("Both axis must have the same number of values")
 self.x = x
 self.y = y
 self.x_mean = self.axis_mean(self.x)
 self.y_mean = self.axis_mean(self.y)
 self.b_zero = 0
 @staticmethod
 def axis_mean(axis):
 return sum(axis) / len(axis)
 def sum_of_deviation_products(self):
 return sum((self.x[i] - self.x_mean) * (y - self.y_mean)
 for i, y in enumerate(self.y))
 def sum_of_x_deviation_squared(self):
 return sum((x - self.x_mean) ** 2
 for x in self.x)
 @property
 def slope(self):
 return self.sum_of_deviation_products() / self.sum_of_x_deviation_squared()
 def fit_best_line(self):
 self.b_zero = self.y_mean - (self.slope * self.x_mean)
 print(f"The best line equation is: {self.slope:.2f}x + {self.b_zero:.2f}")

Loop like a native, and avoid indexing. Using sum((x - self.x_mean) * (y - self.y_mean) for x, y in zip(self.x, self.y)) would be cleaner and faster. Ditto for sum((x - self.x_mean) ** 2 for x in self.x).
What's the use of adding @property in this case? Also would it be reasonable put self.slope = self.sum_of_deviation_products() / self.sum_of_x_deviation_squared() in the constructor? Also same goes for axis_mean.
@AJNeufeld, thanks for the hint, was writing a quick review at late night (could miss something)

AJNeufeld AJNeufeld 35.3k5 gold badges41 silver badges103 bronze badges · Answer 2 · 2020-02-05 16:28:12Z

Consistency

Your class does not seem consistent with itself. If you create a LinearReg object:

x = [1, 2, 3, 4, 5]
y = [2, 3, 2.5, 4, 3.5]
lr = LinearReg(x, y)

The constructor computes the mean of the x and y lists, and stores these for later computations. These means are fixed; you cannot add additional data points, as the mean will not be recomputed.

If you call lr.get_b_zero(), you will return the value 0, since the b_zero member has not been computed.

Each time you call lr.get_slope(), it recomputes the slope and stores the result. If you have changed the x and y lists, the computed value will be different ... and incorrect since the mean values have not changed.

Each time you call fit_best_line(), it calls get_slope() which recomputes the slope (possibly incorrectly if the data has changed), and uses the result to compute (or recompute) and store the b_zero value. And as a side effect, it prints the line equation.

If you want to use this class to get the best fit straight line, and use the slope & intercept in other calculations, you must:

Create the LinearReg object
Call fit_best_line() to compute the b_zero
- Ignore the spurious, unnecessary print output
Call get_slope() (unnecessarily recomputes the slope)
Call get_b_zero().
Use the return values.

I'm certain you'd agree this is somewhat awkward. It would not work well in a large library.

Stop Writing Classes

See the YouTube video Stop Writing Classes for more detail.

You don't need a class for this; just a single function. This function would:

Compute the means
Compute the slope
Compute the b_zero
Return the above

A class is unnecessary, because everything can be done in this one simple function call. No state needs to be maintained.

from collections import namedtuple
LinearReg = namedtuple('LinearReg', 'x_mean, y_mean, slope, b_zero')
def linear_regression(x, y):
 if len(x) != len(y):
 raise ValueError("Both axis must have the same number of values")
 x_mean = sum(x) / len(x)
 y_mean = sum(y) / len(y)
 sum_x2_dev = sum((xi - x_mean) ** 2 for xi in x)
 sum_xy_dev = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
 slope = sum_xy_dev / sum_x2_dev
 b_zero = y_mean - slope * x_mean
 return LinearReg(x_mean, y_mean, slope, b_zero)

Usage:

>>> lr = linear_regression([1, 2, 3, 4, 5], [2, 3, 2.5, 4, 3.5])
>>> lr.slope
0.4
>>> lr.b_zero
1.7999999999999998

We can slightly beefed up the namedtuple, adding some convenience methods to it:

class LinearReg(namedtuple('LinearReg', 'x_mean, y_mean, slope, b_zero')):
 def __str__(self):
 return f"The best line equation is: {self.slope:.2f}x + {self.b_zero:.2f}"
 def __call__(self, x):
 """
 Interpolation/extrapolation: return y-value for a given x-value
 """
 return self.slope * x + self.b_zero

And then we'd additionally have:

>>> str(lr)
'The best line equation is: 0.40x + 1.80'
>>> lr(6)
4.2

A Use Case for a Class

If your LinearReg object allowed additional data points to be added, then you would have "state" which could be maintained in your object. And then it would make sense to recompute the slope and b_zero values when requested.

I get your point on not writing a class for this purpose, however I feel like when creating a function for this purpose, it hurts the readablity (not saying my class is super readable but I feel like it's more clear to see what's going on).

VincentRG VincentRG 1814 bronze badges · Answer 3 · 2020-02-08 06:57:14Z

Some good points have been said, I'm not going over those again. Just wanted to say that your code is probably meant to be used by someone else. Even if it's not, your API should reflect what your code will be working on. I would not take 2 arrays for input, but an array of 2d coordinates. Then, if for your computation it's easier to split those coordinates in 2 arrays, do this inside your code.

My 2 cents.

Stack Exchange Network

Linear Regression Class in Python

3 Answers 3

Toward optimized functionality and design

Consistency

Stop Writing Classes

A Use Case for a Class

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Linear Regression Class in Python

3 Answers 3

Toward optimized functionality and design

Consistency

Stop Writing Classes

A Use Case for a Class

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions