I have recently been brushing up on my statistics and calculus and wanted to implement Linear Regression, the code will be for a calculus/statistics library I am working on (I know there are libraries for this but I am trying improve my both coding and math skills).
The following code works as intended. However, I feel like it has some structural flaws and I can't put my finger on what it is
class LinearReg:
x = []
y = []
x_mean = 0
y_mean = 0
b_zero = 0
slope = 0
def __init__(self,x,y):
if len(x) != len(y):
raise Error("Both axis must have the same number of values")
self.x = x
self.y = y
self.x_mean = self.axis_mean(self.x)
self.y_mean = self.axis_mean(self.y)
def axis_mean(self,axis):
return sum(axis) / len(axis)
def sum_of_deviation_products(self):
result = 0
for i in range(len(self.y)):
x_dev = (self.x[i] - self.x_mean)
y_dev = (self.y[i] - self.y_mean)
result += x_dev * y_dev
return result
def sum_of_x_deviation_squared(self):
result = 0
for i in range(len(self.x)):
result += (self.x[i] - self.x_mean)**2
return result
def get_b_zero(self):
return self.b_zero
def get_slope(self):
self.slope = self.sum_of_deviation_products() / self.sum_of_x_deviation_squared()
return self.slope
def fit_best_line(self):
self.b_zero = self.y_mean - (self.get_slope() * self.x_mean)
print("The best line equation is: " +
"%.2f" % self.slope
+ "x + " +
"%.2f" % self.b_zero)
My primary focus is now: structure and cleanliness,
- How efficient would this play out in a large library?
- How clean and readable is the code?
- What kind of problems would I face if this class were to interact with other classes in the library?
Perhaps there is quicker way to do Linear Regression in Theory, but I'm not really interested in that right now.
3 Answers 3
Toward optimized functionality and design
since all crucial attributes
x
,y
,x_mean
,y_mean
are initialized onLinearReg
instance creation (within__init__
method) - no need to duplicate and define them as class attributes (x = [] ... y = []
), those should be removed as redundantraise Error
-Error
is not Python exception class. UseException
orValueError
axis_mean(self, axis)
function does not useself
context and deserves to be just a@staticmethod
sum_of_deviation_products/sum_of_x_deviation_squared
functions
Substitute algorithm: instead of going with external variableresult
+range(len(..))
+for
loop - apply a convenient combination ofsum
+enumerate
functions:def sum_of_deviation_products(self): return sum((self.x[i] - self.x_mean) * (y - self.y_mean) for i, y in enumerate(self.y))
A good (or better) alternative would be
sum
+zip
approach:sum((x - self.x_mean) * (y - self.y_mean) for x, y in zip(self.x, self.y))
get_b_zero
method is redundant as it's just return a public attributeself.b_zero
(which is accessed directly)get_slope
method
Instead of storing and reassigningself.slope
attribute on each method invocation - as it's recalculated each time, it deserves to be a computed attribute using@property
decorator:@property def slope(self): return self.sum_of_deviation_products() / self.sum_of_x_deviation_squared()
Now, it's simply accessed with
self.slope
.fit_best_line
method
Instead of unreadable string concatenation like... + "%.2f" % self.slope + "x + " + "%.2f" % self.b_zero
use flexiblef-string
formatting:print(f"The best line equation is: {self.slope:.2f}x + {self.b_zero:.2f}")
.The conciseness and flexibility are obvious.
Also, in case ifb_zero
attribute/property happen to be only actual in context offit_best_line
call - it can be eliminated and declared as just local variable.
The final optimized approach:
class LinearReg:
def __init__(self, x, y):
if len(x) != len(y):
raise ValueError("Both axis must have the same number of values")
self.x = x
self.y = y
self.x_mean = self.axis_mean(self.x)
self.y_mean = self.axis_mean(self.y)
self.b_zero = 0
@staticmethod
def axis_mean(axis):
return sum(axis) / len(axis)
def sum_of_deviation_products(self):
return sum((self.x[i] - self.x_mean) * (y - self.y_mean)
for i, y in enumerate(self.y))
def sum_of_x_deviation_squared(self):
return sum((x - self.x_mean) ** 2
for x in self.x)
@property
def slope(self):
return self.sum_of_deviation_products() / self.sum_of_x_deviation_squared()
def fit_best_line(self):
self.b_zero = self.y_mean - (self.slope * self.x_mean)
print(f"The best line equation is: {self.slope:.2f}x + {self.b_zero:.2f}")
-
\$\begingroup\$ Loop like a native, and avoid indexing. Using
sum((x - self.x_mean) * (y - self.y_mean) for x, y in zip(self.x, self.y))
would be cleaner and faster. Ditto forsum((x - self.x_mean) ** 2 for x in self.x)
. \$\endgroup\$AJNeufeld– AJNeufeld2020年02月04日 23:18:27 +00:00Commented Feb 4, 2020 at 23:18 -
\$\begingroup\$ What's the use of adding
@property
in this case? Also would it be reasonable putself.slope = self.sum_of_deviation_products() / self.sum_of_x_deviation_squared()
in the constructor? Also same goes for axis_mean. \$\endgroup\$Doğukan Özdemir– Doğukan Özdemir2020年02月05日 07:03:18 +00:00Commented Feb 5, 2020 at 7:03 -
\$\begingroup\$ @AJNeufeld, thanks for the hint, was writing a quick review at late night (could miss something) \$\endgroup\$RomanPerekhrest– RomanPerekhrest2020年02月05日 07:59:09 +00:00Commented Feb 5, 2020 at 7:59
Consistency
Your class does not seem consistent with itself. If you create a LinearReg
object:
x = [1, 2, 3, 4, 5]
y = [2, 3, 2.5, 4, 3.5]
lr = LinearReg(x, y)
The constructor computes the mean of the x
and y
lists, and stores these for later computations. These means are fixed; you cannot add additional data points, as the mean will not be recomputed.
If you call lr.get_b_zero()
, you will return the value 0
, since the b_zero
member has not been computed.
Each time you call lr.get_slope()
, it recomputes the slope and stores the result. If you have changed the x
and y
lists, the computed value will be different ... and incorrect since the mean values have not changed.
Each time you call fit_best_line()
, it calls get_slope()
which recomputes the slope (possibly incorrectly if the data has changed), and uses the result to compute (or recompute) and store the b_zero
value. And as a side effect, it prints the line equation.
If you want to use this class to get the best fit straight line, and use the slope & intercept in other calculations, you must:
- Create the
LinearReg
object - Call
fit_best_line()
to compute the b_zero- Ignore the spurious, unnecessary print output
- Call
get_slope()
(unnecessarily recomputes the slope) - Call
get_b_zero()
. - Use the return values.
I'm certain you'd agree this is somewhat awkward. It would not work well in a large library.
Stop Writing Classes
See the YouTube video Stop Writing Classes for more detail.
You don't need a class for this; just a single function. This function would:
- Compute the means
- Compute the slope
- Compute the b_zero
- Return the above
A class is unnecessary, because everything can be done in this one simple function call. No state needs to be maintained.
from collections import namedtuple
LinearReg = namedtuple('LinearReg', 'x_mean, y_mean, slope, b_zero')
def linear_regression(x, y):
if len(x) != len(y):
raise ValueError("Both axis must have the same number of values")
x_mean = sum(x) / len(x)
y_mean = sum(y) / len(y)
sum_x2_dev = sum((xi - x_mean) ** 2 for xi in x)
sum_xy_dev = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
slope = sum_xy_dev / sum_x2_dev
b_zero = y_mean - slope * x_mean
return LinearReg(x_mean, y_mean, slope, b_zero)
Usage:
>>> lr = linear_regression([1, 2, 3, 4, 5], [2, 3, 2.5, 4, 3.5])
>>> lr.slope
0.4
>>> lr.b_zero
1.7999999999999998
We can slightly beefed up the namedtuple
, adding some convenience methods to it:
class LinearReg(namedtuple('LinearReg', 'x_mean, y_mean, slope, b_zero')):
def __str__(self):
return f"The best line equation is: {self.slope:.2f}x + {self.b_zero:.2f}"
def __call__(self, x):
"""
Interpolation/extrapolation: return y-value for a given x-value
"""
return self.slope * x + self.b_zero
And then we'd additionally have:
>>> str(lr)
'The best line equation is: 0.40x + 1.80'
>>> lr(6)
4.2
A Use Case for a Class
If your LinearReg
object allowed additional data points to be added, then you would have "state" which could be maintained in your object. And then it would make sense to recompute the slope and b_zero values when requested.
-
\$\begingroup\$ I get your point on not writing a class for this purpose, however I feel like when creating a function for this purpose, it hurts the readablity (not saying my class is super readable but I feel like it's more clear to see what's going on). \$\endgroup\$Doğukan Özdemir– Doğukan Özdemir2020年02月11日 10:16:05 +00:00Commented Feb 11, 2020 at 10:16
Some good points have been said, I'm not going over those again. Just wanted to say that your code is probably meant to be used by someone else. Even if it's not, your API should reflect what your code will be working on. I would not take 2 arrays for input, but an array of 2d coordinates. Then, if for your computation it's easier to split those coordinates in 2 arrays, do this inside your code.
My 2 cents.
Explore related questions
See similar questions with these tags.