I wrote this code for an assignment. It was originally meant to be written in Maple, but I got so frustrated with some of Maple's idiosyncrasies that I decided to play around with Pandas instead. This is a very trivial multi-linear regression model, which calculates variable weights using least-squares optimisation, and also allows for basic forward selection and backward elimination for model refinement (both without any form of backtracking). All suggestions welcome.

import pandas as pd
from numpy import dot, mean, sqrt
from numpy.linalg import inv


def _weights(X, Y):
    # Least squares solution for the w that minimises
    # abs(Y - dot(X, w)).
    # In newer Python and NumPy, the following
    # abomination-filled lines will become the much nicer:
    #   weights = (inv(X.T @ X) @ X.T) @ Y
    #   return pd.Series(weights, index=X.columns)
    abomination = inv(dot(X.T, X))
    abomination = dot(abomination, X.T)
    abomination = dot(abomination, Y)
    return pd.Series(abomination, index=X.columns)


class LinearRegression:
    '''
    (Multi)linear regression model using least-squares error
    minimisation. The weights calculated for each variable are
    available in the Series self.weights, whose labels are aligned
    to the columns of X; the constant coefficient has the label ''.
    '''
    def __init__(self, X, Y):
        '''
        X: a pandas DataFrame of the independent variables
        Y: a Series of the single dependent variable
        '''
        self.X = X
        self.observed_Y = Y
        if not self.vars:
            # No independent vars => every fitted Y is equal
            # (simple linear model with gradient = 0).
            intercept = mean(Y)
            self.weights = pd.Series([intercept], index=[''])
            self.fitted_Y = pd.Series(intercept, index=Y.index)
        else:
            # Augment the X with a column of 1s at the left;
            # then the weights will come back with a
            # constant coefficient at the top.
            ones_column = pd.DataFrame({'': 1}, index=X.index)
            augmented_X = ones_column.join(X)
            self.weights = _weights(augmented_X, Y)
            self.fitted_Y = augmented_X.dot(self.weights)

    @classmethod
    def empty(cls, Y):
        '''
        Create a model with the given observations for the
        dependent variable and *no* independent variables.
        '''
        X = pd.DataFrame([], index=Y.index)
        return cls(X, Y)

    @property
    def vars(self):
        # Needs to be a list so that, e.g., `if self.vars:`
        # is a test for the empty model. If this were
        # instead a pandas Index, that would be an error.
        return list(self.X.columns)

    def backward_elimination(self, threshold):
        '''
        Simplify the model by the method of backward elimination.
        Drop columns one at a time, choosing the column with the
        least impact on the model's RMSE, but only if that impact
        is within `threshold`.
        '''
        Y = self.observed_Y
        overall_best = self

        def impact(m):
            return abs(m.rmse - overall_best.rmse)

        while overall_best.vars:
            X = overall_best.X
            candidates = (type(self)(X.drop(i, axis=1), Y)
                          for i in overall_best.vars)
            best = min(candidates, key=impact)
            if impact(best) < threshold:
                overall_best = best
            else:
                break
        return overall_best

    def forward_selection(self, threshold):
        '''
        Improve the model by the method of forward selection.
        Starting with the empty model, progressively add one column
        at a time, choosing the one with the best improvement to
        RMSE, but only if that improvement is at least `threshold`.
        '''
        Y = self.observed_Y
        overall_best = type(self).empty(Y)

        def improvement(m):
            return overall_best.rmse - m.rmse

        while len(overall_best.vars) < len(self.vars):
            X = overall_best.X
            candidates = (type(self)(X.join(self.X[i]), Y)
                          for i in self.vars
                          if i not in overall_best.vars)
            best = max(candidates, key=improvement)
            if improvement(best) >= threshold:
                overall_best = best
            else:
                break
        return overall_best

    @property
    def residuals(self):
        return self.observed_Y - self.fitted_Y

    @property
    def rmse(self):
        return sqrt(mean(self.residuals**2))

    def __str__(self):
        y = self.observed_Y.name
        intercept = "{:.3f}".format(self.weights[''])
        xs = ('{:=+7.3f}*{}'.format(self.weights[n], n)
              for n in self.vars)
        return '{} = {} {}'.format(y, intercept, ' '.join(xs))
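As a quick sanity check of the normal-equations formula (a standalone sketch using the `@` operator mentioned in the comments; the toy data here is my own, not from the assignment):

```python
import numpy as np
import pandas as pd

# Toy data constructed so that Y = 3 + 2*a - 1*b exactly;
# least squares should therefore recover these coefficients.
X = pd.DataFrame({'a': [0., 1., 2., 3.], 'b': [1., 0., 1., 0.]})
Y = 3 + 2 * X['a'] - 1 * X['b']

# Augment with a column of ones for the intercept, as in __init__.
augmented = pd.DataFrame({'': 1.0}, index=X.index).join(X)

# weights = (X^T X)^{-1} X^T Y, written with the @ operator.
A = augmented.to_numpy()
w = np.linalg.inv(A.T @ A) @ A.T @ Y.to_numpy()
weights = pd.Series(w, index=augmented.columns)
print(weights.round(3))  # intercept 3.0, a -> 2.0, b -> -1.0
```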
asked Oct 5, 2015 at 14:52
1 Answer
Two things confuse me about _weights. One, why is there an underscore before the name? That's usually a convention indicating that a variable or method is ostensibly private and shouldn't be used externally. Which brings me to my second point: why is this created as a standalone function when its use is tied to the LinearRegression class? Solve both problems and put _weights inside LinearRegression.
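One way to apply that suggestion is a @staticmethod; while there, np.linalg.lstsq avoids forming the explicit inverse, which is more numerically stable for ill-conditioned X. A sketch (the lstsq swap is my own suggestion, not part of the original code):

```python
import numpy as np
import pandas as pd

class LinearRegression:
    # ... rest of the class as in the question ...

    @staticmethod
    def _weights(X, Y):
        # Least-squares solution for the w minimising ||Y - X @ w||.
        # lstsq is preferable to inv(X.T @ X) @ X.T @ Y: it copes
        # with ill-conditioned (even rank-deficient) X gracefully.
        w, *_ = np.linalg.lstsq(X.to_numpy(), Y.to_numpy(), rcond=None)
        return pd.Series(w, index=X.columns)
```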

answered Oct 5, 2015 at 15:14