I wrote this code for an assignment. It was originally meant to be written in Maple, but I got so frustrated with some of Maple's idiosyncrasies that I decided to play around with Pandas instead. This is a very trivial multi-linear regression model, which calculates variable weights using least-squares optimisation, and also allows for basic forward selection and backward elimination for model refinement (both without any form of backtracking). All suggestions welcome.

import pandas as pd
from numpy import dot, mean, sqrt
from numpy.linalg import inv


def _weights(X, Y):
    # Least squares solution for the w that minimises
    # abs(Y - dot(X, w)).
    # In newer Python and NumPy, the following
    # abomination-filled lines will become the much nicer:
    #   weights = (inv(X.T @ X) @ X.T) @ Y
    #   return pd.Series(weights, index=X.columns)
    abomination = inv(dot(X.T, X))
    abomination = dot(abomination, X.T)
    abomination = dot(abomination, Y)
    return pd.Series(abomination, index=X.columns)


class LinearRegression:
    '''
    (Multi)linear regression model using least-squares error
    minimisation. The weights calculated for each variable are
    available in the Series self.weights, whose labels are aligned
    to the columns of X; the constant coefficient has the label ''.
    '''
    def __init__(self, X, Y):
        '''
        X: a pandas DataFrame of the independent variables
        Y: a Series of the single dependent variable
        '''
        self.X = X
        self.observed_Y = Y
        if not self.vars:
            # No independent vars => every fitted Y is equal
            # (simple linear model with gradient = 0).
            intercept = mean(Y)
            self.weights = pd.Series([intercept], index=[''])
            self.fitted_Y = pd.Series(intercept, index=Y.index)
        else:
            # Augment the X with a column of 1s at the left;
            # then the weights will come back with a
            # constant coefficient at the top.
            ones_column = pd.DataFrame({'': 1}, index=X.index)
            augmented_X = ones_column.join(X)
            self.weights = _weights(augmented_X, Y)
            self.fitted_Y = augmented_X.dot(self.weights)

    @classmethod
    def empty(cls, Y):
        '''
        Create a model with the given observations for the
        dependent variable and *no* independent variables.
        '''
        X = pd.DataFrame([], index=Y.index)
        return cls(X, Y)

    @property
    def vars(self):
        # Needs to be a list so that, e.g., `if self.vars:`
        # is a test for the empty model. If this were
        # instead a pandas Index, that would be an error.
        return list(self.X.columns)

    def backward_elimination(self, threshold):
        '''
        Simplify the model by the method of backward elimination.
        Drop columns one at a time, choosing the column with the
        least impact on the model's RMSE, but only if that impact
        is within `threshold`.
        '''
        Y = self.observed_Y
        overall_best = self

        def impact(m):
            return abs(m.rmse - overall_best.rmse)

        while overall_best.vars:
            X = overall_best.X
            candidates = (type(self)(X.drop(i, axis=1), Y)
                          for i in overall_best.vars)
            best = min(candidates, key=impact)
            if impact(best) < threshold:
                overall_best = best
            else:
                break
        return overall_best

    def forward_selection(self, threshold):
        '''
        Improve the model by the method of forward selection.
        Starting with the empty model, progressively add one column
        at a time, choosing the one with the best improvement to
        RMSE, but only if that improvement is at least `threshold`.
        '''
        Y = self.observed_Y
        overall_best = type(self).empty(Y)

        def improvement(m):
            return overall_best.rmse - m.rmse

        while len(overall_best.vars) < len(self.vars):
            X = overall_best.X
            candidates = (type(self)(X.join(self.X[i]), Y)
                          for i in self.vars
                          if i not in overall_best.vars)
            best = max(candidates, key=improvement)
            if improvement(best) >= threshold:
                overall_best = best
            else:
                break
        return overall_best

    @property
    def residuals(self):
        return self.observed_Y - self.fitted_Y

    @property
    def rmse(self):
        return sqrt(mean(self.residuals**2))

    def __str__(self):
        y = self.observed_Y.name
        intercept = "{:.3f}".format(self.weights[''])
        xs = ('{:=+7.3f}*{}'.format(self.weights[n], n)
              for n in self.vars)
        return '{} = {} {}'.format(y, intercept, ' '.join(xs))
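As a quick sanity check of the normal-equations formula (a standalone sketch using the `@` operator mentioned in the comments; the toy data here is my own, not from the assignment):

```python
import numpy as np
import pandas as pd

# Toy data constructed so that Y = 3 + 2*a - 1*b exactly;
# least squares should therefore recover these coefficients.
X = pd.DataFrame({'a': [0., 1., 2., 3.], 'b': [1., 0., 1., 0.]})
Y = 3 + 2 * X['a'] - 1 * X['b']

# Augment with a column of ones for the intercept, as in __init__.
augmented = pd.DataFrame({'': 1.0}, index=X.index).join(X)

# weights = (X^T X)^{-1} X^T Y, written with the @ operator.
A = augmented.to_numpy()
w = np.linalg.inv(A.T @ A) @ A.T @ Y.to_numpy()
weights = pd.Series(w, index=augmented.columns)
print(weights.round(3))  # intercept 3.0, a -> 2.0, b -> -1.0
```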
asked Oct 5, 2015 at 14:52
1 Answer
Two things confuse me about _weights. One, why is there an underscore before the name? That's usually a convention indicating that a variable or method is ostensibly private and shouldn't be used externally. Which brings me to my second point: why is this created as a standalone function when its use is tied to the LinearRegression class? Solve both problems and put _weights inside LinearRegression.
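One way to apply that suggestion is a @staticmethod; while there, np.linalg.lstsq avoids forming the explicit inverse, which is more numerically stable for ill-conditioned X. A sketch (the lstsq swap is my own suggestion, not part of the original code):

```python
import numpy as np
import pandas as pd

class LinearRegression:
    # ... rest of the class as in the question ...

    @staticmethod
    def _weights(X, Y):
        # Least-squares solution for the w minimising ||Y - X @ w||.
        # lstsq is preferable to inv(X.T @ X) @ X.T @ Y: it copes
        # with ill-conditioned (even rank-deficient) X gracefully.
        w, *_ = np.linalg.lstsq(X.to_numpy(), Y.to_numpy(), rcond=None)
        return pd.Series(w, index=X.columns)
```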

answered Oct 5, 2015 at 15:14