I am relatively new to machine learning and I believe one of the best ways for me to get the intuition behind most algorithms is to write them from scratch before using tons of external libraries.
This model I wrote seems to yield reasonable results on the dataset I provided. The dataset records the number of hours a student studied for a test (x) and the score that student got on the test (y).
I tried to exploit OOP as much as I could, instead of using a procedural approach to write the algorithm.
Would you mind giving me your opinions and comments about this code? This is also important because I'll be adding it to my portfolio. Are there some missing good practices in the code? What would you recommend keeping or removing in a professional setting, or in general as a developer?
Univariate linear regression algorithm:
# Linear equation based on: y = m * x + b, which is the same as h = theta1 * x + theta0
import numpy as np
class LinearRegressionModel():
    """
    Univariate linear regression model classifier.
    """

    def __init__(self, dataset, learning_rate, num_iterations):
        """
        Class constructor.
        """
        self.dataset = np.array(dataset)
        self.b = 0  # Initial guess value for 'b'.
        self.m = 0  # Initial guess value for 'm'.
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations
        self.M = len(self.dataset)  # 100.
        self.total_error = 0

    def apply_gradient_descent(self):
        """
        Runs the gradient descent step 'num_iterations' times.
        """
        for i in range(self.num_iterations):
            self.do_gradient_step()

    def do_gradient_step(self):
        """
        Performs each step of gradient descent, tweaking 'b' and 'm'.
        """
        b_summation = 0
        m_summation = 0
        # Doing the summation here.
        for i in range(self.M):
            x_value = self.dataset[i, 0]
            y_value = self.dataset[i, 1]
            b_summation += (((self.m * x_value) + self.b) - y_value)  # * 1
            m_summation += (((self.m * x_value) + self.b) - y_value) * x_value
        # Updating parameter values 'b' and 'm'.
        self.b = self.b - (self.learning_rate * (1/self.M) * b_summation)
        self.m = self.m - (self.learning_rate * (1/self.M) * m_summation)
        # At this point, gradient descent is finished.

    def compute_error(self):
        """
        Computes the total error based on the linear regression cost function.
        """
        for i in range(self.M):
            x_value = self.dataset[i, 0]
            y_value = self.dataset[i, 1]
            self.total_error += ((self.m * x_value) + self.b) - y_value
        return self.total_error

    def __str__(self):
        return "Results: b: {}, m: {}, Final Total error: {}".format(round(self.b, 2), round(self.m, 2), round(self.compute_error(), 2))

    def get_prediction_based_on(self, x):
        return round(float((self.m * x) + self.b), 2)  # Type: Numpy float.
def main():
    # Loading dataset.
    school_dataset = np.genfromtxt(DATASET_PATH, delimiter=",")
    # Creating 'LinearRegressionModel' object.
    lr = LinearRegressionModel(school_dataset, 0.0001, 1000)
    # Applying gradient descent.
    lr.apply_gradient_descent()
    # Getting some predictions.
    hours = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    for hour in hours:
        print("Studied {} hours and got {} points.".format(hour, lr.get_prediction_based_on(hour)))
    # Printing the class attribute values.
    print(lr)
if __name__ == "__main__": main()
Dataset snippet:
32.502345269453031,31.70700584656992
53.426804033275019,68.77759598163891
61.530358025636438,62.562382297945803
47.475639634786098,71.546632233567777
59.813207869512318,87.230925133687393
55.142188413943821,78.211518270799232
52.550014442733818,71.300879886850353
45.419730144973755,55.165677145959123
About OOP
I tried to exploit OOP as much as I could, instead of using a procedural approach to write the algorithm.
Although I believe that your approach was fine, using OOP for the sake of OOP is something I would rather warn against. There is a talk about this here.
Comments
def __init__(self, dataset, learning_rate, num_iterations):
"""
Class constructor.
"""
The Class constructor comment is redundant and unnecessary; I would instead explain the parameters of __init__ in the docstring.
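For example (just a sketch of the wording), the docstring could describe the parameters instead:

def __init__(self, dataset, learning_rate, num_iterations):
    """
    dataset: iterable of (x, y) pairs; converted to a NumPy array.
    learning_rate: step size used for each gradient descent update.
    num_iterations: number of gradient descent steps to run.
    """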
self.M = len(self.dataset) # 100.
Is the # 100 comment saying that len(self.dataset) is always going to be 100? It might be 100 in this case, but I highly doubt you can ensure that.
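If the intent was to record what M means, a comment that stays true for any dataset would be safer, something like:

self.M = len(self.dataset)  # Number of training examples.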
Default values
Have you considered putting default values for learning_rate and num_iterations? If we want defaults of 100 and 0.001 for num_iterations and learning_rate respectively, you could rewrite __init__ like:
def __init__(self, dataset, learning_rate=0.001, num_iterations=100):
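With defaults in place, callers can rely on them or override them explicitly, for example:

lr = LinearRegressionModel(school_dataset)                 # uses the defaults
lr = LinearRegressionModel(school_dataset, 0.0001, 1000)   # explicit values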
Private methods
Do you really want do_gradient_step(self) to be considered public? Yes, there are no "true" private methods in Python, but the convention is to put one underscore before the name to indicate it is private. Honestly, I would just call it _step(self).
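A minimal sketch of the rename (the body is the same as your do_gradient_step, only the name and the call site change):

def apply_gradient_descent(self):
    """
    Runs the gradient descent step 'num_iterations' times.
    """
    for i in range(self.num_iterations):
        self._step()

def _step(self):
    # Same body as the old do_gradient_step.
    ...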
Indentation
if __name__ == "__main__": main()
should really be:
if __name__ == "__main__":
    main()
to comply with PEP 8.
Comment (baot, Jun 20, 2018): To comply with PEP 8 there should just be one underscore before a private method or variable.
Comment (Dair, Jun 20, 2018): @baot Thanks for the heads up. Changed it.
Are there some missing good practices in the code?
Notes about training methods for Linear Regression.
- Gradient Descent is slower but uses less memory.
- The normal equation, shown below, is faster but uses more memory.
Training member function
You did well in trying to use gradient descent to train a linear model. For most models, such as logistic regression, there is no closed-form solution for the weights. However, for linear regression with squared error you can calculate the weights directly with the normal equation:

$$\theta = (X^T X)^{-1} X^T y$$

Here theta stacks the intercept b and the slope m, X is the design matrix (a column of ones next to the x values), and y is the vector of targets.
You can just add this method to the class alongside your other training functions (this is a head start on how you could implement the equation):
def train_squared_error(self):
    # Design matrix: a column of ones (for the intercept) next to the x values.
    x = np.column_stack((np.ones(self.M), self.dataset[:, 0]))
    y = self.dataset[:, 1]
    # Normal equation: theta = (X^T X)^{-1} X^T y.
    self.b, self.m = np.linalg.inv(x.T @ x) @ x.T @ y
Note that this is going to be faster than gradient descent, because matrix multiplication like this with NumPy is very quick. Also, the @ symbol is the operator form of the .dot() method (dot/matrix product). I recommend testing this function, because I wrote it off the top of my head and haven't had time to check that it is 100% correct.
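A rough sketch of how the two training paths could be compared (this assumes train_squared_error has been added to the class as above):

lr_gd = LinearRegressionModel(school_dataset, 0.0001, 1000)
lr_gd.apply_gradient_descent()

lr_ne = LinearRegressionModel(school_dataset, 0.0001, 1000)  # rate/iterations unused here
lr_ne.train_squared_error()

print(lr_gd.m, lr_gd.b)  # parameters found by gradient descent
print(lr_ne.m, lr_ne.b)  # parameters from the normal equation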
Testing Suite
https://docs.python.org/3/library/unittest.html
I also recommend testing the class extensively by creating a unit test class like below:
import unittest

class TestStringMethods(unittest.TestCase):

    def test_upper(self):
        self.assertEqual('foo'.upper(), 'FOO')

    def test_isupper(self):
        self.assertTrue('FOO'.isupper())
        self.assertFalse('Foo'.isupper())

    def test_split(self):
        s = 'hello world'
        self.assertEqual(s.split(), ['hello', 'world'])
        # check that s.split fails when the separator is not a string
        with self.assertRaises(TypeError):
            s.split(2)

if __name__ == '__main__':
    unittest.main()
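As a sketch of a test tailored to this class (the module name linear_regression is just an assumption about where the class lives; the expected slope and intercept come from the synthetic data built inside the test):

import unittest
import numpy as np

from linear_regression import LinearRegressionModel  # assumed module name

class TestLinearRegressionModel(unittest.TestCase):

    def test_recovers_known_line(self):
        # Points lying exactly on y = 2x + 1, so the fitted line should match.
        x = np.arange(10, dtype=float)
        data = np.column_stack((x, 2 * x + 1))
        model = LinearRegressionModel(data, 0.01, 5000)
        model.apply_gradient_descent()
        self.assertAlmostEqual(model.m, 2.0, places=1)
        self.assertAlmostEqual(model.b, 1.0, places=1)

if __name__ == '__main__':
    unittest.main()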