I am relatively new to machine learning and I believe one of the best ways for me to get the intuition behind most algorithms is to write them from scratch before using tons of external libraries.
This model I wrote seems to yield reasonable results on the dataset I provided. The dataset records the number of hours a student studied for a test (x) and the score that student got on the test (y).
I tried to exploit OOP as much as I could, instead of using a procedural approach to write the algorithm.
Would you mind giving me your opinions and comments about this code? This is also important because I'll be adding it to my portfolio. Are there some missing good practices in the code? What would you recommend keeping or removing in a professional setting, or in general as a developer?
Univariate linear regression algorithm:
# Linear equation based on: y = m * x + b, which is the same as h = theta1 * x + theta0
import numpy as np
class LinearRegressionModel():
    """
    Univariate linear regression model classifier.
    """

    def __init__(self, dataset, learning_rate, num_iterations):
        """
        Class constructor.
        """
        self.dataset = np.array(dataset)
        self.b = 0  # Initial guess value for 'b'.
        self.m = 0  # Initial guess value for 'm'.
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations
        self.M = len(self.dataset)  # 100.
        self.total_error = 0

    def apply_gradient_descent(self):
        """
        Runs the gradient descent step 'num_iterations' times.
        """
        for i in range(self.num_iterations):
            self.do_gradient_step()

    def do_gradient_step(self):
        """
        Performs each step of gradient descent, tweaking 'b' and 'm'.
        """
        b_summation = 0
        m_summation = 0
        # Doing the summation here.
        for i in range(self.M):
            x_value = self.dataset[i, 0]
            y_value = self.dataset[i, 1]
            b_summation += (((self.m * x_value) + self.b) - y_value)  # * 1
            m_summation += (((self.m * x_value) + self.b) - y_value) * x_value
        # Updating parameter values 'b' and 'm'.
        self.b = self.b - (self.learning_rate * (1/self.M) * b_summation)
        self.m = self.m - (self.learning_rate * (1/self.M) * m_summation)
        # At this point, gradient descent is finished.

    def compute_error(self):
        """
        Computes the total error based on the linear regression cost function.
        """
        for i in range(self.M):
            x_value = self.dataset[i, 0]
            y_value = self.dataset[i, 1]
            self.total_error += ((self.m * x_value) + self.b) - y_value
        return self.total_error

    def __str__(self):
        return "Results: b: {}, m: {}, Final Total error: {}".format(round(self.b, 2), round(self.m, 2), round(self.compute_error(), 2))

    def get_prediction_based_on(self, x):
        return round(float((self.m * x) + self.b), 2)  # Type: Numpy float.
def main():
    # Loading dataset.
    school_dataset = np.genfromtxt(DATASET_PATH, delimiter=",")
    # Creating 'LinearRegressionModel' object.
    lr = LinearRegressionModel(school_dataset, 0.0001, 1000)
    # Applying gradient descent.
    lr.apply_gradient_descent()
    # Getting some predictions.
    hours = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    for hour in hours:
        print("Studied {} hours and got {} points.".format(hour, lr.get_prediction_based_on(hour)))
    # Printing the class attribute values.
    print(lr)
if __name__ == "__main__": main()
Dataset snippet:
32.502345269453031,31.70700584656992
53.426804033275019,68.77759598163891
61.530358025636438,62.562382297945803
47.475639634786098,71.546632233567777
59.813207869512318,87.230925133687393
55.142188413943821,78.211518270799232
52.550014442733818,71.300879886850353
45.419730144973755,55.165677145959123
About OOP
I tried to exploit OOP as much as I could, instead of using a procedural approach to write the algorithm.
Although I believe that your approach was fine, using OOP for the sake of OOP is something I would rather warn against. There is a talk about this here.
Comments
def __init__(self, dataset, learning_rate, num_iterations):
"""
Class constructor.
"""
The Class constructor comment is redundant and unnecessary; I would instead explain the parameters of __init__ in the docstring.
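For example (just a sketch of the wording), the docstring could describe the parameters instead:

def __init__(self, dataset, learning_rate, num_iterations):
    """
    dataset: iterable of (x, y) pairs; converted to a NumPy array.
    learning_rate: step size used for each gradient descent update.
    num_iterations: number of gradient descent steps to run.
    """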
self.M = len(self.dataset) # 100.
Is the # 100 comment saying that len(self.dataset) is always going to be 100? It might be 100 in this case, but I highly doubt you can ensure that.
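If the intent was to record what M means, a comment that stays true for any dataset would be safer, something like:

self.M = len(self.dataset)  # Number of training examples.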
Default values
Have you considered putting default values for learning_rate and num_iterations? If we want defaults of 100 and 0.001 for num_iterations and learning_rate respectively, you could rewrite __init__ like:
def __init__(self, dataset, learning_rate=0.001, num_iterations=100):
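With defaults in place, callers can rely on them or override them explicitly, for example:

lr = LinearRegressionModel(school_dataset)                 # uses the defaults
lr = LinearRegressionModel(school_dataset, 0.0001, 1000)   # explicit values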
Private methods
Do you really want do_gradient_step(self) to be considered public? Yes, there are no "true" private methods in Python, but the convention is to put one underscore before the name to indicate it is private. Honestly, I would just call it _step(self).
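A minimal sketch of the rename (the body is the same as your do_gradient_step, only the name and the call site change):

def apply_gradient_descent(self):
    """
    Runs the gradient descent step 'num_iterations' times.
    """
    for i in range(self.num_iterations):
        self._step()

def _step(self):
    # Same body as the old do_gradient_step.
    ...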
Indentation
if __name__ == "__main__": main()
should really be:
if __name__ == "__main__":
    main()
to comply with PEP 8.
Comment (baot, Jun 20, 2018): To comply with PEP 8 there should just be one underscore before a private method or variable.
Comment (Dair, Jun 20, 2018): @baot Thanks for the heads up. Changed it.
Are there some missing good practices in the code?
Notes about training methods for Linear Regression.
- Gradient Descent is slower but uses less memory.
- The normal equation, shown below, is faster but uses more memory.
Training member function
You did well in trying to use gradient descent to train a linear model. For most models, such as logistic regression, there is no closed-form solution for the weights. However, for linear regression with squared error you can calculate the weights directly with the normal equation:

$$\theta = (X^T X)^{-1} X^T y$$

Here theta stacks the intercept b and the slope m, X is the design matrix (a column of ones next to the x values), and y is the vector of targets.
You can just add this method to the class alongside your other training functions (this is a head start on how you could implement the equation):
def train_squared_error(self):
    # Design matrix: a column of ones (for the intercept) next to the x values.
    x = np.column_stack((np.ones(self.M), self.dataset[:, 0]))
    y = self.dataset[:, 1]
    # Normal equation: theta = (X^T X)^{-1} X^T y.
    self.b, self.m = np.linalg.inv(x.T @ x) @ x.T @ y
Note that this is going to be faster than gradient descent, because matrix multiplication like this with NumPy is very quick. Also, the @ symbol is the operator form of the .dot() method (dot/matrix product). I recommend testing this function, because I wrote it off the top of my head and haven't had time to check that it is 100% correct.
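A rough sketch of how the two training paths could be compared (this assumes train_squared_error has been added to the class as above):

lr_gd = LinearRegressionModel(school_dataset, 0.0001, 1000)
lr_gd.apply_gradient_descent()

lr_ne = LinearRegressionModel(school_dataset, 0.0001, 1000)  # rate/iterations unused here
lr_ne.train_squared_error()

print(lr_gd.m, lr_gd.b)  # parameters found by gradient descent
print(lr_ne.m, lr_ne.b)  # parameters from the normal equation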
Testing Suite
https://docs.python.org/3/library/unittest.html
I also recommend testing the class extensively by creating a unit test class like below:
import unittest

class TestStringMethods(unittest.TestCase):

    def test_upper(self):
        self.assertEqual('foo'.upper(), 'FOO')

    def test_isupper(self):
        self.assertTrue('FOO'.isupper())
        self.assertFalse('Foo'.isupper())

    def test_split(self):
        s = 'hello world'
        self.assertEqual(s.split(), ['hello', 'world'])
        # check that s.split fails when the separator is not a string
        with self.assertRaises(TypeError):
            s.split(2)

if __name__ == '__main__':
    unittest.main()
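As a sketch of a test tailored to this class (the module name linear_regression is just an assumption about where the class lives; the expected slope and intercept come from the synthetic data built inside the test):

import unittest
import numpy as np

from linear_regression import LinearRegressionModel  # assumed module name

class TestLinearRegressionModel(unittest.TestCase):

    def test_recovers_known_line(self):
        # Points lying exactly on y = 2x + 1, so the fitted line should match.
        x = np.arange(10, dtype=float)
        data = np.column_stack((x, 2 * x + 1))
        model = LinearRegressionModel(data, 0.01, 5000)
        model.apply_gradient_descent()
        self.assertAlmostEqual(model.m, 2.0, places=1)
        self.assertAlmostEqual(model.b, 1.0, places=1)

if __name__ == '__main__':
    unittest.main()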