The Iris data set consists of three classes, of which versicolor and virginica are not linearly separable from each other.
I constructed a subset containing only these two classes. Here is the code:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
iris = load_iris()
x_train = iris.data[50:]
y_train = iris.target[50:]
y_train = y_train - 1
x_train, x_test, y_train, y_test = train_test_split(
    x_train, y_train, test_size=0.33, random_state=2021)
Then I built a logistic regression model for this binary classification:
def sigmoid(z):
    s = 1 / (1 + np.exp(-z))
    return s

class LogisticRegression:
    def __init__(self, eta=.05, n_epoch=10, model_w=np.full(4, .5), model_b=.0):
        self.eta = eta
        self.n_epoch = n_epoch
        self.model_w = model_w
        self.model_b = model_b

    def activation(self, x):
        z = np.dot(x, self.model_w) + self.model_b
        return sigmoid(z)

    def predict(self, x):
        a = self.activation(x)
        if a >= 0.5:
            return 1
        else:
            return 0

    def update_weights(self, x, y, verbose=False):
        a = self.activation(x)
        dz = a - y
        self.model_w -= self.eta * dz * x
        self.model_b -= self.eta * dz

    def fit(self, x, y, verbose=False, seed=None):
        indices = np.arange(len(x))
        for i in range(self.n_epoch):
            n_iter = 0
            np.random.seed(seed)
            np.random.shuffle(indices)
            for idx in indices:
                if(self.predict(x[idx])!=y[idx]):
                    self.update_weights(x[idx], y[idx], verbose)
                else:
                    n_iter += 1
            if(n_iter==len(x)):
                print('model gets 100% train accuracy after {} epoch(s)'.format(i))
                break
I added the seed parameter for reproducibility.
import time
start_time = time.time()
w_mnist = np.full(4, .1)
classifier_mnist = LogisticRegression(.05, 1000, w_mnist)
classifier_mnist.fit(x_train, y_train, seed=0)
print('model trained {:.5f} s'.format(time.time() - start_time))
y_prediction = np.array(list(map(classifier_mnist.predict, x_train)))
acc = np.count_nonzero(y_prediction==y_train)
print('train accuracy {:.5f}'.format(acc/len(y_train)))
y_prediction = np.array(list(map(classifier_mnist.predict, x_test)))
acc = np.count_nonzero(y_prediction==y_test)
print('test accuracy {:.5f}'.format(acc/len(y_test)))
The accuracy is
train accuracy 0.95522
test accuracy 0.96970
The code is also in my GitHub repo.
1 Answer
This is a very nice little project, but there are some things to improve here :)
Code beautification
- Split everything into functions. There is no reason to keep logic outside of a function, and that includes the prediction part (wrapping it in a function removes the code duplication). Call everything from a main function. For example, a loading function:
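from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

def load_and_split_iris(data_cut: int = 50, train_test_ratio: float = 0.333):
    # Keep only the rows from data_cut onward (versicolor and virginica).
    iris = load_iris()
    x_train = iris.data[data_cut:]
    y_train = iris.target[data_cut:]
    # Shift the labels from {1, 2} to {0, 1}.
    y_train = y_train - 1
    x_train, x_test, y_train, y_test = train_test_split(
        x_train, y_train, test_size=train_test_ratio, random_state=2021)
    return x_train, x_test, y_train, y_test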
- Magic numbers make your code look bad; turn them into CODE_CONSTANTS.
- I really like type annotations. They make your code easier to understand later, and you will not get confused about the types. I added them in the loading function above. Another example:
    def fit(self, x: np.ndarray, y: np.ndarray, verbose: bool=False, seed: int=None):
  Type annotations can also declare the return type; read into that.
- String formatting: take this
    'model gets 100% train accuracy after {} epoch(s)'.format(i)
  and turn it into an f-string:
    f'model gets 100% train accuracy after {i} epoch(s)'
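As a minimal sketch of the first two points together: named constants in place of the magic numbers, and one typed helper that replaces the duplicated train/test accuracy code. The names VERSICOLOR_START, TEST_SIZE, RANDOM_STATE, accuracy and report are hypothetical, chosen here only for illustration.

```python
from typing import Callable

import numpy as np

# Hypothetical constants standing in for the magic numbers in the question.
VERSICOLOR_START = 50   # index of the first versicolor row in iris.data
TEST_SIZE = 0.33
RANDOM_STATE = 2021


def accuracy(predict: Callable[[np.ndarray], int],
             x: np.ndarray, y: np.ndarray) -> float:
    """Return the fraction of samples in x whose prediction equals y."""
    predictions = np.array([predict(sample) for sample in x])
    return float(np.mean(predictions == y))


def report(name: str, value: float) -> str:
    """Format one accuracy line with an f-string."""
    return f'{name} accuracy {value:.5f}'
```

With this in place, the evaluation part of the question would shrink to something like print(report('train', accuracy(classifier.predict, x_train, y_train))) and the same call with the test split.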
Bug
You reset the seed on every iteration of the epoch loop (in LogisticRegression.fit). If you pass None this is fine (NumPy then reseeds from fresh OS entropy each time), but if you pass a specific seed the generator produces the same random numbers each time you shuffle. Move the seed call outside of the loop.
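One way to seed only once is sketched below. It is an assumption-laden illustration, not the original code: epoch_orders is a hypothetical helper, and it uses a local np.random.default_rng generator instead of the global np.random.seed from the question. The minimal change to the original would simply be calling np.random.seed(seed) once before the for loop.

```python
import numpy as np


def epoch_orders(n_samples: int, n_epoch: int, seed=None):
    """Yield one shuffled index order per epoch, seeding only once."""
    rng = np.random.default_rng(seed)   # created once, outside the loop
    indices = np.arange(n_samples)
    for _ in range(n_epoch):
        rng.shuffle(indices)            # continues the same random stream
        yield indices.copy()
```

Two runs with the same seed still visit the samples in the same sequence of orders, so the training run stays reproducible, but each epoch now draws new random numbers instead of replaying the first ones.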
Future work
If you want to continue this work, I recommend trying to build a multiclass logistic regression.
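As a rough starting point, the binary update dz = a - y from update_weights generalizes to several classes by replacing the sigmoid with a softmax. The sketch below assumes one weight row per class and integer class labels; softmax and sgd_step are hypothetical names, not part of the original code.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D score vector."""
    exp = np.exp(z - np.max(z))
    return exp / np.sum(exp)


def sgd_step(w: np.ndarray, b: np.ndarray, x: np.ndarray, y: int,
             eta: float = 0.05):
    """One stochastic gradient step for softmax regression.

    w has shape (n_classes, n_features), b has shape (n_classes,).
    Returns updated copies of w and b.
    """
    dz = softmax(w @ x + b)
    dz[y] -= 1.0                     # probabilities minus one-hot label
    return w - eta * np.outer(dz, x), b - eta * dz
```

Prediction would then be np.argmax(softmax(w @ x + b)), and the full Iris data set with all three classes could be used without the label shift.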