I created the following neural network in Python. It uses weights and biases, which should follow standard procedure.
# Define size of the layers, as well as the learning rate alpha and the max error
inputLayerSize = 2
hiddenLayerSize = 3
outputLayerSize = 1
alpha = 0.5
maxError = 0.001
# Import dependencies
import numpy
from sklearn import preprocessing
# Make random numbers predictable
numpy.random.seed(1)
# Define our activation function
# In this case, we use the Sigmoid function
def sigmoid(x):
    output = 1/(1+numpy.exp(-x))
    return output

def sigmoid_derivative(x):
    # Expects the sigmoid *output* as its argument: s'(z) = s(z) * (1 - s(z))
    return x*(1-x)

# Define the cost function
def calculateError(Y, Y_predicted):
    totalError = 0
    for i in range(len(Y)):
        totalError = totalError + numpy.square(Y[i] - Y_predicted[i])
    return totalError
# Set inputs
# Each row is (x1, x2)
X = numpy.array([
    [7, 4.7],
    [6.3, 6],
    [6.9, 4.9],
    [6.4, 5.3],
    [5.8, 5.1],
    [5.5, 4],
    [7.1, 5.9],
    [6.3, 5.6],
    [6.4, 4.5],
    [7.7, 6.7]
])
# Normalize the inputs
#X = preprocessing.scale(X)
# Set goals
# Each row is (y1)
Y = numpy.array([
    [0],
    [1],
    [0],
    [1],
    [1],
    [0],
    [0],
    [1],
    [0],
    [1]
])
# Randomly initialize our weights with mean 0
weights_1 = 2*numpy.random.random((inputLayerSize, hiddenLayerSize)) - 1
weights_2 = 2*numpy.random.random((hiddenLayerSize, outputLayerSize)) - 1
# Randomly initialize our bias with mean 0
bias_1 = 2*numpy.random.random((hiddenLayerSize)) - 1
bias_2 = 2*numpy.random.random((outputLayerSize)) - 1
# Loop up to 100,000 times
for i in range(100000):
    # Feed forward through layers 0, 1, and 2
    layer_0 = X
    layer_1 = sigmoid(numpy.dot(layer_0, weights_1)+bias_1)
    layer_2 = sigmoid(numpy.dot(layer_1, weights_2)+bias_2)
    # Calculate the cost function
    # How much did we miss the target value?
    layer_2_error = layer_2 - Y
    # In what direction is the target value?
    # Were we really sure? If so, don't change too much.
    layer_2_delta = layer_2_error*sigmoid_derivative(layer_2)
    # How much did each layer_1 value contribute to the layer_2 error (according to the weights)?
    layer_1_error = layer_2_delta.dot(weights_2.T)
    # In what direction is the target layer_1?
    # Were we really sure? If so, don't change too much.
    layer_1_delta = layer_1_error * sigmoid_derivative(layer_1)
    # Update the weights
    weights_2 -= alpha * layer_1.T.dot(layer_2_delta)
    weights_1 -= alpha * layer_0.T.dot(layer_1_delta)
    # Update the bias
    bias_2 -= alpha * numpy.sum(layer_2_delta, axis=0)
    bias_1 -= alpha * numpy.sum(layer_1_delta, axis=0)
    # Print the error to show that we are improving
    if (i % 1000) == 0:
        print("Error after "+str(i)+" iterations: " + str(calculateError(Y, layer_2)))
    # Exit if the error is less than maxError
    if calculateError(Y, layer_2) < maxError:
        print("Goal reached after "+str(i)+" iterations: " + str(calculateError(Y, layer_2)) + " is smaller than the goal of " + str(maxError))
        break
# Show results
print("")
print("Weights between Input Layer -> Hidden Layer")
print(weights_1)
print("")
print("Bias of Hidden Layer")
print(bias_1)
print("")
print("Weights between Hidden Layer -> Output Layer")
print(weights_2)
print("")
print("Bias of Output Layer")
print(bias_2)
print("")
print("Computed probabilities for SALE (rounded to 3 decimals)")
print(numpy.around(layer_2, decimals=3))
print("")
print("Real probabilities for SALE")
print(Y)
print("")
print("Final Error")
print(str(calculateError(Y, layer_2)))
Using 32,000 epochs I manage to get, on average, a final error of 0.001. However, compare that to the MLPClassifier (from the scikit-learn package) using the same parameters:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(3,),
    max_iter=32000,
    activation='logistic',
    tol=0.00001,
    verbose=True)
mlp.fit(X, Y.ravel())  # scikit-learn expects a 1-D target array
My result is pretty bad by comparison: the MLPClassifier reaches a final error of 0 when I run it on the same data, after about 10,000 epochs. For both networks I use an input layer size of 2, a hidden layer size of 3, and an output layer size of 1.
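One way to put both errors on the same scale (a sketch; using predict_proba is my own choice here and not necessarily how the MLPClassifier reports its own loss):

# Evaluate both models with the same sum-of-squares cost
# predict_proba returns one column per class; column 1 is P(y = 1)
mlp_probabilities = mlp.predict_proba(X)[:, 1].reshape(-1, 1)
print("MLPClassifier error: " + str(calculateError(Y, mlp_probabilities)))
print("My network error: " + str(calculateError(Y, layer_2)))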
Why does my network need that many more epochs to train? Am I missing an important part?
1 Answer
- Import your dependencies at the top of the file. "Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants." (PEP 8 Style Guide). In your code, the layer sizes and the learning rate are defined before the imports.
- numpy.random.seed doesn't really make random numbers predictable; it actually helps you reproduce them.
- You don't need to explicitly call str in your last print statement.
- According to the PEP 8 Style Guide, variable names are better written in lower_case_separated_with_underscores.
- For better performance, scale your inputs between -1 (or 0) and 1. I see you're commenting that out; see the sketch after this list.
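For instance, here is a minimal sketch using scikit-learn's MinMaxScaler (one option among several; the preprocessing.scale call you commented out standardizes to zero mean and unit variance instead, which also works):

from sklearn import preprocessing

# Rescale each input feature to the [0, 1] range before training
scaler = preprocessing.MinMaxScaler()
X = scaler.fit_transform(X)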
Now let's get to why your network needs more epochs to train:
- You update parameters (weights and biases) after doing the backward pass for a single sample. This is also known as stochastic gradient descent (1 epoch = 1 sample).
- The MLPClassifier from sklearn doesn't update parameters (weights and biases) after doing the backward pass for a single sample. Instead, it computes the average of the parameter updates over a mini-batch of examples. This is also known as mini-batch gradient descent (1 epoch = a mini-batch of samples). For relatively big datasets, by default, the MLPClassifier uses a batch of 200 examples for each epoch. But in your case it uses the whole dataset, because your dataset is smaller than 200. When the whole dataset is used to compute the updates, we call it gradient descent (1 epoch = the whole dataset).
- A single epoch in the sklearn implementation takes your whole dataset into account, so it makes more accurate updates. That is why it reaches the minimum error in fewer epochs.
I suggest you read about the different basic learning algorithms for neural networks and try to implement them using numpy (gradient descent, stochastic gradient descent, and mini-batch gradient descent); a sketch of the mini-batch variant follows.
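As a starting point, here is a minimal sketch of a mini-batch training loop, reusing your variables; batch_size is a free parameter I'm introducing (1 gives stochastic gradient descent, len(X) gives plain gradient descent):

batch_size = 5  # 1 = stochastic GD, len(X) = full-batch GD
for epoch in range(100000):
    # Shuffle the samples so each epoch visits the batches in a new order
    permutation = numpy.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = permutation[start:start + batch_size]
        layer_0 = X[batch]
        layer_1 = sigmoid(numpy.dot(layer_0, weights_1) + bias_1)
        layer_2 = sigmoid(numpy.dot(layer_1, weights_2) + bias_2)
        layer_2_delta = (layer_2 - Y[batch]) * sigmoid_derivative(layer_2)
        layer_1_delta = layer_2_delta.dot(weights_2.T) * sigmoid_derivative(layer_1)
        # Average the updates over the mini-batch
        weights_2 -= alpha * layer_1.T.dot(layer_2_delta) / len(batch)
        weights_1 -= alpha * layer_0.T.dot(layer_1_delta) / len(batch)
        bias_2 -= alpha * numpy.mean(layer_2_delta, axis=0)
        bias_1 -= alpha * numpy.mean(layer_1_delta, axis=0)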
Comment: I also tried replacing bias_1 = 2*numpy.random.random((hiddenLayerSize)) - 1 with bias_1 = numpy.array([1.0, 1.0, 1.0]). Note that the scikit-learn library uses a different bias weight for each node. If I try that (achieved by exchanging bias_1 += alpha * numpy.mean(layer_1_delta) with bias_1 += alpha * numpy.mean(layer_1_delta, axis=0)), I still get bad results.
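To illustrate the shape difference that axis=0 makes (the delta values below are made up for demonstration):

import numpy

layer_1_delta = numpy.array([[0.1, -0.2, 0.3],
                             [0.4, 0.0, -0.1]])
print(numpy.mean(layer_1_delta))          # one scalar: a single shared bias update
print(numpy.mean(layer_1_delta, axis=0))  # shape (3,): one update per hidden node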