I created the following neural network in Python. It uses weights and biases, which should follow standard procedure.
# Define size of the layers, as well as the learning rate alpha and the max error
inputLayerSize = 2
hiddenLayerSize = 3
outputLayerSize = 1
alpha = 0.5
maxError = 0.001
# Import dependencies
import numpy
from sklearn import preprocessing
# Make random numbers predictable
numpy.random.seed(1)
# Define our activation function
# In this case, we use the Sigmoid function
def sigmoid(x):
    output = 1/(1+numpy.exp(-x))
    return output

def sigmoid_derivative(x):
    # Expects the sigmoid *output* as its argument: s'(z) = s(z) * (1 - s(z))
    return x*(1-x)

# Define the cost function
def calculateError(Y, Y_predicted):
    totalError = 0
    for i in range(len(Y)):
        totalError = totalError + numpy.square(Y[i] - Y_predicted[i])
    return totalError
# Set inputs
# Each row is (x1, x2)
X = numpy.array([
    [7, 4.7],
    [6.3, 6],
    [6.9, 4.9],
    [6.4, 5.3],
    [5.8, 5.1],
    [5.5, 4],
    [7.1, 5.9],
    [6.3, 5.6],
    [6.4, 4.5],
    [7.7, 6.7]
])
# Normalize the inputs
#X = preprocessing.scale(X)
# Set goals
# Each row is (y1)
Y = numpy.array([
    [0],
    [1],
    [0],
    [1],
    [1],
    [0],
    [0],
    [1],
    [0],
    [1]
])
# Randomly initialize our weights with mean 0
weights_1 = 2*numpy.random.random((inputLayerSize, hiddenLayerSize)) - 1
weights_2 = 2*numpy.random.random((hiddenLayerSize, outputLayerSize)) - 1
# Randomly initialize our bias with mean 0
bias_1 = 2*numpy.random.random((hiddenLayerSize)) - 1
bias_2 = 2*numpy.random.random((outputLayerSize)) - 1
# Loop up to 100,000 times
for i in range(100000):
    # Feed forward through layers 0, 1, and 2
    layer_0 = X
    layer_1 = sigmoid(numpy.dot(layer_0, weights_1)+bias_1)
    layer_2 = sigmoid(numpy.dot(layer_1, weights_2)+bias_2)
    # Calculate the cost function
    # How much did we miss the target value?
    layer_2_error = layer_2 - Y
    # In what direction is the target value?
    # Were we really sure? If so, don't change too much.
    layer_2_delta = layer_2_error*sigmoid_derivative(layer_2)
    # How much did each layer_1 value contribute to the layer_2 error (according to the weights)?
    layer_1_error = layer_2_delta.dot(weights_2.T)
    # In what direction is the target layer_1?
    # Were we really sure? If so, don't change too much.
    layer_1_delta = layer_1_error * sigmoid_derivative(layer_1)
    # Update the weights
    weights_2 -= alpha * layer_1.T.dot(layer_2_delta)
    weights_1 -= alpha * layer_0.T.dot(layer_1_delta)
    # Update the bias
    bias_2 -= alpha * numpy.sum(layer_2_delta, axis=0)
    bias_1 -= alpha * numpy.sum(layer_1_delta, axis=0)
    # Print the error to show that we are improving
    if (i % 1000) == 0:
        print("Error after "+str(i)+" iterations: " + str(calculateError(Y, layer_2)))
    # Exit if the error is less than maxError
    if calculateError(Y, layer_2) < maxError:
        print("Goal reached after "+str(i)+" iterations: " + str(calculateError(Y, layer_2)) + " is smaller than the goal of " + str(maxError))
        break
# Show results
print("")
print("Weights between Input Layer -> Hidden Layer")
print(weights_1)
print("")
print("Bias of Hidden Layer")
print(bias_1)
print("")
print("Weights between Hidden Layer -> Output Layer")
print(weights_2)
print("")
print("Bias of Output Layer")
print(bias_2)
print("")
print("Computed probabilities for SALE (rounded to 3 decimals)")
print(numpy.around(layer_2, decimals=3))
print("")
print("Real probabilities for SALE")
print(Y)
print("")
print("Final Error")
print(str(calculateError(Y, layer_2)))
Using 32,000 epochs I manage to get, on average, a final error of 0.001. However, compare that to the MLPClassifier (from the scikit-learn package) using the same parameters:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(3,),
    max_iter=32000,
    activation='logistic',
    tol=0.00001,
    verbose=True)
mlp.fit(X, Y.ravel())  # scikit-learn expects a 1-D target array
My result is pretty bad by comparison: the MLPClassifier reaches a final error of 0 when I run it on the same data, after about 10,000 epochs. For both networks I use an input layer size of 2, a hidden layer size of 3, and an output layer size of 1.
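One way to put both errors on the same scale (a sketch; using predict_proba is my own choice here and not necessarily how the MLPClassifier reports its own loss):

# Evaluate both models with the same sum-of-squares cost
# predict_proba returns one column per class; column 1 is P(y = 1)
mlp_probabilities = mlp.predict_proba(X)[:, 1].reshape(-1, 1)
print("MLPClassifier error: " + str(calculateError(Y, mlp_probabilities)))
print("My network error: " + str(calculateError(Y, layer_2)))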
Why does my network need that many more epochs to train? Am I missing an important part?
1 Answer
- Import your dependencies at the top of the file. "Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants." (PEP 8 Style Guide). In your code, the layer sizes and the learning rate are defined before the imports.
- numpy.random.seed doesn't really make random numbers predictable; it actually helps you reproduce them.
- You don't need to explicitly call str in your last print statement.
- According to the PEP 8 Style Guide, variable names are better written in lower_case_separated_with_underscores.
- For better performance, scale your inputs between -1 (or 0) and 1. I see you're commenting that out; see the sketch after this list.
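For instance, here is a minimal sketch using scikit-learn's MinMaxScaler (one option among several; the preprocessing.scale call you commented out standardizes to zero mean and unit variance instead, which also works):

from sklearn import preprocessing

# Rescale each input feature to the [0, 1] range before training
scaler = preprocessing.MinMaxScaler()
X = scaler.fit_transform(X)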
Now let's get to why your network needs more epochs to train:
- You update parameters (weights and biases) after doing the backward pass for a single sample. This is also known as stochastic gradient descent (1 epoch = 1 sample).
- The MLPClassifier from sklearn doesn't update parameters (weights and biases) after doing the backward pass for a single sample. Instead, it computes the average of the parameter updates over a mini-batch of examples. This is also known as mini-batch gradient descent (1 epoch = a mini-batch of samples). For relatively big datasets, by default, the MLPClassifier uses a batch of 200 examples for each epoch. But in your case it uses the whole dataset, because your dataset is smaller than 200. When the whole dataset is used to compute the updates, we call it gradient descent (1 epoch = the whole dataset).
- A single epoch in the sklearn implementation takes your whole dataset into account, so it makes more accurate updates. That is why it reaches the minimum error in fewer epochs.
I suggest you read about the different basic learning algorithms for neural networks and try to implement them using numpy (gradient descent, stochastic gradient descent, and mini-batch gradient descent); a sketch of the mini-batch variant follows.
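As a starting point, here is a minimal sketch of a mini-batch training loop, reusing your variables; batch_size is a free parameter I'm introducing (1 gives stochastic gradient descent, len(X) gives plain gradient descent):

batch_size = 5  # 1 = stochastic GD, len(X) = full-batch GD
for epoch in range(100000):
    # Shuffle the samples so each epoch visits the batches in a new order
    permutation = numpy.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = permutation[start:start + batch_size]
        layer_0 = X[batch]
        layer_1 = sigmoid(numpy.dot(layer_0, weights_1) + bias_1)
        layer_2 = sigmoid(numpy.dot(layer_1, weights_2) + bias_2)
        layer_2_delta = (layer_2 - Y[batch]) * sigmoid_derivative(layer_2)
        layer_1_delta = layer_2_delta.dot(weights_2.T) * sigmoid_derivative(layer_1)
        # Average the updates over the mini-batch
        weights_2 -= alpha * layer_1.T.dot(layer_2_delta) / len(batch)
        weights_1 -= alpha * layer_0.T.dot(layer_1_delta) / len(batch)
        bias_2 -= alpha * numpy.mean(layer_2_delta, axis=0)
        bias_1 -= alpha * numpy.mean(layer_1_delta, axis=0)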
Comment: I also tried replacing bias_1 = 2*numpy.random.random((hiddenLayerSize)) - 1 with bias_1 = numpy.array([1.0, 1.0, 1.0]). Note that the scikit-learn library uses a different bias weight for each node. If I try that (achieved by exchanging bias_1 += alpha * numpy.mean(layer_1_delta) with bias_1 += alpha * numpy.mean(layer_1_delta, axis=0)), I still get bad results.
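To illustrate the shape difference that axis=0 makes (the delta values below are made up for demonstration):

import numpy

layer_1_delta = numpy.array([[0.1, -0.2, 0.3],
                             [0.4, 0.0, -0.1]])
print(numpy.mean(layer_1_delta))          # one scalar: a single shared bias update
print(numpy.mean(layer_1_delta, axis=0))  # shape (3,): one update per hidden node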