I am working with the following code:
import numpy as np

def sigmoid(x):
    return 1.0/(1.0 + np.exp(-x))

def sigmoid_prime(x):
    return sigmoid(x)*(1.0 - sigmoid(x))

def tanh(x):
    return np.tanh(x)

def tanh_prime(x):
    return 1.0 - x**2

class NeuralNetwork:

    def __init__(self, layers, activation='tanh'):
        if activation == 'sigmoid':
            self.activation = sigmoid
            self.activation_prime = sigmoid_prime
        elif activation == 'tanh':
            self.activation = tanh
            self.activation_prime = tanh_prime

        # Set weights
        self.weights = []
        # layers = [2,2,1]
        # range of weight values (-1,1)
        # input and hidden layers - random((2+1, 2+1)) : 3 x 3
        for i in range(1, len(layers) - 1):
            r = 2*np.random.random((layers[i-1] + 1, layers[i] + 1)) - 1
            self.weights.append(r)
        # output layer - random((2+1, 1)) : 3 x 1
        r = 2*np.random.random((layers[i] + 1, layers[i+1])) - 1
        self.weights.append(r)

    def fit(self, X, y, learning_rate=0.2, epochs=100000):
        # Add column of ones to X
        # This is to add the bias unit to the input layer
        ones = np.atleast_2d(np.ones(X.shape[0]))
        X = np.concatenate((ones.T, X), axis=1)

        for k in range(epochs):
            if k % 10000 == 0: print('epochs:', k)

            i = np.random.randint(X.shape[0])
            a = [X[i]]

            for l in range(len(self.weights)):
                dot_value = np.dot(a[l], self.weights[l])
                activation = self.activation(dot_value)
                a.append(activation)

            # output layer
            error = y[i] - a[-1]
            deltas = [error * self.activation_prime(a[-1])]

            # we need to begin at the second to last layer
            # (a layer before the output layer)
            for l in range(len(a) - 2, 0, -1):
                deltas.append(deltas[-1].dot(self.weights[l].T)*self.activation_prime(a[l]))

            # reverse
            # [level3(output)->level2(hidden)] => [level2(hidden)->level3(output)]
            deltas.reverse()

            # backpropagation
            # 1. Multiply its output delta and input activation
            #    to get the gradient of the weight.
            # 2. Subtract a ratio (percentage) of the gradient from the weight.
            for i in range(len(self.weights)):
                layer = np.atleast_2d(a[i])
                delta = np.atleast_2d(deltas[i])
                self.weights[i] += learning_rate * layer.T.dot(delta)

    def predict(self, x):
        a = np.concatenate((np.ones(1).T, np.array(x)), axis=0)
        for l in range(0, len(self.weights)):
            a = self.activation(np.dot(a, self.weights[l]))
        return a

if __name__ == '__main__':
    nn = NeuralNetwork([2,2,1])
    X = np.array([[0, 0],
                  [0, 1],
                  [1, 0],
                  [1, 1]])
    y = np.array([0, 1, 1, 0])
    nn.fit(X, y)
    for e in X:
        print(e, nn.predict(e))
While this converges well and fast when using tanh, it converges much more slowly when using the sigmoid (i.e. changing activation='tanh' to activation='sigmoid' in __init__). I cannot figure out why that is. How do I improve the implementation for the sigmoid?
Comment (mdfst13, May 4, 2016): Are you interested in reviews on aspects of the code unrelated to the sigmoid implementation?
Comment (OP, May 4, 2016): Of course I am!
1 Answer
The reasons for the speed discrepancy
The reason for the difference in timing is that evaluating sigmoid_prime() takes far longer than tanh_prime(). You can see this if you use a line profiler such as the line_profiler module.

Is tanh_prime() supposed to be the derivative of tanh()? If so, you might want to double-check your formula. The derivative of tanh(x) is 1. - tanh(x)**2, not 1. - x**2.
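As a quick sanity check (a sketch only; the sample point x = 0.5 is arbitrary), a central finite difference agrees with 1 - tanh(x)**2 and not with 1 - x**2:

x = 0.5
h = 1e-6
numeric = (np.tanh(x + h) - np.tanh(x - h)) / (2*h)  # central difference
print(numeric)              # ~0.786448
print(1.0 - np.tanh(x)**2)  # ~0.786448 -- matches
print(1.0 - x**2)           # 0.75      -- does not match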
In fact, if you use the actual definition of the derivative of tanh(), the timings become much more similar:

def tanh_prime_alt(x):
    return 1 - tanh(x)**2

foo = np.random.rand(10000)
%timeit -n 100 tanh_prime(foo)
%timeit -n 100 tanh_prime_alt(foo)
%timeit -n 100 sigmoid_prime(foo)

100 loops, best of 3: 10.2 μs per loop
100 loops, best of 3: 116 μs per loop
100 loops, best of 3: 279 μs per loop
So with this alternate tanh_prime(), the sigmoid method is now only 2× slower, not 20× slower. I should emphasize that (a) I don't know enough about neural networks to know if 1. - x**2 is an appropriate expression or approximation to the actual derivative of tanh(), but if it is in fact OK, then (b) the reason that activation = 'tanh' is so much faster is because of this approximation/error.

The remaining 2× difference is because in your factored expression of sigmoid_prime(), you are needlessly evaluating sigmoid() twice. I'd instead do this:

def sigmoid_prime_alt(x):
    sig_x = sigmoid(x)
    return sig_x - sig_x**2
As expected, this speeds things up two-fold relative to your original definition.
foo = np.random.rand(10000)
%timeit -n 100 sigmoid_prime(foo)
%timeit -n 100 sigmoid_prime_alt(foo)

100 loops, best of 3: 248 μs per loop
100 loops, best of 3: 132 μs per loop
Since the sigmoid() function and the tanh() function are related by tanh(x) = 2*sigmoid(2*x) - 1, i.e. sigmoid(x) = (1 + tanh(x/2.))/2, then if you are OK with the weird 1 - x**2 approximation for tanh_prime(), you should be able to work out a similar approximation for sigmoid_prime().
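For what it's worth, here is a minimal sketch of what such an "in terms of the output" derivative could look like for the sigmoid (the name sigmoid_prime_from_output is mine, not part of the original code). Just as 1 - x**2 computes the tanh derivative from an already-computed tanh value, sigmoid'(x) = sigmoid(x)*(1 - sigmoid(x)) can be computed directly from an already-computed sigmoid activation:

def sigmoid_prime_from_output(s):
    # s is assumed to already be a sigmoid *output*, i.e. s = sigmoid(x)
    return s * (1.0 - s)

That avoids re-evaluating the exponential entirely, which is where most of the cost of sigmoid_prime() comes from.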
You might be interested in the autograd module, which provides a generalized capability to automatically compute derivatives of most NumPy code.
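For example, a minimal sketch (assuming the autograd package is installed; elementwise_grad is part of its public API):

import autograd.numpy as anp          # autograd's drop-in wrapper around NumPy
from autograd import elementwise_grad

def sigmoid(x):
    return 1.0 / (1.0 + anp.exp(-x))

# Build the element-wise derivative of sigmoid automatically
sigmoid_prime_auto = elementwise_grad(sigmoid)

x = anp.linspace(-3.0, 3.0, 5)
print(sigmoid_prime_auto(x))          # matches sigmoid(x) * (1 - sigmoid(x))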
Other comments
These comments aren't a thorough review, but just some things I noticed.
- Why are your weights Python lists instead of NumPy arrays? If you're already using NumPy, you might as well use it wherever you can.
- You probably don't need the for l in range(len(self.weights)): loop, do you? Can't you use NumPy array slicing and the matrix capabilities of np.dot() to replace this loop?
- If you are going to loop, you don't need to do for l in range(len(self.weights)) and then reference self.weights[l]. You can do for weight in self.weights: and then reference weight in your loop code (see the sketch after this list).
- Write some docstrings for your functions, please!
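For instance, predict() could be rewritten like this (a sketch only; the behaviour is unchanged):

def predict(self, x):
    # Prepend the bias unit, then feed forward through each weight matrix
    a = np.concatenate((np.ones(1), np.array(x)))
    for weight in self.weights:
        a = self.activation(np.dot(a, weight))
    return a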