Parallelizing SGD is also somewhat non-trivial, as SGD is a sequential algorithm. For example, imagine that you are standing on the top of a hill and are trying to find the quickest way down (without being able to see it). The only way to make your way down the hill is by actually taking steps downwards and re-evaluating your options at each step. Even if there were multiple clones of you, all your clones would be standing at the same spot on the hill, so the information gain would be limited. Nevertheless, there are techniques for parallel SGD, and if you are determined to pursue that, I would suggest you read some papers on the topic:
Cheng 2017: Weighted parallel SGD for distributed unbalanced-workload training system (https://arxiv.org/abs/1708.04801)
Zinkevich 2010: Parallelized Stochastic Gradient Descent (http://martin.zinkevich.org/publications/nips2010.pdf)
Or for a higher level overview: http://blog.smola.org/post/977927287/parallel-stochastic-gradient-descent
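The Zinkevich paper boils down to a surprisingly simple recipe: give each worker its own shard of the data, let each one run plain SGD independently, and average the resulting parameters at the end. Below is a minimal sketch of that idea, using a linear model with squared loss as a stand-in for your network; sgd_on_shard and parallel_sgd are illustrative names, not anything from your code.

import numpy as np

# Rough sketch of the parameter-averaging scheme from Zinkevich 2010:
# each worker runs ordinary SGD on its own shard, then the solutions
# are averaged. The linear model here is just a placeholder.
def sgd_on_shard(X, y, w0, lr=0.01, epochs=5):
    w = w0.copy()
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            w -= lr * 2 * (xi.dot(w) - yi) * xi   # gradient of (x.w - y)^2
    return w

def parallel_sgd(X, y, n_workers=4, lr=0.01, epochs=5):
    w0 = np.zeros(X.shape[1])
    shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    # In a real implementation each of these calls would live in its own
    # process (e.g. via multiprocessing.Pool); here they run sequentially.
    solutions = [sgd_on_shard(Xs, ys, w0, lr, epochs) for Xs, ys in shards]
    return np.mean(solutions, axis=0)             # one-shot averaging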
The reason your epochs are so slow is that you are iterating over each example in the batch and calculating the gradients in a for loop. The key to speeding this up is realizing that you are performing the same operations on every example in the batch, so you can stack the examples into a matrix and calculate the gradients for all of them with a single set of matrix operations.
Let's break that down. For a single example, you start with a feature vector of size (1000,), reshaped into a (1000,1) column. That column is linearly transformed by multiplying it by a weight matrix of size (900,1000), resulting in a (900,1) vector, to which a bias vector of size (900,1) is added. This is then non-linearly transformed, which does not affect the dimensions, to give the first hidden layer of size (900,1).
This is 900 hidden nodes for the first example.
However, since we are performing the same operations on every example in the batch, we can stack the 100 examples to form a matrix of size (100,1000) instead of (1,1000), then take the dot product of this input matrix with the transpose of the weight matrix, (1000,900), for a resulting matrix of size (100,900). Add the bias (1,900), which numpy automatically broadcasts to a matrix of size (100,900) (it is the same bias vector stacked 100 times), and apply the non-linear transform for a final matrix of size (100,900). This is 900 hidden nodes each for 100 examples.
This can be applied to each hidden layer in the network.
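To make those dimensions concrete, here is a small self-contained check (the arrays are random placeholders and sigma is assumed to be a sigmoid) showing that stacking the 100 examples and doing one matrix multiply gives the same activations as looping over them one at a time:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 1000))   # 100 examples stacked into one matrix
W = rng.standard_normal((900, 1000))   # first-layer weights
b = rng.standard_normal((900, 1))      # first-layer bias

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))    # assuming a sigmoid activation

# One example at a time: (900,1000) . (1000,1) + (900,1) -> (900,1)
a_loop = np.hstack([sigma(W.dot(X[i:i+1].T) + b) for i in range(100)]).T

# Whole batch at once: (100,1000) . (1000,900) + (1,900) -> (100,900)
a_batch = sigma(X.dot(W.T) + b.T)      # b.T is broadcast across the 100 rows

print(a_batch.shape)                   # (100, 900)
print(np.allclose(a_loop, a_batch))    # True: same numbers, one matrix op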
Refactoring the original code:
for i in batch:
    # Feedforward
    a[0] = array([I[i]]).T
    for l in range(nn_size-1):
        z[l] = dot(W[l], a[l]) + b[l]
        a[l+1] = sigma(z[l])
    # Backpropagation
    delta = (a[nn_size-1] - array([y[i]]).T) * sigma_prime(z[nn_size-2])
    dW[nn_size-2] += dot(delta, a[nn_size-2].T)
    db[nn_size-2] += delta
    for l in reversed(range(nn_size-2)):
        delta = dot(W[l+1].T, delta) * sigma_prime(z[l])
        dW[l] += dot(delta, a[l].T)
        db[l] += delta
into matrix math form:
a[0] = I[batch]
# Feedforward over the whole batch at once
for l in range(nn_size-1):
    z[l] = a[l].dot(W[l].T) + b[l].T   # (batch, n_in) . (n_in, n_out) + (1, n_out)
    a[l+1] = sigma(z[l])
# Backpropagation over the whole batch at once
delta = (a[nn_size-1] - y[batch]) * sigma_prime(z[nn_size-2])
dW[nn_size-2] += delta.T.dot(a[nn_size-2])               # sums the per-example gradients
db[nn_size-2] += np.sum(delta.T, axis=1, keepdims=True)
for l in reversed(range(nn_size-2)):
    delta = delta.dot(W[l+1]) * sigma_prime(z[l])
    dW[l] += a[l].T.dot(delta).T
    db[l] += np.sum(delta.T, axis=1, keepdims=True)
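If you want to convince yourself that the batched line dW += delta.T.dot(a) accumulates the same thing as the per-example dot(delta, a.T), here is a quick sanity check of that step alone (the shapes are made up for illustration):

import numpy as np

rng = np.random.default_rng(1)
delta = rng.standard_normal((100, 10))    # batched deltas, one row per example
a     = rng.standard_normal((100, 900))   # batched activations of the previous layer

dW_loop = np.zeros((10, 900))
for i in range(100):
    dW_loop += np.dot(delta[i:i+1].T, a[i:i+1])   # one outer product per example

dW_batch = delta.T.dot(a)                          # a single matrix multiply
print(np.allclose(dW_loop, dW_batch))              # True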
When calculating gradients over a batch, either summing or averaging the per-example gradients works, but Andrew Ng suggests using the average over the batch, as explained in his course and here: https://stats.stackexchange.com/questions/183840/sum-or-average-of-gradients-in-mini-batch-gradient-decent
In this case, since you divide the gradients by the batch size anyway, you can just sum the gradients over the batch; the division turns the sum into an average.
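For completeness, this is roughly how the vectorized block could slot into the outer training loop. Here batch_grads is a hypothetical wrapper around that block (not something in your code) that returns the summed dW and db for one mini-batch:

import numpy as np

def train(I, y, W, b, batch_grads, lr=0.1, batch_size=100, epochs=10):
    # batch_grads(I_batch, y_batch, W, b) -> (dW, db), summed over the batch
    n = I.shape[0]
    for epoch in range(epochs):
        order = np.random.permutation(n)             # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            dW, db = batch_grads(I[batch], y[batch], W, b)
            for l in range(len(W)):
                W[l] -= lr * dW[l] / len(batch)      # dividing by the batch size
                b[l] -= lr * db[l] / len(batch)      # turns the sum into an average
    return W, b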
With the original for-loop implementation over 10 epochs, each epoch takes anywhere between 1.75 and 2.25s, with an average of 1.91s per epoch.
With the matrix implementation over 100 epochs, each epoch takes between 0.06 and 0.25s, with an average of 0.08s per epoch.