Readable backpropagation calculations in a NumPy neural network

As an exercise we should write a small neural network with the following structure: [network diagram: an input layer, two hidden layers, and an output layer]

Additionally, there should be a bias for each layer, and sigmoid should be used as the activation function.

The relevant function is backward, which implements backpropagation. It receives the results saved from the forward pass (X is the input, hid_1 / hid_2 are the outputs of the hidden layers after applying the sigmoid, and predictions is the output), the targets, and the parameters (weights and bias for each layer). It should return the gradient of the loss with respect to each parameter. I included the rest of the code as well in case it is needed.
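For reference, these are the shape conventions the code uses, read off directly from initialize and forward (columns are samples):

# X:            (input_dim,   batch_size)
# hid_1:        (hidden1_dim, batch_size)
# hid_2:        (hidden2_dim, batch_size)
# predictions:  (output_dim,  batch_size)
# W1, W2, W3:   (layer_output_dim, layer_input_dim)
# b1, b2, b3:   (layer_output_dim,)
# Each gradient returned by backward has the same shape as its parameter.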

import numpy as np

def initialize(input_dim, hidden1_dim, hidden2_dim, output_dim, batch_size):
 W1 = np.random.randn(hidden1_dim, input_dim) * 0.01
 b1 = np.zeros((hidden1_dim,))
 W2 = np.random.randn(hidden2_dim, hidden1_dim) * 0.01
 b2 = np.zeros((hidden2_dim,))
 W3 = np.random.randn(output_dim, hidden2_dim) * 0.01
 b3 = np.zeros((output_dim,))
 parameters = [W1, b1, W2, b2, W3, b3]
 x = np.random.rand(input_dim, batch_size)
 y = np.random.randn(output_dim, batch_size)
 return parameters, x, y
def sigmoid(x):
 return 1 / (1 + np.exp(-x))
def squared_loss(predictions, targets):
 return np.mean(0.5 * np.linalg.norm(predictions - targets, axis=0, ord=2)**2, axis=0)
def deriv_squared_loss(predictions, targets):
 return (predictions - targets) / targets.shape[-1]
def forward(parameters, X):
 W1, b1, W2, b2, W3, b3 = parameters
 hid_1 = sigmoid(W1 @ X + b1[:, np.newaxis])
 hid_2 = sigmoid(W2 @ hid_1 + b2[:, np.newaxis])
 outputs = W3 @ hid_2 + b3[:, np.newaxis]
 return [X, hid_1, hid_2, outputs]
def backward(activations, targets, parameters):
 X, hid_1, hid_2, predictions = activations
 W1, b1, W2, b2, W3, b3 = parameters
 batch_size = X.shape[-1]
 # ∂L/∂prediction
 dL_dPredictions = deriv_squared_loss(predictions, targets)
 # ∂L/∂b3 = ∂L/∂prediction * ∂prediction/∂b3 = ∂L/∂prediction * 1
 dL_db3 = np.dot(dL_dPredictions, np.ones((batch_size,)))
 # ∂L/∂W3 = ∂L/∂prediction * ∂prediction/∂W3 = ∂L/∂prediction * hid_2
 dL_dW3 = np.dot(dL_dPredictions, hid_2.T)
 # ∂L/∂hid_2 = ∂L/∂prediction * ∂prediction/∂hid_2 = ∂L/∂prediction * ∂prediction/∂sig(W3*X + B3) = ∂L/∂prediction * W3 * sig(W3*X + B3) * (1 - sig(W3*X + B3)) = ∂L/∂prediction * W3 * hid_2 * (1 - hid_2)
 dL_hid_2 = np.dot(W3.T, dL_dPredictions) * hid_2 * (1 - hid_2)
 # ∂L/∂b2 = ∂L/∂hid_2 * ∂hid_2/∂b2 = ∂L/∂hid_2 * 1
 dL_db2 = np.dot(dL_hid_2, np.ones((batch_size,)))
 # ∂L/∂W2 = ∂L/∂hid_2 * ∂hid_2/∂W2 = ∂L/∂hid_2 * hid_1
 dL_dW2 = np.dot(dL_hid_2, hid_1.T)
 dL_hid_1 = np.dot(W2.T, dL_hid_2) * hid_1 * (1 - hid_1)
 dL_db1 = np.dot(dL_hid_1, np.ones((batch_size,)))
 dL_dW1 = np.dot(dL_hid_1, X.T)
 return [dL_dW1, dL_db1, dL_dW2, dL_db2, dL_dW3, dL_db3]
parameters, X, Y = initialize(input_dim=3, hidden1_dim=4, hidden2_dim=4, output_dim=2, batch_size=5)
activations = forward(parameters, X)
grads = backward(activations, Y, parameters)

The main problem is that backpropagation is quite easy in the one-dimensional case: you can look at each node locally, compute a local gradient, and multiply it with the global gradient accumulated from the previous node. I put these formulas as comments in my code above. In the multidimensional case it gets more involved: because matrix multiplication is not commutative, you sometimes need to switch the order of the factors or transpose a matrix, so the code no longer follows those simple formulas. You can work the correct expressions out with the multivariate chain rule, and if you are good you may know the derivatives by heart, but I don't think they are obvious. One example is np.dot(W3.T, dL_dPredictions), where the order is switched and the matrix is transposed compared to the one-dimensional case.
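To make the transpose issue concrete, here is a tiny standalone sketch (not part of the exercise code; the names are made up for illustration). For z = W @ h the gradient with respect to h has to be W.T @ dL_dz: the shapes only line up that way, and it agrees with a numerical derivative.

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))       # weight matrix, shape (out, in)
h = rng.standard_normal(3)            # input vector
c = rng.standard_normal(2)            # stands in for the incoming gradient dL/dz

# For L(h) = c @ (W @ h), the matrix chain rule gives dL/dh = W.T @ c:
# transposed and with the order switched compared to the scalar case.
grad_analytic = W.T @ c

# Central-difference approximation of the same gradient.
eps = 1e-6
grad_numeric = np.array([
    (c @ (W @ (h + eps * e)) - c @ (W @ (h - eps * e))) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(grad_analytic, grad_numeric))   # expected: True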

So the question is: how can I improve the code so that it is easier to understand and verify? If I look at my own code five minutes later, I have no clue why I was doing exactly these calculations and have to do the math again. I suppose there is no way to understand the code with zero knowledge of the underlying math, but I hope it can be made readable for someone who is generally familiar with the math yet doesn't know every single formula by heart (i.e. me, five minutes later).
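For what it's worth, the most reliable way I have found to at least verify the gradients is a brute-force finite-difference check against squared_loss. This is only a sanity-check sketch that assumes the functions and variables defined above (the helper name numerical_grads is made up); it is not part of the exercise:

def numerical_grads(parameters, X, Y, eps=1e-6):
    # Finite-difference gradient of the loss with respect to every parameter entry.
    num_grads = []
    for p in parameters:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            original = p[idx]
            p[idx] = original + eps
            loss_plus = squared_loss(forward(parameters, X)[-1], Y)
            p[idx] = original - eps
            loss_minus = squared_loss(forward(parameters, X)[-1], Y)
            p[idx] = original                                  # restore the entry
            g[idx] = (loss_plus - loss_minus) / (2 * eps)
        num_grads.append(g)
    return num_grads

# Compare with the analytic gradients from backward; each line should print True.
for analytic, numeric in zip(grads, numerical_grads(parameters, X, Y)):
    print(np.allclose(analytic, numeric, atol=1e-6))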
