How to multiply matrices in the result of using chain rule in the backpropagation algorithm

Question:

I am trying to understand how backpropagation works mathematically, and I want to implement it in Python with numpy. I use a feedforward neural network with one hidden layer for my calculations, sigmoid as the activation function, and mean squared error as the error function. This is a screenshot of the result of my calculations: Screenshot. The problem is that there is a bunch of matrices, and I cannot multiply them out completely because they don't have the same dimensions.
(In the screenshot, L is the output layer, L-1 is the hidden layer, L-2 is the input layer, W is a weight, E is the error function, and lowercase a denotes the activations.)

(In the code, the first layer has 28*28 nodes [because I am using the MNIST database of 0-9 digits as training data], the hidden layer has 15 nodes, and the output layer has 10 nodes.)

# ho stands for hidden_output
# ih stands for input_hidden

def train(self, input_, target):
    self.input_ = input_
    self.output = self.feedforward(self.input_)

    # Derivative of error with respect to weight between output layer and hidden layer
    delta_ho = (self.output - target) * sigmoid(np.dot(self.weights_ho, self.hidden), True) * self.hidden

    # Derivative of error with respect to weight between input layer and hidden layer
    delta_ih = (self.output - target) * sigmoid(np.dot(self.weights_ho, self.hidden), True) * self.weights_ho * sigmoid(np.dot(self.weights_ih, self.input_), True) * self.input_

    # Adjust weights
    self.weights_ho -= delta_ho
    self.weights_ih -= delta_ih

At the delta_ho = ... line, the dimensions of the matrices are (10×1 – 10×1) * (10×1) * (1×15), so how do I compute this? Thanks for any help!

Asked By: Maxim Lopin


Answers:

Here is a note from Stanford's CS231n: http://cs231n.github.io/optimization-2/.

For back-propagation with matrices/vectors, one thing to remember is that the gradient w.r.t. (with respect to) a variable (matrix or vector) always has the same shape as that variable.

For example, suppose the loss is l and the computation of the loss contains a matrix multiplication C = A.dot(B), where A has shape (m, n) and B has shape (n, p) (hence C has shape (m, p)). The gradient of the loss w.r.t. C is dC, which also has shape (m, p). To build something with the same shape as A out of dC and B, the only option is dC.dot(B.T), the product of matrices of shape (m, p) and (p, n); this gives dA, the gradient of the loss w.r.t. A. Similarly, the gradient of the loss w.r.t. B is dB = A.T.dot(dC).
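As a concrete check of that shape rule, here is a small numpy sketch (the shapes and the placeholder upstream gradient dC are my own choices for illustration):

import numpy as np

# Arbitrary shapes for illustration.
m, n, p = 4, 3, 2
A = np.random.randn(m, n)
B = np.random.randn(n, p)
C = A.dot(B)             # shape (m, p)

dC = np.ones_like(C)     # placeholder upstream gradient, shape (m, p)
dA = dC.dot(B.T)         # (m, p) @ (p, n) -> (m, n), same shape as A
dB = A.T.dot(dC)         # (n, m) @ (m, p) -> (n, p), same shape as B

assert dA.shape == A.shape and dB.shape == B.shape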

For any additional operation such as the sigmoid, you chain the gradients backwards in the same way, multiplying elementwise by the local derivative at each step.
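Applied to the network in the question, a minimal sketch could look like the following (the column-vector convention, variable names, and random initialisation are my own assumptions, not the asker's actual code); note that each gradient ends up with the same shape as the weight matrix it belongs to:

import numpy as np

def sigmoid(z, derivative=False):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s) if derivative else s

# Hypothetical shapes mirroring the question: 784 inputs, 15 hidden, 10 outputs.
x      = np.random.randn(784, 1)       # input column vector
target = np.random.randn(10, 1)        # placeholder target
W_ih   = np.random.randn(15, 784)      # input  -> hidden weights
W_ho   = np.random.randn(10, 15)       # hidden -> output weights

# Forward pass
z_h = W_ih.dot(x)                      # (15, 1)
a_h = sigmoid(z_h)                     # (15, 1)
z_o = W_ho.dot(a_h)                    # (10, 1)
a_o = sigmoid(z_o)                     # (10, 1)

# Backward pass: elementwise products where the chain rule is elementwise,
# matrix products with a transpose where a weight matrix was involved.
delta_o = (a_o - target) * sigmoid(z_o, True)       # (10, 1)
dW_ho   = delta_o.dot(a_h.T)                        # (10, 1) @ (1, 15) -> (10, 15), same shape as W_ho

delta_h = W_ho.T.dot(delta_o) * sigmoid(z_h, True)  # (15, 10) @ (10, 1) -> (15, 1)
dW_ih   = delta_h.dot(x.T)                          # (15, 1) @ (1, 784) -> (15, 784), same shape as W_ih

assert dW_ho.shape == W_ho.shape and dW_ih.shape == W_ih.shape

A gradient-descent update would then be W_ho -= learning_rate * dW_ho and W_ih -= learning_rate * dW_ih for some chosen learning rate.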

Answered By: Kevin He