How could the softmax layer be used as one of the middle layers in a neural network and be backpropagated properly?

Question:

I am currently writing my own machine learning library in C++ as an exercise to improve my coding skills and deepen my understanding of machine learning. Right now I am implementing a vision transformer from scratch to better understand transformers and how they can be applied to images. Part of the code that I am trying to re-create the backward pass for is below:

energy = torch.einsum('bhqd, bhkd -> bhqk', queries, keys) # batch, num_heads, query_len, key_len
            
scaling = self.emb_size ** (1/2)
att = F.softmax(energy, dim=-1) / scaling

Here queries and keys are 4-dimensional tensors, the softmax is applied along the last dimension, and energy is a batched matrix multiplication applied to each individual matrix pair. To recreate the backpropagation I am trying to work it out in terms of smaller 2D matrices.

I am going to define q as a 3x2 matrix and k as a 2x3 matrix. Multiplying them gives a 3x3 matrix a, and applying the softmax keeps the dimensions the same.
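For concreteness, this is the 2D forward pass I am trying to differentiate (a rough numpy sketch of the shapes above):

import numpy as np

q = np.random.rand(3, 2)                       # queries
k = np.random.rand(2, 3)                       # keys
a = q @ k                                      # 3x3 "energy" matrix

# row-wise softmax, matching dim=-1 in the torch code above
e = np.exp(a - a.max(axis=-1, keepdims=True))
s = e / e.sum(axis=-1, keepdims=True)          # still 3x3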

What I am having trouble with is finding the gradients of q and k with respect to the loss. This is how I would find the gradient of a, or in this case dLdA:

for i in range(len(a)):
    for j in range(len(a)):
        if i == j:
            dLdA[i,j] = a[i] * (1 - a[i])
        else:
            dLdA[i,j] = -a[i] * a[j]

I put it in terms of Python for readability. From there I would get a 9x9 matrix for dLdA (treating a as a flattened vector of 9 elements). From there I need dLdQ and dLdK, where dLdQ = dLdA * dAdQ and dLdK = dLdA * dAdK. If I then compute the Jacobian dAdK, from what I understand I would get this:

dAdK =
[dA1/dK1 dA1/dK2 ... dA1/dK6]
[  ...     ...   ...   ...  ]
[dA9/dK1 dA9/dK2 ... dA9/dK6]

This is because there are 9 elements in a and 6 elements in k, so dAdK is a 9x6 matrix. Multiplying the 9x9 dLdA by the 9x6 dAdK gives a 9x6 matrix, but k itself is a 2x3 matrix. What am I missing to be able to backpropagate this correctly?
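In numpy terms, the shape bookkeeping I am describing is roughly this (placeholder arrays, just to show the shapes):

import numpy as np

q = np.random.rand(3, 2)
k = np.random.rand(2, 3)

dLdA = np.zeros((9, 9))      # my 9x9 matrix from the loop above
dAdK = np.zeros((9, 6))      # 9 elements of a by 6 elements of k
dLdK = dLdA @ dAdK

print(dLdK.shape, k.shape)   # (9, 6) vs. (2, 3)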

Asked By: Sam Moldenha


Answers:

I am basing all of the math in this answer on this post; if you want an in-depth explanation of why the math works, I recommend reading it. Below is a simplified numpy implementation showing how it works:

import numpy as np

class Linear:
    def __init__(self, in_rows, out_rows):
        self.Weight = np.random.rand(out_rows, in_rows)
        self.Bias = np.random.rand(out_rows, 1)
        self.prev = None   # cached input for the backward pass
        self.LR = 0.01

    def forward(self, x):
        self.prev = x
        a = np.dot(self.Weight, x)
        a += self.Bias
        return a

    def backward(self, dx):
        # bias gradient: sum the upstream gradient over the last (batch) axis
        db = np.sum(dx, axis=len(dx.shape) - 1)
        db = db.reshape(self.Bias.shape)
        self.Bias -= db * self.LR
        # weight gradient and gradient passed back to the previous layer
        dw = np.dot(dx, self.prev.T)
        dl = np.dot(self.Weight.T, dx)
        self.Weight -= dw * self.LR
        return dl

class Softmax:
    def forward(self, x):
        # subtract the max along axis 1 for numerical stability
        mx = np.max(x, axis=1, keepdims=True)
        e = np.exp(x - mx)
        probs = e / np.sum(e, axis=1, keepdims=True)
        return probs

    def jacobian(self, a):
        # 4D tensor of softmax Jacobian terms:
        # a[i,j] * (1 - a[i,j]) on the diagonal, -a[i,j] * a[r,s] everywhere else
        rows, cols = a.shape
        output = np.zeros((rows, cols, rows, cols))
        for i in range(rows):
            for j in range(cols):
                for r in range(rows):
                    for s in range(cols):
                        if i == r and j == s:
                            output[i][j][r][s] += a[i][j] * (1 - a[i][j])
                        else:
                            output[i][j][r][s] -= a[i][j] * a[r][s]
        return output

    def backward(self, dx):
        # build the 4D Jacobian from the incoming tensor and collapse it over
        # its first two dimensions to get a 2D gradient of the same shape as dx
        jac = self.jacobian(dx)
        jac = np.sum(jac, axis=(0, 1))
        return jac

def train(x, wanted, m):
    model = [Linear(3, 4), Linear(4, 4), Softmax(), Linear(4, 3)]
    for i in range(m):
        # forward pass through every layer
        curr = x
        for layer in model:
            curr = layer.forward(curr)
        # error signal used as the initial gradient: prediction - target
        dL = curr - wanted
        if i % 10 == 0:
            print("current error sum: ", np.sum(dL))
        # backward pass in reverse layer order
        for layer in reversed(model):
            dL = layer.backward(dL)


if __name__ == '__main__':
    x = np.random.rand(3,2)
    wanted = np.random.rand(3,2)
    
    train(x, wanted, 100)

One of the outputs I got was the following:

current error sum:  9.846307612361286
current error sum:  5.5827143820787395
current error sum:  3.073972734037421
current error sum:  1.622572236217964
current error sum:  0.7876682647260167
current error sum:  0.3118719708795148
current error sum:  0.04493155201662072

Which shows that the error is being driven down. Obviously a better network could be built, and a better error function can and should be used; this is just a demonstration of implementing the post mentioned above in code, which is also why I wrote everything out without much optimization. One thing to note is that in the Softmax backward function the Jacobian is summed over the first 2 dimensions of the 4D tensor to turn it into a 2D tensor. The operations come out to the same answer as if the 4D Jacobians dA/dW and dA/dx were used to find dL/dW and dL/dx respectively and then summed afterwards. This is done to make the program slightly more efficient and to make it easier to show the separate layers and how they would translate to a usual torch.nn.Linear layer.
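For comparison, the same backward pass through softmax(q @ k) can also be written without materializing the 4D Jacobian at all, by contracting the upstream gradient with each softmax slice directly. Below is a minimal sketch of that standard vectorized form, using the 3x2 / 2x3 shapes from the question (the variable names are just for illustration):

import numpy as np

def softmax_vjp(a, g, axis=-1):
    # dL/dz = a * (g - sum(a * g)) along the softmax axis,
    # i.e. the Jacobian-vector product without building the Jacobian
    return a * (g - np.sum(a * g, axis=axis, keepdims=True))

q = np.random.rand(3, 2)
k = np.random.rand(2, 3)
z = q @ k                                    # 3x3 scores
a = np.exp(z - z.max(axis=-1, keepdims=True))
a /= a.sum(axis=-1, keepdims=True)           # row-wise softmax

g = np.random.rand(3, 3)                     # pretend upstream gradient dL/dA
dz = softmax_vjp(a, g)                       # dL/dZ, still 3x3
dq = dz @ k.T                                # dL/dQ, 3x2 like q
dk = q.T @ dz                                # dL/dK, 2x3 like k

Because each row's Jacobian is contracted with the upstream gradient immediately, every gradient keeps the shape of the tensor it belongs to, which is where the 9x6-versus-2x3 mismatch in the question disappears.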

Answered By: Sam Moldenha