How could the softmax layer be used as one of the middle layers in a neural network and be backpropagated properly?
Question:
I am currently writing my own machine learning library in C++ as an exercise to improve both my coding skills and my understanding of machine learning. Right now I am building a vision transformer from scratch to better understand transformers and how they can be used for images. Part of the code that I am trying to re-create the backward pass for is below:
energy = torch.einsum('bhqd, bhkd -> bhqk', queries, keys) # batch, num_heads, query_len, key_len
scaling = self.emb_size ** (1/2)
att = F.softmax(energy, dim=-1) / scaling
where queries and keys are 4-dimensional tensors, the softmax function is applied along the last dimension, and energy is a matrix multiplication applied to each individual matrix. To recreate the backpropagation I am trying to work it out in terms of smaller 2D matrices.
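For concreteness, here is a small numpy sketch (the shapes and names below are placeholders I picked, not from my actual code) of what that einsum line computes for every batch and head:
import numpy as np

# placeholder sizes: batch=2, heads=3, sequence length=4, head dim=5
b, h, n, d = 2, 3, 4, 5
queries = np.random.rand(b, h, n, d)
keys = np.random.rand(b, h, n, d)

# 'bhqd, bhkd -> bhqk' is q @ k^T done independently for every (batch, head) pair
energy = np.einsum('bhqd, bhkd -> bhqk', queries, keys)
same = np.matmul(queries, np.swapaxes(keys, -1, -2))
print(np.allclose(energy, same))  # True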
I am going to define q as a 3x2 matrix and k as a 2x3 matrix. If I multiply them to get a matrix a, then a is a 3x3 matrix, and once the softmax function is applied the dimensions stay the same.
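As a sanity check, a minimal numpy sketch of this 2D toy forward pass (the row-wise softmax mirrors dim=-1 in the PyTorch snippet above):
import numpy as np

q = np.random.rand(3, 2)   # 3x2
k = np.random.rand(2, 3)   # 2x3
a = q @ k                  # 3x3

# row-wise softmax keeps the 3x3 shape
e = np.exp(a - a.max(axis=-1, keepdims=True))
s = e / e.sum(axis=-1, keepdims=True)
print(s.shape)  # (3, 3)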
What I am having trouble with is finding the gradient of q and k with respect to the loss. This is how I would find the gradient of a, or in this case dLdA:
for i in range(len(a)):
    for j in range(len(a)):
        if i == j:
            dLdA[i,j] = a[i] * (1-a[i])
        else:
            dLdA[i,j] = -a[i] * a[j]
I put it in terms of Python for easier readability. From there I would get a 9x9 matrix for dLdA. From there I need to get dLdQ and dLdK, where dLdQ = dLdA * dAdQ and dLdK = dLdA * dAdK. If I then wanted to compute the Jacobian dAdK, from what I understand I would get this:
dAdK =
[dA1dK1 dA1dK2 ... dA1dK6]
.
.
.
[dA9dK1 dA9dK2 ... dA9dK6]
This is because there are 9 total elements in a and 6 total elements in k. dAdK is then a 9x6 matrix and dLdA is a 9x9 matrix, so when you do the matrix multiplication you get a 9x6 matrix, but k is a 2x3 matrix. What am I missing to be able to backpropagate this correctly?
Answers:
I am basing all the math in this answer on this post; if you want an in-depth explanation of why the math works, I recommend reading it. Here is a simpler implementation of it using numpy:
import numpy as np

class Linear:
    def __init__(self, in_rows, out_rows):
        self.Weight = np.random.rand(out_rows, in_rows)
        self.Bias = np.random.rand(out_rows, 1)
        self.prev = None
        self.LR = 0.01

    def forward(self, x):
        self.prev = x
        a = np.dot(self.Weight, x)
        a += self.Bias
        return a

    def backward(self, dx):
        # bias gradient: sum the incoming gradient over the last (sample) axis
        db = np.sum(dx, axis=len(dx.shape)-1)
        db = db.reshape(self.Bias.shape)
        self.Bias -= (db * self.LR)
        dw = np.dot(dx, self.prev.T)     # dL/dW
        dl = np.dot(self.Weight.T, dx)   # gradient passed back to the previous layer
        self.Weight -= (dw * self.LR)
        return dl

class Softmax:
    def forward(self, x):
        mx = np.max(x, axis=1, keepdims=True)
        x = x - mx   # log-sum-exp trick for numerical stability
        e = np.exp(x)
        probs = e / np.sum(e, axis=1, keepdims=True)
        return probs

    def jacobian(self, a):
        # full 4D Jacobian: output[i, j, r, s] = d a[i, j] / d z[r, s]
        rows = a.shape[0]
        cols = a.shape[1]
        output = np.zeros((rows, cols, rows, cols))
        for i in range(output.shape[0]):
            for j in range(output.shape[1]):
                for r in range(output.shape[2]):
                    for s in range(output.shape[3]):
                        if i == r and j == s:
                            output[i][j][r][s] += a[i][j] * (1 - a[i][j])
                        else:
                            output[i][j][r][s] -= a[i][j] * a[r][s]
        return output

    def backward(self, dx):
        jac = self.jacobian(dx)
        # collapse the 4D Jacobian over its first two dimensions to get a 2D gradient
        jac = np.sum(jac, axis=(0, 1))
        return jac

def train(x, wanted, m):
    model = [Linear(3, 4), Linear(4, 4), Softmax(), Linear(4, 3)]
    for i in range(0, m):
        curr = x
        for layer in model:
            curr = layer.forward(curr)
        dL = curr - wanted
        if i % 10 == 0:
            print("current error sum: ", np.sum(dL))
        for layer in reversed(model):
            dL = layer.backward(dL)

if __name__ == '__main__':
    x = np.random.rand(3, 2)
    wanted = np.random.rand(3, 2)
    train(x, wanted, 100)
One of the outputs I got was the following:
current error sum: 9.846307612361286
current error sum: 5.5827143820787395
current error sum: 3.073972734037421
current error sum: 1.622572236217964
current error sum: 0.7876682647260167
current error sum: 0.3118719708795148
current error sum: 0.04493155201662072
This shows that the error can be corrected. Obviously a better network can be built, and a better error function can and should be used; this is just a demonstration of how to implement the math from the post mentioned above, which is also why I wrote everything out without much optimization. One thing to note is that in the Softmax backward function, the Jacobian is summed over the first two dimensions of the 4D tensor to turn it into a 2D tensor. The operations come out to the same answer as if the 4D Jacobians dA/dW and dA/dx were used to find dL/dW and dL/dx respectively and then summed afterwards. This is done to make the program slightly more efficient, and to keep the separate layers easy to follow and show how they would translate to a usual torch.nn.Linear layer.
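For comparison, the contraction of an upstream gradient with the per-row softmax Jacobian is often written in closed form instead of materializing a 4D Jacobian. Below is a minimal sketch of that standard row-wise softmax backward; note this is the textbook formulation, not the exact reduction used in the Softmax class above:
import numpy as np

def softmax_rows(z):
    z = z - z.max(axis=-1, keepdims=True)   # log-sum-exp trick for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_backward_rows(s, grad_out):
    # For each row: dL/dz_i = s_i * (g_i - sum_j g_j * s_j), which is grad_out
    # contracted with the row Jacobian ds_i/dz_j = s_i * (delta_ij - s_j).
    dot = np.sum(grad_out * s, axis=-1, keepdims=True)
    return s * (grad_out - dot)

s = softmax_rows(np.random.rand(4, 2))
grad_out = np.random.rand(4, 2)
print(softmax_backward_rows(s, grad_out).shape)  # (4, 2)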