TensorFlow GradientTape "Gradients does not exist for variables" intermittently
Question:
When training my network I am occasionally met with the warning:
W0722 11:47:35.101842 140641577297728 optimizer_v2.py:928] Gradients does not exist for variables ['model/conv1d_x/Variable:0'] when minimizing the loss.
This happens sporadically at infrequent intervals (maybe once in every 20 successful steps). My model basically has two paths which join together with concatenations at various positions in the network. To illustrate this, here is a simplified example of what I mean.
class myModel(tf.keras.Model):
    def __init__(self):
        self.conv1 = Conv2D(32)
        self.conv2 = Conv2D(32)
        self.conv3 = Conv2D(16)

    def call(self, inputs):
        net1 = self.conv1(inputs)
        net2 = self.conv2(inputs)
        net = tf.concat([net1, net2], axis=2)
        net = self.conv3(net)
        end_points = tf.nn.softmax(net)

model = myModel()

with tf.GradientTape() as tape:
    prediction = model(image)
    loss = myloss(labels, prediction)
gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
In reality my network is much larger, but the variables that generally don’t have gradients tend to be the ones at the top of the network. Before each Conv2D layer I also have a custom gradient. Sometimes when the error appears I notice that the gradient function for that layer has not been called.
My question is: how can the gradient tape sometimes take what appear to be different paths when propagating backwards through my network? My secondary question: is this caused by having two separate routes through my network (i.e. conv1 AND conv2)? Is there a fundamental flaw in this network architecture?
Ideally, could I tell GradientTape() that it must find the gradients for each of the top layers?
Answers:
I had an issue that seems similar. It may or may not help, depending on what your network actually looks like. I had a multi-output network and realised that I was applying the gradients for each output separately, so for each separate loss there was a branch of the network for which the gradient was zero. This was totally valid: each time it corresponded to the terminal layers immediately before the non-targeted outputs. For this reason I ended up replacing any None gradients with tf.zeros_like, and training could proceed. Could you have the same problem with multiple input heads to your network, if it’s always at the top of the graph?
(ETA: the solution by Nguyễn Thu below is the code version of what I’m describing above, and exactly how I dealt with it.)
I’ve seen other answers where gradients weren’t calculated because tensors aren’t watched by default and you have to add them, but that doesn’t look like your issue, as you should only be dealing with model.trainable_variables. Or perhaps your myloss function is occasionally getting a NaN result, or casting to a NumPy array, depending on your batch composition, which would explain the sporadic nature (e.g. perhaps it happens on batches that have no instances of a minority class, if your data is very imbalanced?).
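I had the same problem. I found a solution using customized gradients:
def _compute_gradients(tensor, var_list):
    # Note: tf.gradients only works in graph mode (e.g. inside a tf.function), not eagerly.
    grads = tf.gradients(tensor, var_list)
    # Replace each missing (None) gradient with zeros shaped like its variable.
    return [grad if grad is not None else tf.zeros_like(var)
            for var, grad in zip(var_list, grads)]
(from a GitHub troubleshooting thread)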
I also encountered the same error. It was because I passed the wrong trainable variables to tape.gradient(), in case it helps someone.
In my example, self.encoder_model.get_trainable_variables() was not returning the right variables:
@tf.function
def train_step(x_batch):
    with tf.GradientTape() as tape:
        loss = self.encoder_model.loss.compute_loss(x_batch)
    gradients = tape.gradient(loss, self.encoder_model.get_trainable_variables())
    self.optimizer.apply_gradients(zip(gradients, self.encoder_model.get_trainable_variables()))
If missing gradients are expected, this warning can be suppressed by this workaround:
optimizer.apply_gradients(
    (grad, var)
    for (grad, var) in zip(gradients, model.trainable_variables)
    if grad is not None
)
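Before dropping the None entries it can help to see which variables they belong to, since an unexpected name usually points at the real bug. A minimal sketch, assuming a hypothetical helper name `split_missing_gradients`; it relies only on each variable having a `.name` attribute, as `tf.Variable` does, so the stand-in objects below are purely illustrative:

```python
from types import SimpleNamespace


def split_missing_gradients(gradients, variables):
    """Split (gradient, variable) pairs into usable pairs and missing names.

    Hypothetical helper: `usable_pairs` can be handed to
    optimizer.apply_gradients, `missing_names` can be logged or inspected.
    """
    usable_pairs = []
    missing_names = []
    for grad, var in zip(gradients, variables):
        if grad is None:
            missing_names.append(var.name)
        else:
            usable_pairs.append((grad, var))
    return usable_pairs, missing_names


# Stand-ins for tf.Variable objects; only the .name attribute is used here.
demo_variables = [SimpleNamespace(name="conv1/kernel:0"),
                  SimpleNamespace(name="conv2/kernel:0")]
demo_gradients = [0.5, None]

usable_pairs, missing_names = split_missing_gradients(demo_gradients, demo_variables)
print(missing_names)  # ['conv2/kernel:0']
```

In a real training step the two lists would come from tape.gradient(...) and model.trainable_variables.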
GradientTape’s gradient method has an unconnected_gradients parameter that lets you specify whether unconnected gradients should be None or zero. See the docs: https://www.tensorflow.org/api_docs/python/tf/GradientTape#gradient
So you could change the line:
gradients = tape.gradient(loss, model.trainable_variables)
to
gradients = tape.gradient(loss, model.trainable_variables,
                          unconnected_gradients=tf.UnconnectedGradients.ZERO)
This worked for me.
EDIT – IMPORTANT: This is only a solution if you actually expect some gradients to be zero. This is NOT a solution if the error results from a broken backpropagation. In that case you will need to find and fix where it is broken.
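To make the difference between the two settings concrete, here is a small sketch with two toy variables (the names `used` and `unused` are assumptions for illustration, not from the question), run eagerly under TensorFlow 2.x:

```python
import tensorflow as tf

used = tf.Variable(2.0, name="used")
unused = tf.Variable(3.0, name="unused")  # never enters the loss

# persistent=True allows tape.gradient to be called twice on the same tape.
with tf.GradientTape(persistent=True) as tape:
    loss = used * used

# Default behaviour: the unconnected variable gets None.
default_grads = tape.gradient(loss, [used, unused])

# With ZERO: the unconnected variable gets a zero tensor instead.
zero_grads = tape.gradient(
    loss, [used, unused],
    unconnected_gradients=tf.UnconnectedGradients.ZERO)
del tape  # release the resources held by the persistent tape

print(default_grads[1])       # None
print(zero_grads[1].numpy())  # 0.0
```

Note that the zero gradient hides the disconnection rather than repairing it, which is why this only makes sense when the missing gradient is expected.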
The solutions given by Nguyễn and gkennos will suppress the error because they replace every None with zeros.
However, it is a big issue if your gradient is null at any point in time.
The problem described above is certainly caused by unconnected variables (by default, PyTorch throws a runtime error in this case).
The most common case of unconnected layers can be exemplified as follows:
def some_func(x, w1, w2, w3):
    x1 = x * w1
    x2 = x1 + w2  # x2 is never used after this line
    x3 = x1 / w3
    return x3
Now observe that x2 is unconnected, so no gradient will be propagated through it and the variable that only feeds x2 gets no gradient. Carefully debug your code for unconnected variables.
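One way to look for this kind of disconnection is to ask the tape for gradients and list the variables that come back as None. A sketch under assumptions: `w1`, `w2` and `w3` are hypothetical variables standing in for the ones used in the function above, and the code runs eagerly under TensorFlow 2.x:

```python
import tensorflow as tf

w1 = tf.Variable(2.0, name="w1")
w2 = tf.Variable(3.0, name="w2")
w3 = tf.Variable(4.0, name="w3")


def some_func(x):
    x1 = x * w1
    x2 = x1 + w2  # x2 is never used again, so w2 cannot influence the result
    x3 = x1 / w3
    return x3


with tf.GradientTape() as tape:
    y = some_func(tf.constant(1.0))

variables = [w1, w2, w3]
grads = tape.gradient(y, variables)

# Collect the names of variables the tape could not reach from y.
disconnected = [v.name for v, g in zip(variables, grads) if g is None]
print(disconnected)
```

Only the variable that feeds the dead branch shows up in the list; the other two receive ordinary gradients.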
There are no gradients because the variable doesn’t affect the answer.
In this code, the call function is missing a return:
class myModel(tf.keras.Model):
    def __init__(self):
        self.conv1 = Conv2D(32)
        self.conv2 = Conv2D(32)
        self.conv3 = Conv2D(16)

    def call(self, inputs):
        net1 = self.conv1(inputs)
        net2 = self.conv2(inputs)
        net = tf.concat([net1, net2], axis=2)
        net = self.conv3(net)
        return tf.nn.softmax(net)  # Change this line: return the output
TL;DR: make sure you are using CategoricalCrossentropy and not BinaryCrossentropy.
An incorrect loss function for your application could cause this. For example, if your outputs are one-hot encoded categorical labels, e.g. [0,1] or [1,0], you need to use a categorical cross-entropy loss. If you use something like a binary cross-entropy loss by mistake, then no gradients will be produced for the weights leading to the non-zeroth component of the NN output.
Revisiting this question, it is actually quite unhelpful and probably should have been downvoted more! There are many scenarios where your gradient has invalid values in it. But ultimately, at some point in the gradient computation a NaN value was created.
In my scenario I was using a custom gradient op, and ultimately there was a bug in my gradient calculation code. This bug caused the NaN under some circumstances.
If you are not using custom gradient ops, then likely you’ve either made a mistake in your network definition (e.g., disconnected variable as other answers suggest) or there is some issue with your data.
In summary, no single problem causes this; it is just an artefact of a) a buggy gradient calculation, b) a buggy network definition, c) an issue with your data or d) anything else. There is no one solution for this question; the warning is just the result of an error somewhere else.
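Since the root cause here was a NaN, a quick check of the gradients on each step can narrow down where it first appears. A rough sketch, assuming eager execution, dense gradients (not tf.IndexedSlices) and a hypothetical helper name `non_finite_gradient_names`:

```python
import tensorflow as tf


def non_finite_gradient_names(gradients, variables):
    """Names of variables whose gradient is None or contains NaN/Inf.

    Hypothetical helper; assumes eager execution and dense gradients.
    """
    bad = []
    for grad, var in zip(gradients, variables):
        if grad is None or not bool(tf.reduce_all(tf.math.is_finite(grad))):
            bad.append(var.name)
    return bad


# Toy data: the second gradient deliberately contains a NaN.
a = tf.Variable([1.0, 2.0], name="a")
b = tf.Variable([1.0, 2.0], name="b")
demo_grads = [tf.constant([0.1, 0.2]), tf.constant([0.1, float("nan")])]

bad_names = non_finite_gradient_names(demo_grads, [a, b])
print(bad_names)
```

TensorFlow also provides tf.debugging.check_numerics, which raises an error when a tensor contains NaN or Inf; wrapping intermediate tensors of a custom gradient with it is another way to locate the first bad value.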
To directly answer my questions in the original post:
Q. How can the gradient tape sometimes take what appears to be different paths when propagating backwards through my network?
A. It doesn’t. A bug in the input to the gradient function resulted in no gradients being calculated for that layer.
Q. My secondary question, is this caused by having two separate routes through my network (i.e. conv1 AND conv2). Is there a fundamental flaw in this network architecture?
A. No, there is nothing wrong with this architecture.