Should Decoder Prediction Be Detached in PyTorch Training?

Question:

Hi, I have recently started using PyTorch for my research, which needs the encoder-decoder framework. PyTorch’s tutorials on this are wonderful, but there’s a little problem: when training the decoder without teacher forcing, i.e. when the prediction at the current time step is used as the input to the next, should that prediction be detached?

In this PyTorch tutorial, the prediction is detached (`decoder_input = topi.squeeze().detach()  # detach from history as input`), but it is not detached in this one (`top1 = output.max(1)[1]; output = (trg[t] if teacher_force else top1)`).

Both tutorials are RNN-based, so I am not sure whether the same applies to Transformer-based architectures. I would be grateful if someone could point out which one is the better practice. :)

Asked By: Alex


Answers:

Yes, you should detach it. Detaching a tensor removes it from the computational graph, so it is no longer tracked with respect to gradient calculations, which is exactly what you want here. Since the previous token can be seen as a constant that merely defines the starting point of the next step, its history can be discarded after one time step. If you don’t detach it, however, it keeps hanging around because it is still tracked in the computational graph, which consumes unnecessary memory.
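To make this concrete, here is a minimal sketch of a decoding loop without teacher forcing, following the pattern of the first tutorial. The model sizes and the use of a GRU are assumptions for illustration; the relevant line is the `.detach()` on the fed-back token:

```python
import torch
import torch.nn as nn

# Hypothetical toy dimensions, just to make the loop runnable.
vocab_size, hidden_size, max_len = 10, 16, 5

embedding = nn.Embedding(vocab_size, hidden_size)
rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
out_proj = nn.Linear(hidden_size, vocab_size)

decoder_input = torch.tensor([[0]])        # assumed <sos> token, batch of 1
hidden = torch.zeros(1, 1, hidden_size)
logits_per_step = []

for _ in range(max_len):
    emb = embedding(decoder_input)         # (1, 1, hidden_size)
    out, hidden = rnn(emb, hidden)
    logits = out_proj(out.squeeze(1))      # (1, vocab_size)
    logits_per_step.append(logits)
    topi = logits.argmax(dim=1, keepdim=True)
    decoder_input = topi.detach()          # cut the fed-back token out of the graph

loss = torch.stack(logits_per_step).sum()  # stand-in for a real loss
loss.backward()                            # gradients still flow through `hidden`
```

Note that gradients still propagate across time steps through the hidden state; detaching only severs the path through the discrete token that is fed back in.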

Realistically, the memory overhead is usually rather small, so you would only notice it with a lot of time steps while at the upper limit of your GPU memory. Regard it as a micro-optimisation.

There are instances where you absolutely need to detach a tensor to avoid unwanted backpropagation. That generally happens when the same input is used in two different models: backward consumes the graph by default, so if two backpropagations try to go through the same path, the graph is no longer available for the second one and it fails.
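A hypothetical sketch of that failure mode: two heads share one backbone's output. The first backward frees the backbone's graph, so a second backward through the same activation raises a `RuntimeError`, while detaching gives the second head its own independent graph:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
backbone = nn.Linear(4, 4)   # shared part of the graph
head_a = nn.Linear(4, 1)
head_b = nn.Linear(4, 1)

x = torch.randn(2, 4)
features = backbone(x)

head_a(features).sum().backward()        # frees the backbone's graph buffers

try:
    head_b(features).sum().backward()    # tries to reuse the freed graph
    second_backward_failed = False
except RuntimeError:
    second_backward_failed = True        # "Trying to backward through the graph a second time"

# Detaching the shared activation lets head_b backpropagate on its own:
head_b(features.detach()).sum().backward()
```

Alternatively, `retain_graph=True` on the first backward keeps the shared graph alive, at the cost of holding its buffers in memory.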

Answered By: Michael Jungo