How to add report_tensor_allocations_upon_oom to RunOptions in Keras

Question:

I’m trying to train a neural net on a GPU using Keras and am getting a “Resource exhausted: OOM when allocating tensor” error. The specific tensor it’s trying to allocate isn’t very big, so I assume some previous tensor consumed almost all the VRAM. The error message comes with a hint that suggests this:

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

That sounds good, but how do I do it? RunOptions appears to be a TensorFlow thing, and what little documentation I can find for it associates it with a “session”. I’m using Keras, so TensorFlow is hidden under a layer of abstraction and its sessions under another layer below that.

How do I dig underneath everything to set this option in such a way that it will take effect?

Asked By: dspeyer


Answers:

TF1 solution:

It's not as hard as it seems. What you need to know is that, according to the documentation, the **kwargs passed to model.compile are passed on to session.run.

So you can do something like:

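import tensorflow as tf

# Ask TensorFlow to list the allocated tensors when an OOM error occurs (TF1 API).
run_opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Extra keyword arguments given to compile are forwarded to session.run.
model.compile(loss="...", optimizer="...", metrics=["..."], options=run_opts)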

And it should be passed directly each time session.run is called.
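
For comparison, here is what passing the option to session.run looks like without Keras in between. This is only a minimal sketch: it assumes a TensorFlow build that provides the tf.compat.v1 module, and the placeholder and reduce_sum graph is a made-up stand-in for a real model.

```python
import tensorflow as tf

# Use graph mode so that an explicit Session can be created (needed on TF2 builds).
tf.compat.v1.disable_eager_execution()

# Request a dump of allocated tensors if this run hits an OOM error.
run_opts = tf.compat.v1.RunOptions(report_tensor_allocations_upon_oom=True)

# Illustrative graph only: sum the values of a small input tensor.
x = tf.compat.v1.placeholder(tf.float32, shape=[None, 4])
total = tf.reduce_sum(x)

with tf.compat.v1.Session() as sess:
    # The options argument applies to this session.run call only.
    result = sess.run(total, feed_dict={x: [[1.0, 2.0, 3.0, 4.0]]}, options=run_opts)
```

The option changes nothing in a successful run; it only adds the allocation report to the error message when the run fails with OOM.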

TF2:

The solution above works only for TF1. For TF2, unfortunately, there appears to be no easy solution yet.

Answered By: Dr. Snoopy

Currently, it is not possible to add the options to model.compile. See: https://github.com/tensorflow/tensorflow/issues/19911

Answered By: Richard

OOM means out of memory. The model may be using more memory than is available at that point.
Decrease batch_size significantly. I set it to 16, and then it worked fine.

Answered By: naam

I got the same error, but only when the training dataset was about the same size as my GPU memory. For example, with 4 GB of video card memory I can train the model with a dataset of about 3.5 GB. The workaround for me was to write a custom data_generator function that uses yield, indices, and lookback.
The other approach suggested to me was to learn the TensorFlow framework proper and work with tf.Session directly.
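
A data_generator along the lines described above might look like the sketch below. The parameter names (data, targets, lookback, batch_size) and the windowing scheme are assumptions for illustration, not the original code. The idea is that only one batch of windows is built at a time, so the full windowed dataset is never materialized at once.

```python
import random


def data_generator(data, targets, lookback, batch_size, shuffle=False, seed=None):
    """Yield (samples, labels) batches forever, building each window on demand.

    Hypothetical sketch: sample i is the window data[i - lookback:i],
    and its label is targets[i].
    """
    if len(data) <= lookback:
        raise ValueError("data must be longer than lookback")
    rng = random.Random(seed)
    # Every valid index marks the position just past the end of a window.
    indices = list(range(lookback, len(data)))
    while True:
        if shuffle:
            rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            samples = [data[i - lookback:i] for i in batch]
            labels = [targets[i] for i in batch]
            yield samples, labels


# Usage: ten time steps, windows of three steps, two samples per batch.
series = list(range(10))
gen = data_generator(series, series, lookback=3, batch_size=2)
first_samples, first_labels = next(gen)
```

For Keras, each batch would typically be converted to NumPy arrays before being yielded. Since the generator never stops, the number of steps per epoch has to be given to the training call.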

Answered By: Vlad Stenkin

OOM simply means "out of memory".

TensorFlow throws this error when it runs out of VRAM while loading batches into memory.

I was trying to train a Vision Transformer on the CIFAR-100 dataset.

GPU:
GTX 1650 w/ 4GB vRAM

Initially, I had the batch_size set to 256, which was totally insane for such a GPU, and I was getting the same OOM error.

After I lowered it to batch_size = 16 (or whatever lower value your GPU can handle), training worked perfectly fine.

So, always choose a smaller batch_size if you are training on laptops or mid-range GPUs.
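
To get a feel for why batch_size matters, the size of a single dense tensor can be estimated from its shape. The helper below is a hypothetical illustration, assuming float32 values (4 bytes each). It counts only one input batch of 32x32x3 CIFAR images. The activations inside a model are far larger, but they also grow linearly with the batch size.

```python
def tensor_megabytes(shape, bytes_per_element=4):
    """Return the size in MiB of a dense tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element / (1024 ** 2)


# One batch of CIFAR-sized float32 images at the two batch sizes mentioned above.
large_batch = tensor_megabytes((256, 32, 32, 3))
small_batch = tensor_megabytes((16, 32, 32, 3))
ratio = large_batch / small_batch  # memory scales with the batch dimension
```

Going from 256 to 16 cuts every batch-sized tensor by a factor of 16, which is often enough to fit a model on a 4 GB card.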

Answered By: Pranav Durai