How to programmatically determine available GPU memory with TensorFlow?

Question:

For a vector quantization (k-means) program I would like to know the amount of available memory on the current GPU (if there is one). This is needed to choose an optimal batch size so that as few batches as possible are needed to process the complete data set.

I have written the following test program:

import tensorflow as tf
import numpy as np
from kmeanstf import KMeansTF
print("GPU Available: ", tf.test.is_gpu_available())

nn = 1000
dd = 250000
print("{:,d} bytes".format(nn*dd*4))  # size of one float32 tensor: 1,000,000,000 bytes
dic = {}
for x in "ABCD":
    dic[x] = tf.random.normal((nn, dd))  # allocate another ~1 GB tensor on the GPU
    print(x, dic[x][:1, :2])

print("done...")

This is a typical output on my system (Ubuntu 18.04 LTS, GTX 1060 6 GB). Please note the core dump.

python misc/maxmem.py 
GPU Available:  True
1,000,000,000 bytes
A tf.Tensor([[-0.23787294 -2.0841186 ]], shape=(1, 2), dtype=float32)
B tf.Tensor([[ 0.23762687 -1.1229591 ]], shape=(1, 2), dtype=float32)
C tf.Tensor([[-1.2672468   0.92139906]], shape=(1, 2), dtype=float32)
2020-01-02 17:35:05.988473: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 953.67MiB (rounded to 1000000000).  Current allocation summary follows.
2020-01-02 17:35:05.988752: W tensorflow/core/common_runtime/bfc_allocator.cc:424] **************************************************************************************************xx
2020-01-02 17:35:05.988835: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at cwise_ops_common.cc:82 : Resource exhausted: OOM when allocating tensor with shape[1000,250000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Segmentation fault (core dumped)

Occasionally I get a Python exception instead of a core dump (see below). That would actually be better, since I could catch it and thus determine the maximum available memory by trial and error (a sketch of such a probe follows the traceback below). But it alternates with core dumps:

python misc/maxmem.py 
GPU Available:  True
1,000,000,000 bytes
A tf.Tensor([[-0.73510283 -0.94611156]], shape=(1, 2), dtype=float32)
B tf.Tensor([[-0.8458411  0.552555 ]], shape=(1, 2), dtype=float32)
C tf.Tensor([[0.30532074 0.266423  ]], shape=(1, 2), dtype=float32)
2020-01-02 17:35:26.401156: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 953.67MiB (rounded to 1000000000).  Current allocation summary follows.
2020-01-02 17:35:26.401486: W tensorflow/core/common_runtime/bfc_allocator.cc:424] **************************************************************************************************xx
2020-01-02 17:35:26.401571: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at cwise_ops_common.cc:82 : Resource exhausted: OOM when allocating tensor with shape[1000,250000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "misc/maxmem.py", line 11, in <module>
    dic[x]=tf.random.normal((nn,dd))
  File "/home/fritzke/miniconda2/envs/tf20b/lib/python3.7/site-packages/tensorflow_core/python/ops/random_ops.py", line 76, in random_normal
    value = math_ops.add(mul, mean_tensor, name=name)
  File "/home/fritzke/miniconda2/envs/tf20b/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_math_ops.py", line 391, in add
    _six.raise_from(_core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1000,250000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Add] name: random_normal/
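Such a trial-and-error probe would look roughly like the following sketch (purely illustrative; it assumes the OOM always surfaces as tf.errors.ResourceExhaustedError, which, as the core dumps above show, it does not):

import tensorflow as tf

def probe_max_rows(dd=250000, start_rows=100000):
    # Halve the row count until an allocation of (rows, dd) float32 succeeds.
    rows = start_rows
    while rows > 0:
        try:
            t = tf.random.normal((rows, dd))   # tries to allocate rows*dd*4 bytes
            del t                              # drop the reference again
            return rows
        except tf.errors.ResourceExhaustedError:
            rows //= 2                         # too big: retry with half the rows
    return 0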

How could I reliably get this information for whatever system the software is running on?

Asked By: Barden


Answers:

I actually found an answer in this old question of mine. To bring some additional benefit to readers, I tested the program mentioned there:

import nvidia_smi  # e.g. from the nvidia-ml-py3 package

nvidia_smi.nvmlInit()

handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
# card id 0 hardcoded here, there is also a call to get all available card ids, so we could iterate

info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)

print("Total memory:", info.total)
print("Free memory:", info.free)
print("Used memory:", info.used)

nvidia_smi.nvmlShutdown()

on Colab with the following result:

Total memory: 17071734784
Free memory: 17071734784
Used memory: 0

The actual GPU I had there was a Tesla P100, as can be seen by executing

!nvidia-smi

and observing the output

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
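To tie this back to the original batch-size question, the free-memory figure can be turned into a batch size. A minimal sketch (the 0.9 safety factor and the float32 row size are my own assumptions, not part of the original answer):

import nvidia_smi

def max_batch_rows(row_elems, bytes_per_elem=4, safety=0.9, device_index=0):
    # rough upper bound on how many rows of row_elems float32 values fit into free GPU memory
    nvidia_smi.nvmlInit()
    handle = nvidia_smi.nvmlDeviceGetHandleByIndex(device_index)
    free_bytes = nvidia_smi.nvmlDeviceGetMemoryInfo(handle).free
    nvidia_smi.nvmlShutdown()
    return int(free_bytes * safety) // (row_elems * bytes_per_elem)

print(max_batch_rows(row_elems=250000))  # rows of shape (250000,) float32 that should still fit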
Answered By: Barden

This code will return the free GPU memory in MiB for each GPU:

import subprocess as sp

def get_gpu_memory():
    command = "nvidia-smi --query-gpu=memory.free --format=csv"
    # drop the trailing empty line and the "memory.free [MiB]" header row
    memory_free_info = sp.check_output(command.split()).decode('ascii').split('\n')[:-1][1:]
    memory_free_values = [int(x.split()[0]) for x in memory_free_info]
    return memory_free_values

get_gpu_memory()

This approach relies on nvidia-smi being installed (which is almost always the case for NVIDIA GPUs) and is therefore limited to NVIDIA GPUs.
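For example (a hypothetical usage of my own, not part of the original answer), the returned list can be used to pin the process to the least-loaded GPU before TensorFlow initializes:

import os

free = get_gpu_memory()               # e.g. [5021, 11019] on a two-GPU machine
best_gpu = free.index(max(free))      # index of the GPU with the most free MiB
os.environ["CUDA_VISIBLE_DEVICES"] = str(best_gpu)  # set before TensorFlow touches the GPU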

Answered By: y.selivonchyk

If you’re using tensorflow-gpu==2.5 or newer, you can use

tf.config.experimental.get_memory_info('GPU:0')

to get the GPU memory actually consumed by TF. nvidia-smi tells you little here, because by default TF allocates (nearly) all GPU memory for itself, leaving nvidia-smi no way to show how much of that pre-allocated memory is actually being used.
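A minimal usage sketch, assuming TF 2.5+ and at least one visible GPU (the tensor shape is just the one from the question):

import tensorflow as tf

# optional: let TF grow its allocation instead of grabbing all GPU memory up front
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

x = tf.random.normal((1000, 250000))                    # ~1 GB of float32 data
info = tf.config.experimental.get_memory_info('GPU:0')
print(info['current'], info['peak'])                    # bytes in use now / peak bytes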

Answered By: Captain Trojan

In summary, the solution that worked best is using tf.config.experimental.get_memory_info('DEVICE_NAME').

This function returns a dictionary with two keys:

  • ‘current’: The current memory used by the device, in bytes
  • ‘peak’: The peak memory used by the device across the run of the program, in bytes.

The values of these keys reflect the memory actually used, not the memory allocated (which is what nvidia-smi reports).

In reality, TensorFlow allocates all GPU memory by default, which makes checking the used memory with nvidia-smi useless for your code.
Even if tf.config.experimental.set_memory_growth is set to true, TensorFlow no longer allocates the whole available memory, but it still allocates more memory than is actually in use, and it does so in discrete steps,
e.g. 4589 MiB, then 8717 MiB, then 16943 MiB, then 30651 MiB, etc.

A small caveat concerning get_memory_info(): it does not return correct values when called inside a tf.function()-decorated function. Therefore, read the 'peak' key after the tf.function()-decorated function has executed
to determine the peak memory used (a sketch follows below).
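A minimal sketch of that pattern (the device string 'GPU:0' and the matmul workload are illustrative):

import tensorflow as tf

@tf.function
def heavy_step(x):
    return tf.matmul(x, x, transpose_b=True)   # some memory-hungry work

x = tf.random.normal((4096, 4096))
heavy_step(x)                                  # run the compiled function first
peak = tf.config.experimental.get_memory_info('GPU:0')['peak']
print("peak bytes:", peak)                     # query the peak only afterwards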

For older versions of TensorFlow, tf.config.experimental.get_memory_usage('DEVICE_NAME') was the only available function, and it only returned the used memory (with no option for determining the peak memory).

As a final note, you can also consider the TensorFlow Profiler, available with TensorBoard, to get detailed information about your memory usage.
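A minimal sketch of capturing such a profile (the log directory name is arbitrary); the trace can then be inspected in TensorBoard's Profile tab:

import tensorflow as tf

tf.profiler.experimental.start('logdir')     # begin capturing a profile
x = tf.random.normal((1000, 250000))         # the workload to inspect
y = tf.reduce_sum(x)
tf.profiler.experimental.stop()              # write the trace to 'logdir'
# then run: tensorboard --logdir logdir  and open the "Profile" tab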

Hope this helps 🙂

Answered By: George El Haber

Sharing my over-engineered solution based on y.selivonchyk's answer.

import os
import tempfile
import subprocess
import traceback
import pandas as pd

TH = 0.05  # treat a GPU as available if less than 5% of its memory is in use

def get_one_available_gpu_device_id():
    gpu_device=-1
    try:
        with tempfile.TemporaryDirectory() as tmpdirname:
            fname = os.path.join(tmpdirname,'query.csv')
            cmd_list = f'nvidia-smi --format=csv --query-gpu=memory.total,memory.free,memory.used,pci.bus_id,index --filename={fname}'.split(' ')
            subprocess.check_output(cmd_list)
            if not os.path.exists(fname):
                raise ValueError("csv file not found")
            df = pd.read_csv(fname)
            # nvidia-smi's CSV output puts a space after each comma, so every column
            # after the first keeps a leading space in both its header and its values
            df['gpu_mem_total'] = df['memory.total [MiB]'].apply(lambda x: int(x.split(' ')[0]))
            df['gpu_mem_used'] = df[' memory.used [MiB]'].apply(lambda x: int(x.split(' ')[1]))
            df['gpu_usage_prct'] = df['gpu_mem_used'] / df['gpu_mem_total']
            df['gpu_id'] = df[' index']
            print(df)
            df = df.sort_values('gpu_usage_prct')
            avail = df[df.gpu_usage_prct < TH].reset_index()
            if len(avail)>0:
                gpu_device = avail.loc[0,'gpu_id']
    except:
        traceback.print_exc()

    return int(gpu_device)

gpu_device = get_one_available_gpu_device_id()
print(f'gpu_device {gpu_device}')

You can then likely run a subprocess pinned to the vacant GPU by setting
f"CUDA_VISIBLE_DEVICES={gpu_device}" in its environment (see the sketch below).

Answered By: pangyuteng