How to get current available GPUs in tensorflow?
Question:
I have a plan to use distributed TensorFlow, and I saw TensorFlow can use GPUs for training and testing. In a cluster environment, each machine could have 0 or 1 or more GPUs, and I want to run my TensorFlow graph into GPUs on as many machines as possible.
I found that when running tf.Session()
TensorFlow gives information about GPU in the log messages like below:
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
My question is how do I get information about current available GPU from TensorFlow? I can get loaded GPU information from the log, but I want to do it in a more sophisticated, programmatic way.
I also could restrict GPUs intentionally using the CUDA_VISIBLE_DEVICES environment variable, so I don’t want to know a way of getting GPU information from OS kernel.
In short, I want a function like tf.get_available_gpus()
that will return ['/gpu:0', '/gpu:1']
if there are two GPUs available in the machine. How can I implement this?
Answers:
There is an undocumented method called device_lib.list_local_devices()
that enables you to list the devices available in the local process. (N.B. As an undocumented method, this is subject to backwards incompatible changes.) The function returns a list of DeviceAttributes
protocol buffer objects. You can extract a list of string device names for the GPU devices as follows:
from tensorflow.python.client import device_lib
def get_available_gpus():
local_device_protos = device_lib.list_local_devices()
return [x.name for x in local_device_protos if x.device_type == 'GPU']
Note that (at least up to TensorFlow 1.4), calling device_lib.list_local_devices()
will run some initialization code that, by default, will allocate all of the GPU memory on all of the devices (GitHub issue). To avoid this, first create a session with an explicitly small per_process_gpu_fraction
, or allow_growth=True
, to prevent all of the memory being allocated. See this question for more details.
You can check all device list using following code:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()
Apart from the excellent explanation by Mrry, where he suggested to use device_lib.list_local_devices()
I can show you how you can check for GPU related information from the command line.
Because currently only Nvidia’s gpus work for NN frameworks, the answer covers only them. Nvidia has a page where they document how you can use the /proc filesystem interface to obtain run-time information about the driver, any installed NVIDIA graphics cards, and the AGP status.
/proc/driver/nvidia/gpus/0..N/information
Provide information about
each of the installed NVIDIA graphics adapters (model name, IRQ, BIOS
version, Bus Type). Note that the BIOS version is only available while
X is running.
So you can run this from command line cat /proc/driver/nvidia/gpus/0/information
and see information about your first GPU. It is easy to run this from python and also you can check second, third, fourth GPU till it will fail.
Definitely Mrry’s answer is more robust and I am not sure whether my answer will work on non-linux machine, but that Nvidia’s page provide other interesting information, which not many people know about.
There is also a method in the test util.
So all that has to be done is:
tf.test.is_gpu_available()
and/or
tf.test.gpu_device_name()
Look up the Tensorflow docs for arguments.
The accepted answer gives you the number of GPUs but it also allocates all the memory on those GPUs. You can avoid this by creating a session with fixed lower memory before calling device_lib.list_local_devices() which may be unwanted for some applications.
I ended up using nvidia-smi to get the number of GPUs without allocating any memory on them.
import subprocess
n = str(subprocess.check_output(["nvidia-smi", "-L"])).count('UUID')
Since TensorFlow 2.1, you can use tf.config.list_physical_devices('GPU')
:
import tensorflow as tf
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
print("Name:", gpu.name, " Type:", gpu.device_type)
If you have two GPUs installed, it outputs this:
Name: /physical_device:GPU:0 Type: GPU
Name: /physical_device:GPU:1 Type: GPU
In TF 2.0, you must add experimental
:
gpus = tf.config.experimental.list_physical_devices('GPU')
See:
The following works in tensorflow 2:
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
print("Name:", gpu.name, " Type:", gpu.device_type)
From 2.1, you can drop experimental
:
gpus = tf.config.list_physical_devices('GPU')
https://www.tensorflow.org/api_docs/python/tf/config/list_physical_devices
Use this way and check all parts :
from __future__ import absolute_import, division, print_function, unicode_literals
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
version = tf.__version__
executing_eagerly = tf.executing_eagerly()
hub_version = hub.__version__
available = tf.config.experimental.list_physical_devices("GPU")
print("Version: ", version)
print("Eager mode: ", executing_eagerly)
print("Hub Version: ", h_version)
print("GPU is", "available" if avai else "NOT AVAILABLE")
Ensure you have the latest TensorFlow 2.x GPU installed in your GPU supporting machine,
Execute the following code in python,
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
Will get an output looks like,
2020-02-07 10:45:37.587838: I
tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful
NUMA node read from SysFS had negative value (-1), but there must be
at least one NUMA node, so returning NUMA node zero 2020-02-07
10:45:37.588896: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible
gpu devices: 0, 1, 2, 3, 4, 5, 6, 7 Num GPUs Available: 8
I got a GPU called NVIDIA GTX GeForce 1650 Ti
in my machine with tensorflow-gpu==2.2.0
Run the following two lines of code:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
Output:
Num GPUs Available: 1
I am working on TF-2.1 and torch, so I don’t want to specific this automacit choosing in any ML frame. I just use original nvidia-smi and os.environ to get a vacant gpu.
def auto_gpu_selection(usage_max=0.01, mem_max=0.05):
"""Auto set CUDA_VISIBLE_DEVICES for gpu
:param mem_max: max percentage of GPU utility
:param usage_max: max percentage of GPU memory
:return:
"""
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
log = str(subprocess.check_output("nvidia-smi", shell=True)).split(r"n")[6:-1]
gpu = 0
# Maximum of GPUS, 8 is enough for most
for i in range(8):
idx = i*3 + 2
if idx > log.__len__()-1:
break
inf = log[idx].split("|")
if inf.__len__() < 3:
break
usage = int(inf[3].split("%")[0].strip())
mem_now = int(str(inf[2].split("/")[0]).strip()[:-3])
mem_all = int(str(inf[2].split("/")[1]).strip()[:-3])
# print("GPU-%d : Usage:[%d%%]" % (gpu, usage))
if usage < 100*usage_max and mem_now < mem_max*mem_all:
os.environ["CUDA_VISIBLE_EVICES"] = str(gpu)
print("nAuto choosing vacant GPU-%d : Memory:[%dMiB/%dMiB] , GPU-Util:[%d%%]n" %
(gpu, mem_now, mem_all, usage))
return
print("GPU-%d is busy: Memory:[%dMiB/%dMiB] , GPU-Util:[%d%%]" %
(gpu, mem_now, mem_all, usage))
gpu += 1
print("nNo vacant GPU, use CPU insteadn")
os.environ["CUDA_VISIBLE_EVICES"] = "-1"
If I can get any GPU, it will set CUDA_VISIBLE_EVICES to BUSID of that gpu :
GPU-0 is busy: Memory:[5738MiB/11019MiB] , GPU-Util:[60%]
GPU-1 is busy: Memory:[9688MiB/11019MiB] , GPU-Util:[78%]
Auto choosing vacant GPU-2 : Memory:[1MiB/11019MiB] , GPU-Util:[0%]
else, set to -1 to use CPU:
GPU-0 is busy: Memory:[8900MiB/11019MiB] , GPU-Util:[95%]
GPU-1 is busy: Memory:[4674MiB/11019MiB] , GPU-Util:[35%]
GPU-2 is busy: Memory:[9784MiB/11016MiB] , GPU-Util:[74%]
No vacant GPU, use CPU instead
Note: Use this function before you import any ML frame that require a GPU, then it can automatically choose a gpu. Besides, it’s easy for you to set multiple tasks.
In TensorFlow Core v2.3.0, the following code should work.
import tensorflow as tf
visible_devices = tf.config.get_visible_devices()
for devices in visible_devices:
print(devices)
Depending on your environment, this code will produce flowing results.
PhysicalDevice(name=’/physical_device:CPU:0′, device_type=’CPU’)
PhysicalDevice(name=’/physical_device:GPU:0′, device_type=’GPU’)
latest version recommended by tensorflow:
tf.config.list_physical_devices('GPU')
Run the following in any shell
python -c "import tensorflow as tf; print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))"
You can use the following code fields to show device name, type, memory and locality.
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
The accepted answer gives you the device description like:
['/device:GPU:0']
If you want more details you can use tf.config.experimental.get_device_details()
import tensorflow as tf
def get_available_gpus():
physical_gpus = tf.config.list_physical_devices(device_type="GPU")
return [(x, tf.config.experimental.get_device_details(x)) for x in physical_gpus]
This will give you details on device_name and compute_capability, e.g.:
[(PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), {'device_name': 'NVIDIA T500', 'compute_capability': (7, 5)})]
I have a plan to use distributed TensorFlow, and I saw TensorFlow can use GPUs for training and testing. In a cluster environment, each machine could have 0 or 1 or more GPUs, and I want to run my TensorFlow graph into GPUs on as many machines as possible.
I found that when running tf.Session()
TensorFlow gives information about GPU in the log messages like below:
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
My question is how do I get information about current available GPU from TensorFlow? I can get loaded GPU information from the log, but I want to do it in a more sophisticated, programmatic way.
I also could restrict GPUs intentionally using the CUDA_VISIBLE_DEVICES environment variable, so I don’t want to know a way of getting GPU information from OS kernel.
In short, I want a function like tf.get_available_gpus()
that will return ['/gpu:0', '/gpu:1']
if there are two GPUs available in the machine. How can I implement this?
There is an undocumented method called device_lib.list_local_devices()
that enables you to list the devices available in the local process. (N.B. As an undocumented method, this is subject to backwards incompatible changes.) The function returns a list of DeviceAttributes
protocol buffer objects. You can extract a list of string device names for the GPU devices as follows:
from tensorflow.python.client import device_lib
def get_available_gpus():
local_device_protos = device_lib.list_local_devices()
return [x.name for x in local_device_protos if x.device_type == 'GPU']
Note that (at least up to TensorFlow 1.4), calling device_lib.list_local_devices()
will run some initialization code that, by default, will allocate all of the GPU memory on all of the devices (GitHub issue). To avoid this, first create a session with an explicitly small per_process_gpu_fraction
, or allow_growth=True
, to prevent all of the memory being allocated. See this question for more details.
You can check all device list using following code:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()
Apart from the excellent explanation by Mrry, where he suggested to use device_lib.list_local_devices()
I can show you how you can check for GPU related information from the command line.
Because currently only Nvidia’s gpus work for NN frameworks, the answer covers only them. Nvidia has a page where they document how you can use the /proc filesystem interface to obtain run-time information about the driver, any installed NVIDIA graphics cards, and the AGP status.
/proc/driver/nvidia/gpus/0..N/information
Provide information about
each of the installed NVIDIA graphics adapters (model name, IRQ, BIOS
version, Bus Type). Note that the BIOS version is only available while
X is running.
So you can run this from command line cat /proc/driver/nvidia/gpus/0/information
and see information about your first GPU. It is easy to run this from python and also you can check second, third, fourth GPU till it will fail.
Definitely Mrry’s answer is more robust and I am not sure whether my answer will work on non-linux machine, but that Nvidia’s page provide other interesting information, which not many people know about.
There is also a method in the test util.
So all that has to be done is:
tf.test.is_gpu_available()
and/or
tf.test.gpu_device_name()
Look up the Tensorflow docs for arguments.
The accepted answer gives you the number of GPUs but it also allocates all the memory on those GPUs. You can avoid this by creating a session with fixed lower memory before calling device_lib.list_local_devices() which may be unwanted for some applications.
I ended up using nvidia-smi to get the number of GPUs without allocating any memory on them.
import subprocess
n = str(subprocess.check_output(["nvidia-smi", "-L"])).count('UUID')
Since TensorFlow 2.1, you can use tf.config.list_physical_devices('GPU')
:
import tensorflow as tf
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
print("Name:", gpu.name, " Type:", gpu.device_type)
If you have two GPUs installed, it outputs this:
Name: /physical_device:GPU:0 Type: GPU
Name: /physical_device:GPU:1 Type: GPU
In TF 2.0, you must add experimental
:
gpus = tf.config.experimental.list_physical_devices('GPU')
See:
The following works in tensorflow 2:
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
print("Name:", gpu.name, " Type:", gpu.device_type)
From 2.1, you can drop experimental
:
gpus = tf.config.list_physical_devices('GPU')
https://www.tensorflow.org/api_docs/python/tf/config/list_physical_devices
Use this way and check all parts :
from __future__ import absolute_import, division, print_function, unicode_literals
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
version = tf.__version__
executing_eagerly = tf.executing_eagerly()
hub_version = hub.__version__
available = tf.config.experimental.list_physical_devices("GPU")
print("Version: ", version)
print("Eager mode: ", executing_eagerly)
print("Hub Version: ", h_version)
print("GPU is", "available" if avai else "NOT AVAILABLE")
Ensure you have the latest TensorFlow 2.x GPU installed in your GPU supporting machine,
Execute the following code in python,
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
Will get an output looks like,
2020-02-07 10:45:37.587838: I
tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful
NUMA node read from SysFS had negative value (-1), but there must be
at least one NUMA node, so returning NUMA node zero 2020-02-07
10:45:37.588896: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible
gpu devices: 0, 1, 2, 3, 4, 5, 6, 7 Num GPUs Available: 8
I got a GPU called NVIDIA GTX GeForce 1650 Ti
in my machine with tensorflow-gpu==2.2.0
Run the following two lines of code:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
Output:
Num GPUs Available: 1
I am working on TF-2.1 and torch, so I don’t want to specific this automacit choosing in any ML frame. I just use original nvidia-smi and os.environ to get a vacant gpu.
def auto_gpu_selection(usage_max=0.01, mem_max=0.05):
"""Auto set CUDA_VISIBLE_DEVICES for gpu
:param mem_max: max percentage of GPU utility
:param usage_max: max percentage of GPU memory
:return:
"""
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
log = str(subprocess.check_output("nvidia-smi", shell=True)).split(r"n")[6:-1]
gpu = 0
# Maximum of GPUS, 8 is enough for most
for i in range(8):
idx = i*3 + 2
if idx > log.__len__()-1:
break
inf = log[idx].split("|")
if inf.__len__() < 3:
break
usage = int(inf[3].split("%")[0].strip())
mem_now = int(str(inf[2].split("/")[0]).strip()[:-3])
mem_all = int(str(inf[2].split("/")[1]).strip()[:-3])
# print("GPU-%d : Usage:[%d%%]" % (gpu, usage))
if usage < 100*usage_max and mem_now < mem_max*mem_all:
os.environ["CUDA_VISIBLE_EVICES"] = str(gpu)
print("nAuto choosing vacant GPU-%d : Memory:[%dMiB/%dMiB] , GPU-Util:[%d%%]n" %
(gpu, mem_now, mem_all, usage))
return
print("GPU-%d is busy: Memory:[%dMiB/%dMiB] , GPU-Util:[%d%%]" %
(gpu, mem_now, mem_all, usage))
gpu += 1
print("nNo vacant GPU, use CPU insteadn")
os.environ["CUDA_VISIBLE_EVICES"] = "-1"
If I can get any GPU, it will set CUDA_VISIBLE_EVICES to BUSID of that gpu :
GPU-0 is busy: Memory:[5738MiB/11019MiB] , GPU-Util:[60%]
GPU-1 is busy: Memory:[9688MiB/11019MiB] , GPU-Util:[78%]
Auto choosing vacant GPU-2 : Memory:[1MiB/11019MiB] , GPU-Util:[0%]
else, set to -1 to use CPU:
GPU-0 is busy: Memory:[8900MiB/11019MiB] , GPU-Util:[95%]
GPU-1 is busy: Memory:[4674MiB/11019MiB] , GPU-Util:[35%]
GPU-2 is busy: Memory:[9784MiB/11016MiB] , GPU-Util:[74%]
No vacant GPU, use CPU instead
Note: Use this function before you import any ML frame that require a GPU, then it can automatically choose a gpu. Besides, it’s easy for you to set multiple tasks.
In TensorFlow Core v2.3.0, the following code should work.
import tensorflow as tf
visible_devices = tf.config.get_visible_devices()
for devices in visible_devices:
print(devices)
Depending on your environment, this code will produce flowing results.
PhysicalDevice(name=’/physical_device:CPU:0′, device_type=’CPU’)
PhysicalDevice(name=’/physical_device:GPU:0′, device_type=’GPU’)
latest version recommended by tensorflow:
tf.config.list_physical_devices('GPU')
Run the following in any shell
python -c "import tensorflow as tf; print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))"
You can use the following code fields to show device name, type, memory and locality.
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
The accepted answer gives you the device description like:
['/device:GPU:0']
If you want more details you can use tf.config.experimental.get_device_details()
import tensorflow as tf
def get_available_gpus():
physical_gpus = tf.config.list_physical_devices(device_type="GPU")
return [(x, tf.config.experimental.get_device_details(x)) for x in physical_gpus]
This will give you details on device_name and compute_capability, e.g.:
[(PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), {'device_name': 'NVIDIA T500', 'compute_capability': (7, 5)})]