Can we really do parallel computing using the Numba Python library?

Question:

I am new to CUDA and was going through Running Python script on GPU. Performance with the GPU is better than without it (without GPU: 3.525673059999974 s, with GPU: 0.07701390800002628 s) for the following code, executed in a Colab notebook:

from numba import jit, cuda
import numpy as np

# to measure exec time
from timeit import default_timer as timer

# normal function to run on cpu
def func(a):
    for i in range(10000000):
        a[i] += 1

# function optimized to run on gpu
@jit(target_backend='cuda')
def func2(a):
    for i in range(10000000):
        a[i] += 1

if __name__ == "__main__":
    n = 10000000
    a = np.ones(n, dtype=np.float64)

    start = timer()
    func(a)
    print("without GPU:", timer() - start)

    start = timer()
    func2(a)
    print("with GPU:", timer() - start)

System specs:
GPU: NVIDIA Tesla T4

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

CPU: Intel(R) Xeon(R) CPU @ 2.20GHz

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 79
model name  : Intel(R) Xeon(R) CPU @ 2.20GHz
stepping    : 0
microcode   : 0x1
cpu MHz     : 2199.998
cache size  : 56320 KB
physical id : 0
siblings    : 2
core id     : 0
cpu cores   : 1
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa mmio_stale_data retbleed
bogomips    : 4399.99
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model       : 79
model name  : Intel(R) Xeon(R) CPU @ 2.20GHz
stepping    : 0
microcode   : 0x1
cpu MHz     : 2199.998
cache size  : 56320 KB
physical id : 0
siblings    : 2
core id     : 0
cpu cores   : 1
apicid      : 1
initial apicid  : 1
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa mmio_stale_data retbleed
bogomips    : 4399.99
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

The task performed by the thread is not I/O-bound. Since the GIL (Global Interpreter Lock) makes a Python program effectively single-threaded, how can the performance with the GPU be better than without it? How is multi-threading done in this case?

Asked By: Sharvani


Answers:

You are confusing some basic concepts of computing here. The GIL makes the CPU code effectively single-threaded, but not the GPU code. The way you run GPU code is that you "launch" or "schedule" it on what is almost a separate machine, the GPU. So your code is a single CPU thread that requests computation from this device, the GPU, and the GPU launches as many threads as required to solve the problem. These are GPU threads, not CPU threads.

A way to look at it is that the CPU is sending a computing request to a different machine (the GPU) and waiting for the result.

Note this is a bit hand-wavy; there are ways to have multi-threaded CPU code and GPU code communicate with each other and work together, but it is mostly true for a simple example like this one.
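For illustration, here is a minimal sketch (not from the original question or answer) of what the same increment looks like when the kernel is written explicitly with Numba's CUDA API; the kernel name and launch configuration are illustrative choices:

from numba import cuda
import numpy as np

@cuda.jit
def add_one_kernel(a):
    i = cuda.grid(1)              # global index of this GPU thread
    if i < a.shape[0]:            # guard threads past the end of the array
        a[i] += 1

n = 10_000_000
a = np.ones(n, dtype=np.float64)

d_a = cuda.to_device(a)           # copy the array to GPU memory
threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
add_one_kernel[blocks, threads_per_block](d_a)   # launch ~10 million GPU threads
cuda.synchronize()                # the single CPU thread waits here for the GPU
a = d_a.copy_to_host()            # copy the result back

The single CPU thread only sets up the data, launches the kernel, and waits; the per-element work is carried out by GPU threads.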

Note that, as @paleonix suggests, the GIL does not apply to Numba-compiled code, because it is not interpreted but Just-in-Time (JIT) compiled.
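As a side note on the question's title: Numba can also parallelize across CPU threads, precisely because the compiled code is not bound by the GIL. A minimal sketch, assuming the standard numba.prange API (the function name is illustrative):

from numba import njit, prange
import numpy as np

@njit(parallel=True)
def add_one_parallel(a):
    # prange lets Numba split this loop across CPU threads;
    # the compiled loop runs outside the interpreter, so the GIL
    # does not serialize it
    for i in prange(a.size):
        a[i] += 1

a = np.ones(10_000_000, dtype=np.float64)
add_one_parallel(a)   # first call includes compilation time
add_one_parallel(a)   # later calls measure pure execution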

Answered By: Ander Biguri