Is there no faster way to convert (BGR) OpenCV image to CMYK?

Question:

I have an OpenCV image, as usual in BGR color space, and I need to convert it to CMYK. I searched online but found basically only (slight variations of) the following approach:

import numpy

def bgr2cmyk(cv2_bgr_image):
    bgrdash = cv2_bgr_image.astype(float) / 255.0

    # Calculate K as (1 - whatever is biggest out of Rdash, Gdash, Bdash)
    K = 1 - numpy.max(bgrdash, axis=2)

    with numpy.errstate(divide="ignore", invalid="ignore"):
        # Calculate C
        C = (1 - bgrdash[..., 2] - K) / (1 - K)
        C = 255 * C
        C = C.astype(numpy.uint8)

        # Calculate M
        M = (1 - bgrdash[..., 1] - K) / (1 - K)
        M = 255 * M
        M = M.astype(numpy.uint8)

        # Calculate Y
        Y = (1 - bgrdash[..., 0] - K) / (1 - K)
        Y = 255 * Y
        Y = Y.astype(numpy.uint8)

    return (C, M, Y, K)

This works fine. However, it feels quite slow: for an 800 x 600 px image it takes about 30 ms on my i7 CPU. Typical cv2 operations like thresholding take only a few ms for the same image, so since this is all numpy, I was expecting the CMYK conversion to be faster.

However, I haven’t found anything that makes this significantly faster. There is a conversion to CMYK via PIL.Image, but the resulting channels do not look as they do with the algorithm listed above.
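
For reference, the PIL route looks roughly like this (a sketch; as far as I can tell, Pillow's built-in conversion uses the naive C = 255 - R, M = 255 - G, Y = 255 - B with K = 0, which would explain why the channels differ):

import cv2
import numpy
from PIL import Image

def bgr2cmyk_pil(cv2_bgr_image):
    # Pillow expects RGB order, so swap channels first
    rgb = cv2.cvtColor(cv2_bgr_image, cv2.COLOR_BGR2RGB)
    cmyk = Image.fromarray(rgb).convert("CMYK")
    return numpy.asarray(cmyk)  # (height, width, 4) uint8 array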

Any other ideas?

Asked By: Matthias


Answers:

I would start by profiling to find which part is the bottleneck.

For example, how fast is it without the / (1 - K) calculation? Precalculating 1 / (1 - K) might help; even precalculating 255 / (1 - K) is possible:

K = 1 - numpy.max(bgrdash, axis=2)

with numpy.errstate(divide="ignore", invalid="ignore"):
    # precompute the scale factor once instead of dividing three times
    kRez255 = 255 / (1 - K)

    # Calculate C
    C = (1 - bgrdash[..., 2] - K) * kRez255
    C = C.astype(numpy.uint8)

    # Calculate M
    M = (1 - bgrdash[..., 1] - K) * kRez255
    M = M.astype(numpy.uint8)

    # Calculate Y
    Y = (1 - bgrdash[..., 0] - K) * kRez255
    Y = Y.astype(numpy.uint8)

return (C, M, Y, K)

But only profiling can show whether it is the calculation at all that slows the conversion down.
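
For example, a minimal timing harness (the random test image is just a stand-in for real data):

import timeit
import numpy

img = numpy.random.randint(0, 256, (600, 800, 3), dtype=numpy.uint8)
per_call = timeit.timeit(lambda: bgr2cmyk(img), number=100) / 100
print(per_call * 1000, "ms")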

Answered By: MrSmith42

There are several things you should do:

  • shake the math
  • use integer math where possible
  • optimize beyond what numpy can do

Shaking the math

Given

RGB' = RGB / 255
K = 1 - max(RGB')
C = (1-K - R') / (1-K)
M = (1-K - G') / (1-K)
Y = (1-K - B') / (1-K)

You can see what can be factored out:

RGB' = RGB / 255
J = max(RGB')
K = 1 - J
C = (J - R') / J
M = (J - G') / J
Y = (J - B') / J
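
In plain numpy, the factored form could look like this (a sketch; bgr2cmyk_factored is just illustrative, and pure black pixels still produce nan, as in the original code):

import numpy

def bgr2cmyk_factored(bgr):
    bgrdash = bgr.astype(numpy.float32) / 255.0
    J = bgrdash.max(axis=2)
    K = 1.0 - J
    with numpy.errstate(divide="ignore", invalid="ignore"):
        scale = 255.0 / J  # one division instead of three
        C = ((J - bgrdash[..., 2]) * scale).astype(numpy.uint8)
        M = ((J - bgrdash[..., 1]) * scale).astype(numpy.uint8)
        Y = ((J - bgrdash[..., 0]) * scale).astype(numpy.uint8)
    return C, M, Y, (255 * K).astype(numpy.uint8)
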
Integer math

Don’t normalize to [0, 1] for these calculations. The max() can be done on integers, and so can the differences. K can be calculated entirely with integer math.

J = max(RGB)
K = 255 - J
C = 255 * (J - R) / J
M = 255 * (J - G) / J
Y = 255 * (J - B) / J
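
In vectorized numpy this might look as follows (a sketch; the uint16 cast keeps 255 * (J - R) from overflowing, and differences like J - R cannot wrap because J = max(R, G, B)):

import numpy

def bgr2cmyk_int(bgr):
    J = bgr.max(axis=2).astype(numpy.uint16)
    K = (255 - J).astype(numpy.uint8)
    with numpy.errstate(divide="ignore", invalid="ignore"):
        # integer division by zero (J == 0) yields 0 instead of nan
        C = (255 * (J - bgr[..., 2]) // J).astype(numpy.uint8)
        M = (255 * (J - bgr[..., 1]) // J).astype(numpy.uint8)
        Y = (255 * (J - bgr[..., 0]) // J).astype(numpy.uint8)
    return C, M, Y, K
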
Numba

import numba
import numpy as np

Numba will optimize that code beyond simply using numpy library routines. It will also parallelize it as indicated. Choosing the numpy error model and allowing fastmath causes division by zero not to throw an exception or warning, and also makes the math a little faster.

Both variants significantly outperform a plain python/numpy solution. Much of that is due to better use of CPU registers and caches, rather than intermediate arrays, as is usual with numpy.

First variant: ~1.9 ms

@numba.njit(parallel=True, error_model="numpy", fastmath=True)
def bgr2cmyk_v4(bgr_img):
    bgr_img = np.ascontiguousarray(bgr_img)
    (height, width) = bgr_img.shape[:2]
    CMYK = np.empty((height, width, 4), dtype=np.uint8)
    for i in numba.prange(height):
        for j in range(width):
            B, G, R = bgr_img[i, j]
            J = max(R, G, B)
            K = np.uint8(255 - J)
            # division by zero when J == 0 (pure black) is tolerated
            # by the numpy error model instead of raising
            C = np.uint8(255 * (J - R) / J)
            M = np.uint8(255 * (J - G) / J)
            Y = np.uint8(255 * (J - B) / J)
            CMYK[i,j] = (C,M,Y,K)
    return CMYK
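
For example (hypothetical file name; the first call includes JIT compilation, so measure later calls):

import cv2
img = cv2.imread("input.png")  # hypothetical test image
cmyk = bgr2cmyk_v4(img)        # (height, width, 4) uint8, channel order C, M, Y, K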

Thanks to Cris Luengo for pointing out further refactoring potential (pulling out 255 / J), leading to a second variant. It takes ~1.6 ms:

@numba.njit(parallel=True, error_model="numpy", fastmath=True)
def bgr2cmyk_v5(bgr_img):
    bgr_img = np.ascontiguousarray(bgr_img)
    (height, width) = bgr_img.shape[:2]
    CMYK = np.empty((height, width, 4), dtype=np.uint8)
    for i in numba.prange(height):
        for j in range(width):
            B, G, R = bgr_img[i, j]
            J = np.uint8(max(R, G, B))
            Jinv = np.uint16((255*256) // J) # fixed point math
            K = np.uint8(255 - J)
            C = np.uint8(((J - R) * Jinv) >> 8)
            M = np.uint8(((J - G) * Jinv) >> 8)
            Y = np.uint8(((J - B) * Jinv) >> 8)
            CMYK[i,j] = (C,M,Y,K)
    return CMYK

This fixed-point math causes floor rounding. For round-to-nearest, the expression must be ((J - R) * Jinv + 128) >> 8. That would cost a bit more time (~1.8 ms).
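
Concretely, the three channel statements in the second variant would become:

            C = np.uint8(((J - R) * Jinv + 128) >> 8)
            M = np.uint8(((J - G) * Jinv + 128) >> 8)
            Y = np.uint8(((J - B) * Jinv + 128) >> 8)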

What else?

I think that numba/LLVM didn’t apply SIMD here. Some investigation revealed that the Loop Vectorizer doesn’t like any of the instances it was asked to consider.

An OpenCL kernel might be even faster. OpenCL can run on CPUs.

Numba can also use CUDA.
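
A minimal numba.cuda sketch (an illustration, untested; the launch configuration is indicative only):

from numba import cuda

@cuda.jit
def bgr2cmyk_kernel(bgr, cmyk):
    i, j = cuda.grid(2)
    if i < bgr.shape[0] and j < bgr.shape[1]:
        B = bgr[i, j, 0]
        G = bgr[i, j, 1]
        R = bgr[i, j, 2]
        J = max(max(R, G), B)
        cmyk[i, j, 3] = 255 - J
        if J > 0:
            cmyk[i, j, 0] = 255 * (J - R) // J
            cmyk[i, j, 1] = 255 * (J - G) // J
            cmyk[i, j, 2] = 255 * (J - B) // J
        else:
            cmyk[i, j, 0] = 0
            cmyk[i, j, 1] = 0
            cmyk[i, j, 2] = 0

# indicative launch:
# threads = (16, 16)
# blocks = ((height + 15) // 16, (width + 15) // 16)
# bgr2cmyk_kernel[blocks, threads](d_bgr, d_cmyk)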

Answered By: Christoph Rackwitz