Faster bit-level data packing

Question:

A 256*64 pixel OLED display connected to a Raspberry Pi (Zero W) expects 4-bit greyscale pixel data packed two pixels per byte, so 8192 bytes in total. E.g. the bytes

0a 0b 0c 0d (only lower nibble has data)

become

ab cd
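
For a single pair of input bytes the packing is just a shift and an OR; a minimal sketch using the values from the example above:

hi, lo = 0x0a, 0x0b                        # data sits in the lower nibbles
packed = (hi & 0x0F) << 4 | (lo & 0x0F)
assert packed == 0xab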

Converting these bytes, obtained either from a Pillow (PIL) Image or from a cairo ImageSurface, takes up to 0.9 s when naively iterating over the pixel data, depending on the color depth.

Combining every two bytes from a Pillow “L” (monochrome 8 bit) Image:

imd = im.tobytes()
nibbles = [int(p / 16) for p in imd]   # scale 0..255 down to 0..15
packed = []
msn = None                             # most significant nibble, pending
for n in nibbles:
    nib = n & 0x0F
    if msn is not None:                # second nibble of a pair: combine and emit
        b = msn << 4 | nib
        packed.append(b)
        msn = None
    else:                              # first nibble of a pair: remember it
        msn = nib

Omitting the state handling and the float/integer conversion brings this down to about half (0.2 s):

packed = []
for b in range(0, 256*64, 2):
    packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )

Basically the first approach applied to an RGB24 (32-bit!) cairo ImageSurface, though with a crude greyscale conversion:

mv = surface.get_data()
w = surface.get_width()
h = surface.get_height()
f = surface.get_format()
s = surface.get_stride()
print(len(mv), w, h, f, s)

# convert xRGB: average the three colour channels and keep the high nibble
# (assumes the unused byte comes first in each 4-byte pixel)
o = []
msn = None
for p in range(0, len(mv), 4):
    nib = int( (mv[p+1] + mv[p+2] + mv[p+3]) / 3 / 16) & 0x0F
    if msn is not None:
        b = msn << 4 | nib
        o.append(b)
        msn = None
    else:
        msn = nib

takes about twice as long (0.9 s vs 0.4 s).
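
A possible shortcut, not verified here, would be to let Pillow do the greyscale conversion and then reuse the "L" loop from above; this assumes the surface bytes are in BGRA/BGRx order (little-endian FORMAT_RGB24) and that the stride matches:

from PIL import Image

# hypothetical: wrap the cairo buffer as a Pillow image, convert it to "L",
# then feed the result to the byte-pair packing loop
im = Image.frombuffer("RGBA", (w, h), bytes(mv), "raw", "BGRA", s, 1)
imd = im.convert("L").tobytes()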

The struct module does not support nibbles (half-bytes).
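
Its smallest format unit is a whole byte ('B'), so the two nibbles still have to be merged by hand before packing, for example:

import struct

# there is no 4-bit format code; combine the nibbles first, then pack the byte
struct.pack('B', (0x0a & 0x0F) << 4 | (0x0b & 0x0F))   # b'\xab'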

bitstring does allow packing nibbles:

>>> a = bitstring.BitStream()
>>> a.insert('0xf')
>>> a.insert('0x1')
>>> a
BitStream('0xf1')
>>> a.insert(5)
>>> a
BitStream('0b1111000100000')
>>> a.insert('0x2')
>>> a
BitStream('0b11110001000000010')
>>>

But there does not seem to be a method to unpack this into a list of integers quickly; the following takes 30 seconds:

a = bitstring.BitStream()
for p in imd:
    a.append( bitstring.Bits(uint=p//16, length=4) )

packed=[]
a.pos=0
for p in range(256*64//2):
    packed.append( a.read(8).uint )

Does Python 3 have the means to do this efficiently, or do I need an alternative?
An external packer wrapped with ctypes? The same, but simpler, with Cython (I have not yet looked into these)? Cython looks very good, see my answer.
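
For illustration, the ctypes route might look roughly like the sketch below; libpack.so and its pack() signature are made up here and the code is untested:

import ctypes

# hypothetical: assumes a shared library libpack.so exposing
#   void pack(const unsigned char *in, unsigned char *out, size_t n);
lib = ctypes.CDLL("./libpack.so")
out = ctypes.create_string_buffer(len(imd) // 2)
lib.pack(imd, out, len(imd))   # bytes objects are passed as char pointers
packed = out.raw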

Asked By: handle


Answers:

Down to 130 ms from 200 ms just by wrapping the loop in a function (largely because local names inside a function are looked up faster than module-level globals)

def packer0(imd):
    """same loop in a def"""
    packed = []
    for b in range(0, 256*64, 2):
        packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
    return packed

Down to 35 ms by Cythonizing the same code

def packer1(imd):
    """Cythonize python nibble packing loop"""
    packed = []
    for b in range(0, 256*64, 2):
        packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
    return packed

Down to 16 ms by typing the loop variable

def packer2(imd):
    """Cythonize python nibble packing loop, typed"""
    packed = []
    cdef unsigned int b
    for b in range(0, 256*64, 2):
        packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
    return packed

Not much of a difference with a "simplified" loop

def packer3(imd):
    """Cythonize python nibble packing loop, typed"""
    packed = []
    cdef unsigned int i
    for i in range(256*64//2):
        packed.append( (imd[i*2]//16)<<4 | (imd[i*2+1]//16) )
    return packed

Maybe even a tiny bit faster (15 ms)

def packer4(it):
    """Cythonize python nibble packing loop, typed"""
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    return [ (it[i*2]//16)<<4 | it[i*2+1]//16 for i in range(n) ]
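
For completeness, a minimal build recipe, assuming the functions above live in a file called pack.pyx (the file name is just an example):

# setup.py
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("pack.pyx"))

Built in place with python3 setup.py build_ext --inplace, this produces the pack module imported below.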

Here are the timings with timeit, first inside the interpreter (the total for 100 calls, i.e. about 13 ms per call) and then via the command-line interface:

>>> timeit.timeit('packer4(data)', setup='from pack import packer4; data = [0]*256*64', number=100)
1.31725951000044
>>> exit()
pi@raspberrypi:~ $ python3 -m timeit -s 'from pack import packer4; data = [0]*256*64' 'packer4(data)'
100 loops, best of 3: 9.04 msec per loop

This already meets my requirements, but I guess there may be further optimization possible with the input/output iterables (-> an unsigned int array?) or by accessing the input data with a wider data type (Raspbian is 32-bit, the BCM2835 is a single-core ARM1176JZF-S).

Or with parallelism on the GPU or the multi-core Raspberry Pis.
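
On the iterables point, one untested direction would be to fill a preallocated bytearray instead of building a Python list; a sketch (packer5 is a made-up name, and the loop should accept the same Cython typing treatment as the ones above):

def packer5(it):
    """fill a preallocated bytearray instead of building a list (not benchmarked)"""
    n = len(it) // 2
    out = bytearray(n)
    for i in range(n):
        out[i] = (it[i*2] & 0xF0) | (it[i*2+1] >> 4)
    return bytes(out)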


A crude comparison with the same loop in C (ideone):

#include <stdio.h>
#include <stdint.h>
#define SIZE (256*64)
int main(void) {
  uint8_t in[SIZE] = {0};
  uint8_t out[SIZE/2] = {0};
  uint8_t t;
  for(t=0; t<100; t++){  /* pack the frame 100 times to get a measurable run time */
    uint16_t i;
    for(i=0; i<SIZE/2; i++){
        out[i] = (in[i*2]/16)<<4 | in[i*2+1]/16;
    }
  }
  return 0;
}

It’s apparently about 100 times faster (0.085 s for 100 iterations, i.e. under a millisecond per frame):

pi@raspberry:~ $ gcc p.c
pi@raspberry:~ $ time ./a.out

real    0m0.085s
user    0m0.060s
sys     0m0.010s

Replacing the divisions (and the index multiplications) with shifts and masks may be another slight optimization (I have not checked the resulting C, nor the binary):

def packs(bytes it):
    """Cythonize python nibble packing loop, typed"""
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    return [ ( (it[i<<1]&0xF0) | (it[(i<<1)+1]>>4) ) for i in range(n) ]

results in

python3 -m timeit -s 'from pack import pack; data = bytes([0]*256*64)' 'pack(data)'
100 loops, best of 3: 12.7 msec per loop
python3 -m timeit -s 'from pack import packs; data = bytes([0]*256*64)' 'packs(data)'
100 loops, best of 3: 12 msec per loop
python3 -m timeit -s 'from pack import packs; data = bytes([0]*256*64)' 'packs(data)'
100 loops, best of 3: 11 msec per loop
python3 -m timeit -s 'from pack import pack; data = bytes([0]*256*64)' 'pack(data)'
100 loops, best of 3: 13.9 msec per loop
Answered By: handle