Split a long byte array into a NumPy array of strings

Question:

Usually, when creating a NumPy array of strings, we can do something like this:

>>> import numpy as np
>>> np.array(["Hello world!", "good bye world!", "whatever world"])
array(['Hello world!', 'good bye world!', 'whatever world'], dtype='<U15')

Now the question is, I am given a long bytearray from a foreign C function like this:

b'Hello world!\x00<some rubbish bytes>good bye world!\x00<some rubbish bytes>whatever world\x00<some rubbish bytes>'

It is guaranteed that every 32-byte chunk holds one null-terminated string (i.e., a \x00 byte is appended to the valid part of the string). I need to convert this long bytearray to something like array(['Hello world!', 'good bye world!', 'whatever world'], dtype='<U15'), preferably in place (i.e., without copying memory).

This is what I do now:

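str_arr = [None] * str_count
for i in range(str_count):
    # Take the i-th 32-byte record and keep only the bytes before the first null.
    str_arr[i] = byte_arr[i * 32:(i + 1) * 32].split(b'\x00')[0].decode('utf-8')
str_arr_np = np.array(str_arr)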

It works, but it is kind of awkward and not done in-place (bytes are copied at least once, if not twice). Are there any better approaches?

Asked By: D.J. Elkind

Answers:

If you can zero out the data on the C side, then you can use np.frombuffer and it will be about as efficient as you can reasonably expect:

>>> raw = b'hello world\x00\x00\x00\x00\x00Good Bye\x00\x00\x00\x00\x00\x00\x00\x00'
>>> np.frombuffer(raw, dtype='S16')
array([b'hello world', b'Good Bye'], dtype='|S16')

Of course, this gives you an array of bytes strings rather than unicode strings, although that may be desirable in your case.
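
If unicode strings are needed after all, one option is np.char.decode. The sketch below assumes the data is valid UTF-8. This step necessarily copies: NumPy's U dtype stores four bytes per character, so a '<U15' array cannot share memory with a buffer that holds one byte per character.

```python
import numpy as np

# Same zero-padded sample as above: two 16-byte records.
raw = b'hello world\x00\x00\x00\x00\x00Good Bye\x00\x00\x00\x00\x00\x00\x00\x00'

# Zero-copy view of the buffer as fixed-width byte strings.
as_bytes = np.frombuffer(raw, dtype='S16')

# Decode element-wise into a new unicode array (this allocates fresh memory).
as_text = np.char.decode(as_bytes, 'utf-8')
print(as_text.tolist())  # ['hello world', 'Good Bye']
```

So the frombuffer step can stay copy-free, but the conversion to unicode cannot.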

Note that the above relies on NumPy's built-in behavior of stripping trailing null bytes. If there is garbage after the terminator, it won't work:

>>> data = b'hello world\x00aaaaGood Bye\x00\x00\x00\x00\x00\x00\x00\x00'
>>> np.frombuffer(data, dtype='S16')
array([b'hello world\x00aaaa', b'Good Bye'], dtype='|S16')
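
If the C side cannot be changed, one possible workaround is to zero out everything after the first null in each record before viewing the data as fixed-width strings. The following is only a sketch: the helper name clean_records is hypothetical, it uses the 32-byte record width from the question, it assumes every record really contains a terminator, and it makes one copy of the data.

```python
import numpy as np

RECORD = 32  # record width taken from the question


def clean_records(buf, record=RECORD):
    """Hypothetical helper: return an 'S<record>' array with garbage removed."""
    # View the buffer as an (n, record) matrix of bytes; no copy so far.
    raw = np.frombuffer(buf, dtype=np.uint8).reshape(-1, record)
    # Index of the first null byte in each row.
    # Caveat: argmax yields 0 for a row with no null, giving an empty string.
    first_nul = (raw == 0).argmax(axis=1)
    # True for the bytes that come before the terminator in each row.
    keep = np.arange(record) < first_nul[:, None]
    # Zero everything else; np.where allocates a new array (the one copy).
    cleaned = np.where(keep, raw, np.uint8(0))
    # Reinterpret each row as one fixed-width byte string.
    return cleaned.view(f'S{record}').ravel()


# Three 32-byte records with rubbish after each terminator.
buf = (b'Hello world!\x00' + b'\xff' * 19
       + b'good bye world!\x00' + b'junk' * 4
       + b'whatever world\x00' + b'\x07' * 17)
print(clean_records(buf).tolist())
# [b'Hello world!', b'good bye world!', b'whatever world']
```

With a writable buffer such as a bytearray, the same mask could instead be used to zero the garbage in place (for example raw[~keep] = 0), which would avoid the copy at the cost of modifying the original data.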

Note that this shouldn't make a copy. Notice that an array built on an immutable bytes object is read-only:

>>> arr = np.frombuffer(raw, dtype='S16')
>>> arr
array([b'hello world', b'Good Bye'], dtype='|S16')
>>> arr[0] = b"z"*16
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: assignment destination is read-only

However, the destination need not be read-only. If you start with a bytearray, say, writes to the array go straight through to the underlying buffer:

>>> raw = bytearray(raw)
>>> arr = np.frombuffer(raw, dtype='S16')
>>> arr[0] = b"z"*16
>>> arr
array([b'zzzzzzzzzzzzzzzz', b'Good Bye'], dtype='|S16')
>>> raw
bytearray(b'zzzzzzzzzzzzzzzzGood Bye\x00\x00\x00\x00\x00\x00\x00\x00')
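
A more direct way to check the no-copy claim is to inspect the array flags. This is a small sketch; the variable names payload, ro, and rw are chosen here for illustration.

```python
import numpy as np

payload = b'hello world\x00\x00\x00\x00\x00Good Bye\x00\x00\x00\x00\x00\x00\x00\x00'

ro = np.frombuffer(payload, dtype='S16')             # backed by immutable bytes
rw = np.frombuffer(bytearray(payload), dtype='S16')  # backed by a mutable bytearray

# Neither array owns its data: both are views onto the original buffers.
print(ro.flags.owndata, rw.flags.owndata)      # False False
# Writability follows the underlying buffer.
print(ro.flags.writeable, rw.flags.writeable)  # False True
```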
Answered By: juanpa.arrivillaga