Filtering byte stream efficiently before converting to numpy array / pandas dataframe

Question:

I’m looking for guidance on how to efficiently filter out unneeded parts of my data before converting to a numpy array and/or pandas dataframe. Data is delivered to my program as string buffers (each record separately), and I’m currently using np.frombuffer to construct an array once all records are retrieved.

The problem I’m having is that individual records can be quite long (thousands of fields), and sometimes I only want a small subset of them. Filtering out the unneeded fields, however, adds steps that significantly slow down the data import.

Without any filtering, my current process is:

import numpy as np
import pandas as pd

# assume some function here that retrieves one record at a time and appends it to 'data'

data = [b'\x00\x00\x00\x00\x00\x00\xf0?one     \x00\x00\x00\x00\x00\x00Y@',
        b'\x00\x00\x00\x00\x00\x00\x00@two     \x00\x00\x00\x00\x00\x00i@',
        b'\x00\x00\x00\x00\x00\x00\x08@three   \x00\x00\x00\x00\x00\xc0r@',
        b'\x00\x00\x00\x00\x00\x00\x10@four    \x00\x00\x00\x00\x00\x00y@']

# each 24-byte record is: double, 8-char string, double
struct_dtypes = np.dtype([('n1', 'd'), ('ch', 'S8'), ('n2', 'd')])

final_data = b''.join(data)

arr = np.frombuffer(final_data, dtype=struct_dtypes)
df = pd.DataFrame(arr)

# dataframe
    n1     ch     n2
0  1.0    one  100.0
1  2.0    two  200.0
2  3.0  three  300.0
3  4.0   four  400.0

My current solution for filtering is essentially:

final_data = b''.join(b''.join(buffer[offset: offset + 8] for offset in [0, 16]) for buffer in data)

struct_dtypes = np.dtype([('n1', 'd'), ('n2', 'd')])
arr = np.frombuffer(final_data, dtype=struct_dtypes)
df = pd.DataFrame(arr)

    n1     n2
0  1.0  100.0
1  2.0  200.0
2  3.0  300.0
3  4.0  400.0

That middle step of slicing and rejoining each record makes filtering slower than just reading everything. If I instead construct the full array first and then keep only the specified columns, isn’t that just a waste of memory? What’s an appropriate way to read only the portions of the string buffers that I want?

Update using accepted answer

struct_dtypes = np.dtype({'names': ['n1', 'ch'],
                          'formats': ['d', 'V8'],
                          'offsets': [0, 8],
                          'itemsize': 24})

final_data = b''.join(data)

arr = np.frombuffer(final_data, dtype=struct_dtypes)
Asked By: StevenS

Answers:

You can specify an offset for each field during dtype construction:

struct_dtypes = np.dtype({'names': ['n1', 'n2'], 'formats': ['d', 'd'], 'offsets': [0, 16]})

or

struct_dtypes = np.dtype({'n1': ('d', 0), 'n2': ('d', 16)})
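A quick self-contained check of this approach (the sample records are re-packed here with `struct`, since the original buffers come from elsewhere). With field offsets in the dtype, `np.frombuffer` skips the unwanted bytes itself and no per-record slicing or rejoining is needed:

```python
import struct
import numpy as np

# Pack four 24-byte records mirroring the question's layout:
# little-endian double, 8-char string, double
data = [struct.pack('<d8sd', float(i), b'%-8d' % i, float(i) * 100)
        for i in range(1, 5)]
final_data = b''.join(data)

# Read only n1 and n2 via field offsets; the itemsize is inferred
# from the last field (16 + 8 = 24), which matches the record size here
struct_dtypes = np.dtype({'names': ['n1', 'n2'],
                          'formats': ['d', 'd'],
                          'offsets': [0, 16]})
arr = np.frombuffer(final_data, dtype=struct_dtypes)
```

Note that `arr` is a read-only view over `final_data`, so no copy of the data is made at all.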

Update (see comments below):
If you don’t read the last field in the record, you also need to specify the itemsize so that successive records stay aligned:

struct_dtypes = np.dtype({'names': ['n1', 'ch'],
                          'formats': ['d', 'V8'],
                          'offsets': [0, 8],
                          'itemsize': 24})
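To illustrate why the explicit itemsize matters, here is a sketch with the same re-packed 24-byte sample records as above. Without `'itemsize': 24` the dtype would only be 16 bytes wide (the end of the last declared field), so `np.frombuffer` would misalign every record after the first:

```python
import struct
import numpy as np

# Four 24-byte records: little-endian double, 8-char string, double
data = [struct.pack('<d8sd', float(i), b'%-8d' % i, float(i) * 100)
        for i in range(1, 5)]
final_data = b''.join(data)

# Read n1 and ch, skipping the trailing double; itemsize=24 tells
# numpy how far to advance to reach the start of the next record
struct_dtypes = np.dtype({'names': ['n1', 'ch'],
                          'formats': ['d', 'V8'],
                          'offsets': [0, 8],
                          'itemsize': 24})
arr = np.frombuffer(final_data, dtype=struct_dtypes)
```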
Answered By: Stef