Split a binary file into 4-byte sequences (python)

Question:

So, basically, I have a 256MB file which contains "int32" numbers written as 4-byte sequences, and I have to sort them into another file.

The thing I struggle with the most is how to read the file into an array of 4-byte sequences. At first I thought this method was slow because it reads the elements one by one:

for i in range(numsPerFile):
    buffer = currFile.read(4)
    arr.append(buffer)

Then I made this:

buffer = currFile.read()
arr4 = []
for i in range(numsPerFile):
    arr4.append(bytes(buffer[i*4 : i*4+4]))

And it wasn't any better when I measured the time (both approaches read 128,000 numbers in ~0.8 s on my PC). So, is there a faster way to do this?

Asked By: Swif


Answers:

def bindata():
    chunks = []
    count = 0
    with open('file.txt', 'r') as f:
        while count < 64:
            # read() advances the file position itself, so no seek() is needed
            chunks.append(f.read(8))
            count += 8
    print(chunks)


bindata()

file.txt data: FFFFFFFFAAAAAAAABBBBBBBBCCCCCCCCDDDDDDDDAAAAAAAAEEEEEEEEFFFFFFFF
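If those 8-character chunks are meant to represent numbers, each one can be parsed with `int(chunk, 16)`. A small sketch (this assumes file.txt really holds hex text, whereas the question's file holds raw bytes):

```python
chunk = 'AAAAAAAA'       # one 8-character chunk read from file.txt
value = int(chunk, 16)   # parse the chunk as a base-16 integer
print(value)             # 2863311530, i.e. 0xAAAAAAAA
```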

Answered By: user20153225

Read and write operations are often the slowest part of a program.

I ran some tests, excluding the file-reading step from the timings.

  • Test0 repeats your test, to calibrate the timings for my old, slow
    machine.
  • Test1 uses a list comprehension to build the same list of four-byte
    chunks as test0.
  • Test2 builds a list of integers. Since you mentioned the data is
    int32, it turns each four-byte chunk into an integer to be used in Python.

Test2 was the fastest, followed by test1, and test0 was always the slowest on my machine.

I only created a 128MB file for the test, but it should give you an idea.

This is the code I used for my testing:

import time
from pathlib import Path
import secrets
import struct

tmp_txt = Path('/tmp/test.txt')


def gen_test():
    data = secrets.token_bytes(128_000_000)
    tmp_txt.write_bytes(data)
    print(f"Generated length: {len(data):,}")


def test0(data):
    arr4 = []
    for i in range(int(len(data)/4)):
        arr4.append(bytes(data[i*4:i*4+4]))
    return arr4


def test1(data):
    return [data[i:i+4] for i in range(0, len(data), 4)]


def test2(data):
    return [_[0] for _ in struct.iter_unpack('>i', data)]


def run_test():
    raw_data = tmp_txt.read_bytes()
    for idx, test in enumerate([test0, test1, test2]):
        print(f"test{idx}")
        n0 = time.perf_counter()
        x0 = test(raw_data)
        n1 = time.perf_counter()
        print(f"\tTest{idx} took {n1 - n0:.2f}")
        print(f"\tdata[{len(x0):,}] = [{x0[0]}, {x0[1]} ...]")


if __name__ == '__main__':
    gen_test()
    run_test()

This gave me the following transcript:

Generated length: 128,000,000
test0
    Test0 took 10.57
    data[32,000,000] = [b'Z\xb0\xac]', b'\xbbh\xe0\xda' ...]
test1
    Test1 took 5.01
    data[32,000,000] = [b'Z\xb0\xac]', b'\xbbh\xe0\xda' ...]
test2
    Test2 took 3.34
    data[32,000,000] = [1521527901, -1150754598 ...]
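Another option worth timing (not in the tests above, so treat it as a sketch) is the stdlib array module, which parses the whole buffer into 32-bit integers in a single C-level call:

```python
import array

raw = b'\x01\x00\x00\x00\x02\x00\x00\x00'   # small stand-in for the file's bytes

arr = array.array('i')   # machine-native signed 32-bit ints (itemsize 4 on common platforms)
arr.frombytes(raw)       # one pass over the whole buffer, no Python-level loop
# arr.byteswap()         # uncomment if the file is big-endian, matching '>i' above

print(list(arr))         # [1, 2] on a little-endian machine
```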
Answered By: ukBaz

I think I must have misunderstood your question. I can create a test file hopefully similar to yours like this:

import numpy as np
# Make a 256MB array
a = np.random.randint(0, 1000, 256*1024*1024//4, dtype=np.uint32)

# Write to disk
a.tofile('BigBoy.bin')

And now time how long it takes to read it from disk

%timeit b = np.fromfile('BigBoy.bin', dtype=np.uint32)

That gives 33 milliseconds.

33.2 ms ± 336 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Check all the same
print(np.all(a==b))
True 

Or, if for some reason you don't like Numpy, you can use the struct module:

import struct

d = open('BigBoy.bin', 'rb').read()
fmt = f'{len(d)//4}I'          # one 'I' (unsigned 32-bit int) per 4 bytes
data = struct.unpack(fmt, d)
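Since the original goal was to sort the numbers into another file, the same Numpy route can finish the whole job in two calls. A minimal sketch using a tiny stand-in file (the 'demo*.bin' names are made up):

```python
import numpy as np

# Tiny stand-in for the 256MB input file
np.array([3, 1, 2], dtype=np.uint32).tofile('demo.bin')

b = np.fromfile('demo.bin', dtype=np.uint32)
np.sort(b).tofile('demo_sorted.bin')   # sorted copy written back as raw 4-byte values

print(np.fromfile('demo_sorted.bin', dtype=np.uint32))   # [1 2 3]
```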
Answered By: Mark Setchell