Split a binary file into 4-byte sequences (python)
Question:
So, basically, I have a 256MB file which contains "int32" numbers written as 4-byte sequences, and I have to sort them into another file.
The thing I struggle with the most is how to read the file into an array of 4-byte sequences. At first I thought this method was slow because it reads the elements one by one:
for i in range(numsPerFile):
    buffer = currFile.read(4)
    arr.append(buffer)
Then I made this:
buffer = currFile.read()
arr4 = []
for i in range(numsPerFile):
    arr4.append(bytes(buffer[i*4 : i*4+4]))
And it wasn't any better when I measured the time (both read 128,000 numbers in ~0.8 s on my PC). So, is there a faster method to do this?
Answers:
def bindata():
    array = []
    count = 0
    with open('file.txt', 'r') as f:
        while count < 64:
            f.seek(count)
            array.append(f.read(8))
            count = count + 8
    print(array)

bindata()
file.txt data: FFFFFFFFAAAAAAAABBBBBBBBCCCCCCCCDDDDDDDDAAAAAAAAEEEEEEEEFFFFFFFF
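A variant of the same idea for a truly binary file: opening in 'rb' and reading fixed-size chunks with iter() and a sentinel avoids the manual seek/counter bookkeeping. This is just a sketch with a made-up 'file.bin':

```python
from functools import partial

# Create a small stand-in binary file (32 bytes)
with open('file.bin', 'wb') as f:
    f.write(bytes(range(32)))

# iter() with a b'' sentinel calls f.read(4) repeatedly until EOF,
# yielding one 4-byte chunk per iteration
with open('file.bin', 'rb') as f:
    chunks = list(iter(partial(f.read, 4), b''))

print(len(chunks))  # 8
```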
Read and write tasks are often the slow part of an activity.
I have done some tests and removed the reading from file activity from the timings.
- Test0 was repeating the test you did, to calibrate it for my old slow machine.
- Test1 uses a list comprehension to build the list of four-byte chunks, the same as test0.
- Test2 builds a list of integers. As you mentioned, the data was int32.
I've turned the four-byte chunks into integers so they can be used directly in Python.
Test2 was the fastest, followed by test1, and test0 was always the slowest on my machine.
I only created a 128MB file for the test, but it should give you an idea.
This is the code I used for my testing:
import time
from pathlib import Path
import secrets
import struct

tmp_txt = Path('/tmp/test.txt')


def gen_test():
    data = secrets.token_bytes(128_000_000)
    tmp_txt.write_bytes(data)
    print(f"Generated length: {len(data):,}")


def test0(data):
    arr4 = []
    for i in range(len(data) // 4):
        arr4.append(bytes(data[i*4:i*4+4]))
    return arr4


def test1(data):
    return [data[i:i+4] for i in range(0, len(data), 4)]


def test2(data):
    return [_[0] for _ in struct.iter_unpack('>i', data)]


def run_test():
    raw_data = tmp_txt.read_bytes()
    for idx, test in enumerate([test0, test1, test2]):
        print(f"test{idx}")
        n0 = time.perf_counter()
        x0 = test(raw_data)
        n1 = time.perf_counter()
        print(f"\tTest{idx} took {n1 - n0:.2f}")
        print(f"\tdata[{len(x0):,}] = [{x0[0]}, {x0[1]} ...]")


if __name__ == '__main__':
    gen_test()
    run_test()
This gave me the following transcript:
Generated length: 128,000,000
test0
Test0 took 10.57
data[32,000,000] = [b'Z\xb0\xac]', b'\xbbh\xe0\xda' ...]
test1
Test1 took 5.01
data[32,000,000] = [b'Z\xb0\xac]', b'\xbbh\xe0\xda' ...]
test2
Test2 took 3.34
data[32,000,000] = [1521527901, -1150754598 ...]
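As a further point of comparison (not one of the tests above), Python's built-in array module can convert the whole buffer in a single call, which is typically faster still than struct.iter_unpack. Note that 'i' uses the machine's native byte order and is 4 bytes on common platforms; for big-endian file data on a little-endian machine you would call byteswap() afterwards:

```python
import array
import secrets

# A small buffer of random bytes standing in for the file contents
data = secrets.token_bytes(4_000)

# Interpret the whole buffer as signed 32-bit integers in one call
nums = array.array('i')
nums.frombytes(data)

# If the file were big-endian and the machine little-endian:
# nums.byteswap()

print(len(nums))  # 1000
```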
I think I must have misunderstood your question. I can create a test file hopefully similar to yours like this:
import numpy as np
# Make 256MB array
a = np.random.randint(0, 1000, 256*1024*1024//4, dtype=np.uint32)
# Write to disk
a.tofile('BigBoy.bin')
And now time how long it takes to read it from disk
%timeit b = np.fromfile('BigBoy.bin', dtype=np.uint32)
That gives 33 milliseconds.
33.2 ms ± 336 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Check all the same
print(np.all(a==b))
True
Or, if for some reason you don't like Numpy, you can use struct:
import struct

d = open('BigBoy.bin', 'rb').read()
fmt = f'{len(d)//4}I'
data = struct.unpack(fmt, d)
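Putting the pieces together, the sort-into-another-file task can be sketched with the standard library alone. Filenames here are placeholders, and 'i' assumes signed native-order int32 (use 'I' for unsigned):

```python
import struct

# Create a small stand-in input file of native-order int32 values
values = [5, 3, 9, 1, 7]
with open('input.bin', 'wb') as f:
    f.write(struct.pack(f'{len(values)}i', *values))

# Read, unpack, sort, repack, and write to another file
with open('input.bin', 'rb') as f:
    d = f.read()
nums = sorted(struct.unpack(f'{len(d)//4}i', d))
with open('output.bin', 'wb') as f:
    f.write(struct.pack(f'{len(nums)}i', *nums))

print(nums)  # [1, 3, 5, 7, 9]
```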