Extracting data from binary file in python

Question:

I have an instrument (LumiCycle from Actimetrics) that records photon counts and writes binary (I think) files with all the data. It comes with clunky software that lets you manually read a file and export the data as .csv. I want to bypass the software with a Python script that extracts the data I need from a batch of files, but I got stuck on decoding the data file.

This is what the data looks like when exported by the software. The bit that I’m interested in is counts/sec.

Date    Time (hr:min)   Time (days) counts/sec   baseline 
03/18/2022  15:06   0.62917 96.236  104.321
03/18/2022  15:16   0.63611 100.144 104.408
03/18/2022  15:26   0.64306 103.011 104.491
03/18/2022  15:36   0.65    108.001 104.556
03/18/2022  15:46   0.65694 110.415 104.668
03/18/2022  15:56   0.66389 107.923 104.791

Here’s a dropbox link to a file containing the binary and exported csv

I contacted the manufacturer about the structure of the file and got this answer:

"Each record is an array of 3 strings.

Arrays are prefixed by a U32 integer containing the number of elements in the array.
Strings are prefixed by a U32 integer containing the length of the string in bytes.
integers are stored Big-endian.

The first string contains an array of U32 integers. The number of elements (as specified by the first U32) depends on what version of the program you are using, but the first two are the ones that are most likely relevant to you. The elements are:

 Counts
 Duration (ms)
 Counts 2 (non-zero if color recording)
 channel (1 - 32)
 Sec since midnight Jan 1 1904
 Temperature (°C) x 1000
 Color 1 (RGB) the color selected for displaying the counts
 Color 2 same
 Is Dark Counts? (0 in LC32)

Do not use the 5th element unless your data collection computer is set not to use Daylight Savings Time.

The second string is the date/time in text format.

The third string contains any comments specified during recording, so the length is usually 0."

So I read the file:

# read() with no argument returns the whole file, so no loop is needed
with open("file", "rb") as f:
    data = f.read()
print(data)

And I get something like this:

b'\x00\x00\x00H\x00\x00\x00\x03\x00\x00\x00(\x00\x00\x00\t\x00\x00\x1aI\x00\x01\x11"\x00\x00\x00\x00\x00\x00\x00\x13\xdeZP`\x00\x00\x97\xb8\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0b\x00\x00\x00\x1003/18/2022 15:06\x00\x00\x00\x00\x00\x00\x00H\x00\x00\x00\x03\x00\x00\x00(\x00\x00\x00\t\x00\x00\x1bZ\x00\x01\x11\x1f\x00\x00\x00\x00\x00\x00\x00\x13\xdeZR\xb8\x00\x00\x97|\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\n\x00\x00\x00\x1003/18/2022 15:16\x00\x00\x00\x00\x00\x00\x00H\x00\x00\x00\x03\x00\x00\x00(\x00\x00\x00\t\x00\x00\x1c\x1e\x00\x01\x10\xf4\x00\x00\x00\x00\x00\x00\x00\x13\xdeZU\x10\x00\x00\x97\x90\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0b\x00\x00\x00\x1003/18/2022 15:26\x00\x00\x00\x00\x00\x00\x00H\x00\x00\x00\x03\x00\x00\x00(\x00\x00\x00\t\x00\x00\x1d\x80\x00\x01\x11%\x00\x00\x00\x00\x00\x00\x00\x13\xdeZWh\x00\x00\x97\xae\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0b\x00\x00\x00\x1003/18/2022 15:36\x00\x00\x00\x00\x00\x00\x00H\x00\x00\x00\x03\x00\x00\x00(\x00\x00\x00\t\x00\x00\x1e#\x00\x01\x10\xf1\x00\x00\x00\x00\x00\x00\x00\x13\xdeZY\xc0\x00\x00\x97\xc2\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0b\x00\x00\x00\x1003/18/2022 15:46\x00\x00\x00\x00\x00\x00\x00H\x00\x00\x00\x03\x00\x00\x00(\x00\x00\x00\t\x00\x00\x1du\x00\x01\x10\xf2\x00\x00\x00\x00\x00\x00\x00\x13\xdeZ\\\x18\x00\x00\x97\xea\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0b\x00\x00\x00\x1003/18/2022 15:56 etc...

I’m guessing each record is actually this:

\x00\x00\x00H\x00\x00\x00\x03\x00\x00\x00(\x00\x00\x00\t\x00\x00\x1aI\x00\x01\x11"\x00\x00\x00\x00\x00\x00\x00\x13\xdeZP`\x00\x00\x97\xb8\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0b\x00\x00\x00\x1003/18/2022 15:06
\x00\x00\x00\x00\x00\x00\x00H\x00\x00\x00\x03\x00\x00\x00(\x00\x00\x00\t\x00\x00\x1bZ\x00\x01\x11\x1f\x00\x00\x00\x00\x00\x00\x00\x13\xdeZR\xb8\x00\x00\x97|\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\n\x00\x00\x00\x1003/18/2022 15:16
\x00\x00\x00\x00\x00\x00\x00H\x00\x00\x00\x03\x00\x00\x00(\x00\x00\x00\t\x00\x00\x1c\x1e\x00\x01\x10\xf4\x00\x00\x00\x00\x00\x00\x00\x13\xdeZU\x10\x00\x00\x97\x90\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0b\x00\x00\x00\x1003/18/2022 15:26
\x00\x00\x00\x00\x00\x00\x00H\x00\x00\x00\x03\x00\x00\x00(\x00\x00\x00\t\x00\x00\x1d\x80\x00\x01\x11%\x00\x00\x00\x00\x00\x00\x00\x13\xdeZWh\x00\x00\x97\xae\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0b\x00\x00\x00\x1003/18/2022 15:36
\x00\x00\x00\x00\x00\x00\x00H\x00\x00\x00\x03\x00\x00\x00(\x00\x00\x00\t\x00\x00\x1e#\x00\x01\x10\xf1\x00\x00\x00\x00\x00\x00\x00\x13\xdeZY\xc0\x00\x00\x97\xc2\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0b\x00\x00\x00\x1003/18/2022 15:46
\x00\x00\x00\x00\x00\x00\x00H\x00\x00\x00\x03\x00\x00\x00(\x00\x00\x00\t\x00\x00\x1du\x00\x01\x10\xf2\x00\x00\x00\x00\x00\x00\x00\x13\xdeZ\\\x18\x00\x00\x97\xea\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0b\x00\x00\x00\x1003/18/2022 15:56

I tried the bytes.hex() method and I get this if I run it on each record:

000000000000004800000003000000280000000900001b5a0001111f0000000000000013de5a52b80000977c00000000000000000000000a00000010
000000000000004800000003000000280000000900001c1e000110f40000000000000013de5a55100000979000000000000000000000000b00000010                                                          
000000000000004800000003000000280000000900001d80000111250000000000000013de5a5768000097ae00000000000000000000000b00000010
000000000000004800000003000000280000000900001e23000110f10000000000000013de5a59c0000097c200000000000000000000000b00000010
000000000000004800000003000000280000000900001d75000110f20000000000000013de5a5c18000097ea00000000000000000000000b00000010

But I still don’t know where the actual data I need is, or why some of the integers are interpreted as chars. Any suggestions on methods or functions I can use to find the counts/sec data in the file?

Asked By: Michal

||

Answers:

Use the struct module to unpack the data.
Your data looks like 6 records, deduced from the text timestamps. The manufacturer's description says:

"The second string is the date/time in text format."

one = b'\x00\x00\x00H\x00\x00\x00\x03\x00\x00\x00(\x00\x00\x00\t\x00\x00\x1aI\x00\x01\x11"\x00\x00\x00\x00\x00\x00\x00\x13\xdeZP`\x00\x00\x97\xb8\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0b\x00\x00\x00\x1003/18/2022 15:06\x00\x00\x00\x00\x00\x00\x00H\x00\x00\x00\x03\x00\x00\x00(\x00\x00\x00\t\x00\x00\x1bZ\x00\x01\x11\x1f\x00\x00\x00\x00\x00\x00\x00\x13\xdeZR\xb8\x00\x00\x97|\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\n\x00\x00\x00\x1003/18/2022 15:16\x00\x00\x00\x00\x00\x00\x00H\x00\x00\x00\x03\x00\x00\x00(\x00\x00\x00\t\x00\x00\x1c\x1e\x00\x01\x10\xf4\x00\x00\x00\x00\x00\x00\x00\x13\xdeZU\x10\x00\x00\x97\x90\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0b\x00\x00\x00\x1003/18/2022 15:26\x00\x00\x00\x00\x00\x00\x00H\x00\x00\x00\x03\x00\x00\x00(\x00\x00\x00\t\x00\x00\x1d\x80\x00\x01\x11%\x00\x00\x00\x00\x00\x00\x00\x13\xdeZWh\x00\x00\x97\xae\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0b\x00\x00\x00\x1003/18/2022 15:36\x00\x00\x00\x00\x00\x00\x00H\x00\x00\x00\x03\x00\x00\x00(\x00\x00\x00\t\x00\x00\x1e#\x00\x01\x10\xf1\x00\x00\x00\x00\x00\x00\x00\x13\xdeZY\xc0\x00\x00\x97\xc2\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0b\x00\x00\x00\x1003/18/2022 15:46\x00\x00\x00\x00\x00\x00\x00H\x00\x00\x00\x03\x00\x00\x00(\x00\x00\x00\t\x00\x00\x1du\x00\x01\x10\xf2\x00\x00\x00\x00\x00\x00\x00\x13\xdeZ\\\x18\x00\x00\x97\xea\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0b\x00\x00\x00\x1003/18/2022 15:56'

"Arrays are prefixed by a U32 integer containing the number of elements in the array. Strings are prefixed by a U32 integer containing the length of the string in bytes. integers are stored Big-endian."

<first_record_length><first_string_length><first_string><text_timestamp>

The text timestamp field looks like it is fixed length – 16 characters.

They are implying that the record/array lengths might be variable. You might find that they are consistently the same if you are using the same instrument and firmware every time.

>>> # U32 integer sounds like an unsigned long - four bytes
>>> import struct
>>> length = '>L'
>>> (a,) = struct.unpack(length,one[:4])
>>> print(a)
72
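
As for why some of the integers show up as characters: when Python prints a bytes object, any byte whose value is a printable ASCII character is shown as that character, and a few others get short escapes. Nothing is being decoded as text. A small sketch of this, using the first four bytes of the dump (the variable name prefix is only for illustration):

```python
import struct

prefix = b'\x00\x00\x00H'               # first four bytes of the dump

print(prefix.hex())                     # 00000048
print(int.from_bytes(prefix, 'big'))    # 72
print(struct.unpack('>L', prefix))      # (72,)

# 0x48, 0x28 and 0x09 are the bytes printed as H, ( and \t in the dump
print(bytes([0x48, 0x28, 0x09]))        # b'H(\t'
```

So the H, ( and \t in the output are just the integers 72, 40 and 9 in their low byte.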

The first array of three strings is 72 bytes long. If each of the six records is 72 bytes, that’s 432 bytes. The data is 452 bytes (len(one)) so that’s promising.

>>> first_record = one[:a]

"The third string contains any comments specified during recording, so the length is usually 0."

The fields begin at the ninth byte of the record (<first_record_length><first_string_length><first_string><timestamp>), assuming no comments and a fixed-length timestamp at the end:

>>> fields,timestamp = first_record[8:-16],first_record[-16:]
>>> timestamp
b'03/18/2022 15:06'
>>>

The fields are also unsigned longs ("The first string contains an array of U32 integers"). The manufacturer said that there are nine possible fields, so 36 bytes (9 * 4 bytes), which doesn’t quite match the length of that fields variable (48 bytes), but maybe there is some fluff.

>>> fields_fmt = '>9L'
>>> stuff = struct.unpack(fields_fmt,fields[:36])
>>> stuff
(40, 9, 6729, 69922, 0, 19, 3730460768, 38840, 0)

I cannot tell whether that is correct. Your example csv has floats for values and the data is ints, so there must be a conversion for each of the fields.
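
One conversion that appears to reproduce the exported counts/sec column is dividing Counts by Duration converted from milliseconds to seconds. This is inferred only from the handful of rows in the question, not confirmed by the manufacturer, and the function name below is made up for illustration. In the tuple above, 6729 and 69922 are the values that line up with Counts and Duration (ms); the other pairs come from the following records in the same sample data.

```python
# (counts, duration_ms) pairs read from the sample data in the question
samples = [(6729, 69922), (7002, 69919), (7198, 69876)]


def counts_per_second(counts, duration_ms):
    """Assumed conversion: raw counts divided by the duration in seconds."""
    return counts / (duration_ms / 1000)


for counts, duration_ms in samples:
    print(round(counts_per_second(counts, duration_ms), 3))
# prints 96.236, 100.144 and 103.011, matching the first three CSV rows
```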


Finding the location of each timestamp in the data can give you a clue to how long each record is.

>>> import re
>>> x = 0
>>> for m in re.finditer(rb'03/18/2022 15:\d\d', one):
...     y = m.end() - x
...     print(y)
...     x += y
... 
72
76
76
76
76
76

It turns out the format is just as the manufacturer stated, and it looks like each record is 76 bytes.

<record_length><field_length>< fields ><timestamp_length><text_timestamp><comment_length>
<  4 bytes    ><  4 bytes   ><44 bytes><    4 bytes     ><   16 bytes   ><   4 bytes    >

Each record in your data is actually 76 bytes. The data says each record is 72 bytes, but there are 4 more bytes (all zero) at the end of each. I can’t tell whether that is padding between records or the comment_length, which is zero.
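
A size check leans toward the comment length. If the leading 72 counts only the bytes that follow it, and the record body is read the way the manufacturer describes it (an array count, then three length-prefixed strings), the pieces add up to exactly 72 when the trailing zero is included as the third string's length. The labels in the comments are my reading of the description, not something the manufacturer confirmed.

```python
import struct

# Assumed layout of the 72 bytes that follow the leading length prefix
body_fmt = (
    '>'
    'L'     # number of strings in the array (3)
    'L'     # byte length of string 1 (40)
    '10L'   # string 1: element count (9) followed by 9 U32 elements
    'L'     # byte length of string 2 (16)
    '16s'   # string 2: the text timestamp
    'L'     # byte length of string 3 (0 when there is no comment)
)

print(struct.calcsize(body_fmt))        # 72
print(4 + struct.calcsize(body_fmt))    # 76, including the length prefix
```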

>>> format = '>LL11LL16sL'
>>> struct.unpack(format,one[:76])         
(72, 3, 40, 9, 6729, 69922, 0, 19, 3730460768, 38840, 0, 0, 11, 16, b'03/18/2022 15:06', 0)
>>> struct.unpack(format,one[76*1:76+76*1])
(72, 3, 40, 9, 7002, 69919, 0, 19, 3730461368, 38780, 0, 0, 10, 16, b'03/18/2022 15:16', 0)
>>> struct.unpack(format,one[76*2:76+76*2]) 
(72, 3, 40, 9, 7198, 69876, 0, 19, 3730461968, 38800, 0, 0, 11, 16, b'03/18/2022 15:26', 0)
>>> struct.unpack(format,one[76*3:76+76*3]) 
(72, 3, 40, 9, 7552, 69925, 0, 19, 3730462568, 38830, 0, 0, 11, 16, b'03/18/2022 15:36', 0)
>>> struct.unpack(format,one[76*4:76+76*4]) 
(72, 3, 40, 9, 7715, 69873, 0, 19, 3730463168, 38850, 0, 0, 11, 16, b'03/18/2022 15:46', 0)

Your example data is missing the last 4 bytes of the last record:

>>> struct.unpack('>LL11LL16s',one[76*5:])  
(72, 3, 40, 9, 7541, 69874, 0, 19, 3730463768, 38890, 0, 0, 11, 16, b'03/18/2022 15:56')
>>>

>>> record_length,field_length,*fields,ts_length,timestamp,comment_length = struct.unpack(format,one[:76])
>>> timestamp
b'03/18/2022 15:06'
>>> fields
[40, 9, 6729, 69922, 0, 19, 3730460768, 38840, 0, 0, 11]

The struct format can be simplified to:

'>14L16sL'
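
A fixed format only works while every record has nine elements and no comment. If that might change, a parser that follows the length prefixes is safer. The following is a minimal sketch under the assumptions above (the leading U32 is the byte length of the rest of the record, and the first string starts with its own element count). The names read_u32 and parse_records are made up for illustration, and the sample bytes are rebuilt from two of the tuples printed earlier rather than read from a real file.

```python
import struct


def read_u32(buf, pos):
    """Return (value, next_pos) for the big-endian U32 at buf[pos:pos + 4]."""
    (value,) = struct.unpack_from('>L', buf, pos)
    return value, pos + 4


def parse_records(data):
    """Yield (values, timestamp, comment) for each complete record in data."""
    pos = 0
    while pos + 4 <= len(data):
        body_len, pos = read_u32(data, pos)
        body = data[pos:pos + body_len]
        pos += body_len
        if len(body) < body_len:
            break                        # truncated final record, skip it
        n_strings, offset = read_u32(body, 0)
        strings = []
        for _ in range(n_strings):
            length, offset = read_u32(body, offset)
            strings.append(body[offset:offset + length])
            offset += length
        # The first string is itself a count-prefixed array of U32 values
        n_values, _ = read_u32(strings[0], 0)
        values = struct.unpack_from(f'>{n_values}L', strings[0], 4)
        # Comment encoding is unknown; latin-1 is an assumption that never fails
        yield values, strings[1].decode('ascii'), strings[2].decode('latin-1')


# Two records rebuilt from the unpacked tuples shown above
rows = [
    (72, 3, 40, 9, 6729, 69922, 0, 19, 3730460768, 38840, 0, 0, 11, 16,
     b'03/18/2022 15:06', 0),
    (72, 3, 40, 9, 7002, 69919, 0, 19, 3730461368, 38780, 0, 0, 10, 16,
     b'03/18/2022 15:16', 0),
]
sample = b''.join(struct.pack('>14L16sL', *row) for row in rows)

for values, timestamp, comment in parse_records(sample):
    counts, duration_ms = values[0], values[1]
    print(timestamp, counts, duration_ms)
```

For a real file, the bytes would come from reading the file in binary mode, as in the question, and being passed to parse_records in place of sample.
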
Answered By: wwii