Python convert text file to pandas dataframe with multiline text

Question:

I have a protocol dump in a plain text file, in the following format:

Frame 380: 19 bytes on wire (152 bits), 19 bytes captured (152 bits)
Bluetooth HCI H4
    [Direction: Sent (0x00)]
    HCI Packet Type: ACL Data (0x02)
0000  02 0b 00 0e 00 0a 00 01 00 05 0e 06 00 07 07 00   ................
0010  00 00 00                                          ...
Frame 381: 8 bytes on wire (64 bits), 8 bytes captured (64 bits)
Bluetooth HCI H4
    [Direction: Rcvd (0x01)]
    HCI Packet Type: HCI Event (0x04)
0000  04 13 05 01 0b 00 01 00                           ........
Frame 382: 23 bytes on wire (184 bits), 23 bytes captured (184 bits)
Bluetooth HCI H4
    [Direction: Rcvd (0x01)]
    HCI Packet Type: ACL Data (0x02)
0000  02 0b 20 12 00 0e 00 01 00 05 12 0a 00 47 00 00   .. ..........G..
0010  00 00 00 01 02 00 04                              .......

In this simplified example, the frame number (380, 381, and so on) appears on the first line of each frame. I want to convert this to a pandas DataFrame of the following form:

  FrameNumber                                   Details                                  
|---------------------------------------------------------------------------------------|
|            | Frame 380: 19 bytes on wire (152 bits), 19 bytes captured (152 bits)     |
|            | Bluetooth HCI H4                                                         |
|   380      |     [Direction: Sent (0x00)]                                             |
|            |     HCI Packet Type: ACL Data (0x02)                                     |
|            | 0000  02 0b 00 0e 00 0a 00 01 00 05 0e 06 00 07 07 00   ................ |
|            | 0010  00 00 00                                          ...              |
|---------------------------------------------------------------------------------------|
|            | Frame 381: 8 bytes on wire (64 bits), 8 bytes captured (64 bits)         |
|            | Bluetooth HCI H4                                                         |
|   381      |     [Direction: Rcvd (0x01)]                                             |
|            |     HCI Packet Type: HCI Event (0x04)                                    |
|            | 0000  04 13 05 01 0b 00 01 00                           ........         |
|---------------------------------------------------------------------------------------|
|            | Frame 382: 23 bytes on wire (184 bits), 23 bytes captured (184 bits)     |
|            | Bluetooth HCI H4                                                         |
|   382      |     [Direction: Rcvd (0x01)]                                             |
|            |     HCI Packet Type: ACL Data (0x02)                                     |
|            | 0000  02 0b 20 12 00 0e 00 01 00 05 12 0a 00 47 00 00   .. ..........G.. |
|            | 0010  00 00 00 01 02 00 04                              .......          |
+---------------------------------------------------------------------------------------+

I tried to use pandas read_csv(), but given my limited knowledge of multi-line regex selection I’m unable to solve the problem. Could anyone help me by coming up with a simple solution to this problem?

Asked By: pixelworks

Answers:

With extract and groupby:

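import pandas as pd

# Read the dump into a single "Details" column.
df = pd.read_fwf("input2.txt", header=None, names=["Details"])

# Label each frame's first line with "Frame <n>", then forward-fill to the lines below it.
df["FrameNumber"] = (df["Details"].str.extract(r"(Frame \d+)", expand=False)
                         .where(df["Details"].str.startswith(r"Frame")).ffill())

# Join the lines of each frame into one multiline string.
out = df.groupby("FrameNumber", as_index=False).agg("\n".join)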

Output:

+---------------+--------------------------------------------------------------------------+
| FrameNumber   | Details                                                                  |
|---------------+--------------------------------------------------------------------------|
| Frame 380     | Frame 380: 19 bytes on wire (152 bits), 19 bytes captured (152 bits)     |
|               | Bluetooth HCI H4                                                         |
|               | [Direction: Sent (0x00)]                                                 |
|               | HCI Packet Type: ACL Data (0x02)                                         |
|               | 0000  02 0b 00 0e 00 0a 00 01 00 05 0e 06 00 07 07 00   ................ |
|               | 0010  00 00 00                                          ...              |
| Frame 381     | Frame 381: 8 bytes on wire (64 bits), 8 bytes captured (64 bits)         |
|               | Bluetooth HCI H4                                                         |
|               | [Direction: Rcvd (0x01)]                                                 |
|               | HCI Packet Type: HCI Event (0x04)                                        |
|               | 0000  04 13 05 01 0b 00 01 00                           ........         |
| Frame 382     | Frame 382: 23 bytes on wire (184 bits), 23 bytes captured (184 bits)     |
|               | Bluetooth HCI H4                                                         |
|               | [Direction: Rcvd (0x01)]                                                 |
|               | HCI Packet Type: ACL Data (0x02)                                         |
|               | 0000  02 0b 20 12 00 0e 00 01 00 05 12 0a 00 47 00 00   .. ..........G.. |
|               | 0010  00 00 00 01 02 00 04                              .......          |
Answered By: Timeless
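
As a variation on the extract/groupby approach above, the sketch below keeps FrameNumber as an integer (380 rather than "Frame 380") and preserves the leading indentation of each line. It is only an illustration: the `sample` string is a shortened, made-up stand-in for the dump file, and it assumes the very first line of the input is a frame header.

```python
import pandas as pd

# Hypothetical, shortened sample standing in for the real dump file.
sample = """\
Frame 380: 19 bytes on wire (152 bits), 19 bytes captured (152 bits)
Bluetooth HCI H4
    [Direction: Sent (0x00)]
0000  02 0b 00
Frame 381: 8 bytes on wire (64 bits), 8 bytes captured (64 bits)
Bluetooth HCI H4
    [Direction: Rcvd (0x01)]
0000  04 13 05
"""

# One row per input line; splitlines() leaves leading whitespace untouched.
df = pd.DataFrame({"Details": sample.splitlines()})

# Capture the number only on lines that begin with "Frame <n>:"; other lines become NaN.
num = df["Details"].str.extract(r"^Frame (\d+):", expand=False)

# Forward-fill so every line carries its frame's number, then convert to int.
df["FrameNumber"] = num.ffill().astype(int)

# Collapse each frame's lines into one multiline string.
out = df.groupby("FrameNumber", as_index=False).agg({"Details": "\n".join})
```

For a real file, the `sample.splitlines()` call could be replaced by reading the file's text and splitting it the same way.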

Another solution, using the re module:

import re
import pandas as pd


all_data = []
with open("data.txt", "r") as f_in:
    for (g, n) in re.findall(
        r"^(Frame (\d+).*?)\s*(?=^Frame \d+|\Z)", f_in.read(), flags=re.M | re.S
    ):
        all_data.append({"FrameNumber": int(n), "Details": g})

df = pd.DataFrame(all_data)
print(df)

Prints:

|    |   FrameNumber | Details                                                                  |
|---:|--------------:|:-------------------------------------------------------------------------|
|  0 |           380 | Frame 380: 19 bytes on wire (152 bits), 19 bytes captured (152 bits)     |
|    |               | Bluetooth HCI H4                                                         |
|    |               |     [Direction: Sent (0x00)]                                             |
|    |               |     HCI Packet Type: ACL Data (0x02)                                     |
|    |               | 0000  02 0b 00 0e 00 0a 00 01 00 05 0e 06 00 07 07 00   ................ |
|    |               | 0010  00 00 00                                          ...              |
|  1 |           381 | Frame 381: 8 bytes on wire (64 bits), 8 bytes captured (64 bits)         |
|    |               | Bluetooth HCI H4                                                         |
|    |               |     [Direction: Rcvd (0x01)]                                             |
|    |               |     HCI Packet Type: HCI Event (0x04)                                    |
|    |               | 0000  04 13 05 01 0b 00 01 00                           ........         |
|  2 |           382 | Frame 382: 23 bytes on wire (184 bits), 23 bytes captured (184 bits)     |
|    |               | Bluetooth HCI H4                                                         |
|    |               |     [Direction: Rcvd (0x01)]                                             |
|    |               |     HCI Packet Type: ACL Data (0x02)                                     |
|    |               | 0000  02 0b 20 12 00 0e 00 01 00 05 12 0a 00 47 00 00   .. ..........G.. |
|    |               | 0010  00 00 00 01 02 00 04                              .......          |
Answered By: Andrej Kesely
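
If the multi-line regex feels opaque, the same grouping can be sketched with a plain loop over the lines. This is an illustrative alternative, not part of either answer above: `split_frames` is a hypothetical helper name, and it assumes every line starting with "Frame " is a header of the form "Frame <n>: ...".

```python
def split_frames(lines):
    """Group dump lines into one record per frame (illustrative helper)."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("Frame "):
            # Header line: "Frame 380: 19 bytes ..." yields the number 380.
            number = int(line.split(":", 1)[0].split()[1])
            records.append({"FrameNumber": number, "Details": [line]})
        elif records:
            # Continuation line: attach it to the most recent frame.
            records[-1]["Details"].append(line)
    # Join each frame's lines back into one multiline string.
    return [
        {"FrameNumber": r["FrameNumber"], "Details": "\n".join(r["Details"])}
        for r in records
    ]


# Made-up, shortened input; an open file object can be passed the same way.
sample = [
    "Frame 381: 8 bytes on wire (64 bits), 8 bytes captured (64 bits)\n",
    "Bluetooth HCI H4\n",
    "    [Direction: Rcvd (0x01)]\n",
    "Frame 382: 23 bytes on wire (184 bits), 23 bytes captured (184 bits)\n",
    "Bluetooth HCI H4\n",
]
records = split_frames(sample)
```

Passing `records` to `pd.DataFrame(records)` should give the same two columns, FrameNumber and Details, as the answers above. Lines that appear before the first frame header are skipped.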