Extract event pairs from multiline text

Question:

I would like to extract event pairs, where the start and end of each event are marked by + and -. The pairs may not match, however: a start can occur twice in a row before a single end event follows.

In the example below, event B starts twice, so I would like the output to contain a mismatched pair with a missing value (NA) where the end event was not found.

import re
import pandas as pd

data = """
00:00:00 +running A
dummy data
00:00:01 -running
00:00:02 +running B
dummy data
00:00:03 +running B
00:00:04 -running
00:00:05 +running C
dummy data
00:00:06 -running
00:00:07 +running D
10:00:08 -running


"""
m = re.findall(r"(\d+:\d+:\d+) \+running (\w+).*?(\d+:\d+:\d+) -running",data,re.DOTALL)
print(len(m))
df = pd.DataFrame(m,columns=['ts1','name','ts2'])
print(df)

Current output:

        ts1 name       ts2
0  00:00:00    A  00:00:01
1  00:00:02    B  00:00:04
2  00:00:05    C  00:00:06
3  00:00:07    D  10:00:08

Expected:

        ts1 name       ts2
0  00:00:00    A  00:00:01
1  00:00:02    B  NA
2  00:00:03    B  00:00:04
3  00:00:05    C  00:00:06
4  00:00:07    D  10:00:08

What’s the proper way to get this result in Python? I don't mind whether findall is used or not.

Asked By: lucky1928


Answers:

Try:

import re

import pandas as pd

data = """

00:00:00 +running A
dummy data
00:00:01 -running
00:00:02 +running B
dummy data
00:00:03 +running B
00:00:04 -running
00:00:05 +running C
dummy data
00:00:06 -running
00:00:07 +running D
10:00:08 -running



"""


def get_columns(data):
    stack = []
    for time, val in data:
        if val.startswith("+"):
            stack.append((time, val.split()[-1]))
        elif val.startswith("-") and stack:
            t, v = stack.pop()
            yield t, v, time

    for time, val in stack:
        yield time, val, None


all_data = []
for line in map(str.strip, data.splitlines()):
    if not re.match(r"\d+:\d+:\d+", line):
        continue
    all_data.append(line.split(maxsplit=1))

df = pd.DataFrame(get_columns(all_data), columns=["ts1", "name", "ts2"]).sort_values(
    "ts1"
)
print(df)

Prints:

        ts1 name       ts2
0  00:00:00    A  00:00:01
4  00:00:02    B      None
1  00:00:03    B  00:00:04
2  00:00:05    C  00:00:06
3  00:00:07    D  10:00:08

EDIT: Added re check that line must start with time pattern.
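
If events never nest (a new +running always means the previous start lost its end), a single pending start is enough, and the rows come out in input order without the sort_values call. A minimal sketch under that assumption; pair_events and LINE_RE are hypothetical names introduced here, not part of the code above:

```python
import re

# Matches "HH:MM:SS +running NAME" or "HH:MM:SS -running"; other lines are skipped.
LINE_RE = re.compile(r"(\d+:\d+:\d+) ([+-])running(?: (\w+))?")


def pair_events(text):
    """Yield (start_ts, name, end_ts); end_ts is None when the end is missing."""
    pending = None  # (start_ts, name) of the start still waiting for its end
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ts, sign, name = m.groups()
        if sign == "+":
            if pending is not None:
                # A second start arrived first: report the earlier one as unmatched.
                yield (*pending, None)
            pending = (ts, name)
        elif pending is not None:
            yield (*pending, ts)
            pending = None
        # An end with no pending start is silently ignored (assumption).
    if pending is not None:
        yield (*pending, None)


sample = """
00:00:02 +running B
dummy data
00:00:03 +running B
00:00:04 -running
"""
rows = list(pair_events(sample))
print(rows)
# [('00:00:02', 'B', None), ('00:00:03', 'B', '00:00:04')]
```

The generator can be passed straight to pd.DataFrame(pair_events(data), columns=["ts1", "name", "ts2"]).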

Answered By: Andrej Kesely

You can just modify your regex to have an optional trailing part:

import re
import pandas as pd

data = """
00:00:00 +running A
dummy data
00:00:01 -running
00:00:02 +running B
dummy data
00:00:03 +running B
00:00:04 -running
00:00:05 +running C
dummy data
00:00:06 -running
00:00:07 +running D
10:00:08 -running
"""

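m = re.findall(r"(\d+:\d+:\d+) \+running (\w+)(?:\n(\d+:\d+:\d+) -running)?",
               data, re.DOTALL)
df = pd.DataFrame(m, columns=['ts1','name','ts2']).replace('', None)

NB. .*? is replaced by \n, so an end event is only paired when it is on the line directly after its start; lines in between (such as "dummy data") are dealt with under "Handling dummy data" below.

Output (when no other lines sit between a start and its end):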

        ts1 name       ts2
0  00:00:00    A  00:00:01
1  00:00:02    B      None
2  00:00:03    B  00:00:04
3  00:00:05    C  00:00:06
4  00:00:07    D  10:00:08
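
To see why the optional group helps, the two patterns can be compared on a short input. This is only an illustrative sketch; the sample string and the variable names lazy and optional are assumptions made for the comparison:

```python
import re

# Input without filler lines: an end line, when present, directly follows its start.
sample = (
    "00:00:02 +running B\n"
    "00:00:03 +running B\n"
    "00:00:04 -running\n"
)

lazy = r"(\d+:\d+:\d+) \+running (\w+).*?(\d+:\d+:\d+) -running"
optional = r"(\d+:\d+:\d+) \+running (\w+)(?:\n(\d+:\d+:\d+) -running)?"

# With re.DOTALL, .*? runs past the second start line to reach the first end.
lazy_rows = re.findall(lazy, sample, re.DOTALL)
# The optional group only accepts an end on the very next line.
optional_rows = re.findall(optional, sample)

print(lazy_rows)      # [('00:00:02', 'B', '00:00:04')]
print(optional_rows)  # [('00:00:02', 'B', ''), ('00:00:03', 'B', '00:00:04')]
```

The empty string in the unmatched row is what the .replace('', None) call turns into a missing value.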


Handling dummy data

If arbitrary rows of other data can appear between events, you can filter out every line that does not match the expected pattern:

data = """
00:00:00 +running A
dummy data
00:00:01 -running
00:00:02 +running B
dummy data
00:00:03 +running B
00:00:04 -running
00:00:05 +running C
dummy data
00:00:06 -running
00:00:07 +running D
10:00:08 -running
"""

m = re.findall(r"(\d+:\d+:\d+) \+running (\w+)(?:\s*(\d+:\d+:\d+) -running)?",
               '\n'.join(x for x in data.splitlines() if
                         re.match(r'\d+:\d+:\d+ [-+]running', x)),
               re.DOTALL)
df = pd.DataFrame(m,columns=['ts1','name','ts2']).replace('', None)

Or using a negative lookahead to handle the dummy part:

m = re.findall(r"(\d+:\d+:\d+) \+running (\w+)(?:(?:.(?!\d+:\d+:\d+))*\n(\d+:\d+:\d+) -running)?",
               data, re.DOTALL)
df = pd.DataFrame(m,columns=['ts1','name','ts2']).replace('', None)


Output:

        ts1 name       ts2
0  00:00:00    A  00:00:01
1  00:00:02    B      None
2  00:00:03    B  00:00:04
3  00:00:05    C  00:00:06
4  00:00:07    D  10:00:08
Answered By: mozway