Pandas group events close together by date, then test if other values are equal

Question:

Based on the date of disease and an address, I am looking for disease outbreaks that occur at the same location within a specified timeframe of each other. The dataframe is large – 300K rows.

There is a great solution by jezrael that matches dates within a specified number of days before or after the date in each row (though I’m not sure it can process 300K rows):

import pandas as pd

df = pd.DataFrame(
    [
        ['2020-01-01 10:00', '1', 'A'],
        ['2020-01-01 10:01', '2', 'A'],
        ['2020-01-01 10:02', '3a', 'A'],
        ['2020-01-01 10:02', '3b', 'B'],
        ['2020-01-01 10:30', '4', 'B'],
        ['2020-01-01 10:50', '5', 'B'],
        ['2020-01-01 10:54', '6', 'B'],
        ['2020-01-01 10:55', '7', 'B'],
    ], columns=['event_time', 'event_id', 'Address']
)

# solution matching dates within range of date in row by jezrael
df['event_time'] = pd.to_datetime(df['event_time'])

td = pd.Timedelta("1m")
f = lambda x, y: df.loc[df['event_time'].between(y - td, y + td),
                        'event_id'].drop(x).tolist()
df['related_event_id_list'] = [f(k, v) for k, v in df['event_time'].items()]
print (df)
           event_time event_id related_event_id_list  Address
0 2020-01-01 10:00:00        1                   [2]     A
1 2020-01-01 10:01:00        2           [1, 3a, 3b]     A
2 2020-01-01 10:02:00       3a               [2, 3b]     A
3 2020-01-01 10:02:00       3b               [2, 3a]     B
4 2020-01-01 10:30:00        4                    []     B
5 2020-01-01 10:50:00        5                    []     B
6 2020-01-01 10:54:00        6                   [7]     B
7 2020-01-01 10:55:00        7                   [6]     B

I’ve tried unsuccessfully to include the address in the original comparison. I’m not sure whether I should compare the Address of every event in related_event_id_list against the row’s own Address, or whether it would be better to match the addresses first (reducing the number of rows each comparison has to scan) and then adapt the jezrael solution to that output.

The output should allow me to count events with start date, end date, and address. Adapting the jezrael solution, as a start, it would be:

           event_time event_id related_event_id_list  Address
0 2020-01-01 10:00:00        1                   [2]     A
1 2020-01-01 10:01:00        2               [1, 3a]     A
2 2020-01-01 10:02:00       3a                   [2]     A
3 2020-01-01 10:02:00       3b                    []     B
4 2020-01-01 10:30:00        4                    []     B
5 2020-01-01 10:50:00        5                    []     B
6 2020-01-01 10:54:00        6                   [7]     B
7 2020-01-01 10:55:00        7                   [6]     B

But because the first three rows (and last two rows) represent a continuous outbreak, the solution would really be more like:

     event_time_start       event_time_end events_and_related_event_id_list Address
0 2020-01-01 10:00:00  2020-01-01 10:02:00                       [1, 2, 3a]       A
6 2020-01-01 10:54:00  2020-01-01 10:55:00                           [6, 7]       B

Asked By: DrWhat

Answers:

You can apply this solution per group, so that events are only compared within the same Address:

# solution matching dates within range of date in row by jezrael
df['event_time'] = pd.to_datetime(df['event_time'])

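def add_related(g):
    # g holds the rows of a single Address, so matches never cross addresses
    td = pd.Timedelta("1m")
    related = lambda x, y: g.loc[g['event_time'].between(y - td, y + td),
                                 'event_id'].drop(x).tolist()
    g['related_event_id_list'] = [related(k, v) for k, v in g['event_time'].items()]
    return g

# group_keys=False keeps the original row index instead of adding Address as an index level
df = df.groupby('Address', group_keys=False).apply(add_related)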
print (df)
           event_time event_id Address related_event_id_list
0 2020-01-01 10:00:00        1       A                   [2]
1 2020-01-01 10:01:00        2       A               [1, 3a]
2 2020-01-01 10:02:00       3a       A                   [2]
3 2020-01-01 10:02:00       3b       B                    []
4 2020-01-01 10:30:00        4       B                    []
5 2020-01-01 10:50:00        5       B                    []
6 2020-01-01 10:54:00        6       B                   [7]
7 2020-01-01 10:55:00        7       B                   [6]

For the next step, use GroupBy.agg on groups formed by runs of consecutive rows whose related_event_id_list is non-empty:

m = df['related_event_id_list'].astype(bool)

f1 = lambda x: list(dict.fromkeys([z for y in x for z in y]))

df = (df[m].groupby([(~m).cumsum(),'Address'])
           .agg(event_time_start=('event_time','min'),
                event_time_end=('event_time','max'),
                events_and_related_event_id_list=('related_event_id_list',f1))
           .droplevel(0)
           .reset_index())
print (df)
  Address    event_time_start      event_time_end events_and_related_event_id_list
0       A 2020-01-01 10:00:00 2020-01-01 10:02:00                       [2, 1, 3a]
1       B 2020-01-01 10:54:00 2020-01-01 10:55:00                           [7, 6]
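
To see why (~m).cumsum() works as a grouping key, here is a minimal sketch on a made-up mask (the Series below is hypothetical, chosen to mirror which rows of the sample data have a non-empty list). Every False row bumps the counter, so each run of consecutive True rows ends up sharing one label:

```python
import pandas as pd

# Hypothetical mask: True where a row has at least one related event.
m = pd.Series([True, True, True, False, False, False, True, True])

# Each False increments the running count; consecutive True rows share a label.
run_id = (~m).cumsum()

# Keep only the True rows and collect the row labels of each run.
runs = [g.index.tolist() for _, g in m[m].groupby(run_id[m])]

print(run_id.tolist())
print(runs)
```

One thing to watch, as far as I can tell: two separate outbreaks at the same address that sit on adjacent rows, with no empty-list row between them, would get the same label and be merged into one.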
Answered By: jezrael

You can use NumPy broadcasting to find, for each address, the events that have at least one other event within the time window:

import numpy as np

def find_related_event(df):
    evt = df['event_time'].values
    # pairwise time gaps within one address, via broadcasting (n x n matrix)
    out = np.abs(evt[:, None] - evt) <= pd.Timedelta('1m')
    # an event is not its own neighbour
    out[np.diag_indices(out.shape[0])] = False
    # keep events with at least one neighbour; assumes rows are sorted by time
    df1 = df.loc[out.any(axis=1)]
    return pd.Series({'index': df1.index[0],
        'event_time_start': df1['event_time'].iloc[0],
        'event_time_stop': df1['event_time'].iloc[-1],
        'events_and_related_event_id_list': df1['event_id'].tolist()
    })

out = (df.groupby('Address', as_index=False).apply(find_related_event)
         .set_index('index').rename_axis(None))

Output:

>>> out
  Address    event_time_start     event_time_stop events_and_related_event_id_list
0       A 2020-01-01 10:00:00 2020-01-01 10:02:00                       [1, 2, 3a]
6       B 2020-01-01 10:54:00 2020-01-01 10:55:00                           [6, 7]
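
The broadcasting step can be looked at in isolation. This is only an illustrative sketch: the minutes array is made up, and plain integers stand in for the timestamps. Subtracting an (n,) array from its (n, 1) view gives an (n, n) matrix of pairwise gaps:

```python
import numpy as np

# Hypothetical event times, in minutes past the hour.
minutes = np.array([0, 1, 2, 30])

# (n, 1) minus (n,) broadcasts to an (n, n) matrix of pairwise gaps.
gaps = np.abs(minutes[:, None] - minutes)
close = gaps <= 1

# An event is always within range of itself, so clear the diagonal.
np.fill_diagonal(close, False)

# True for events that have at least one other event within 1 minute.
has_neighbour = close.any(axis=1)
print(has_neighbour)
```

Note that the matrix holds n × n entries per address, so memory grows quadratically with the number of events at a single address. That is worth keeping in mind with 300K rows if a few addresses account for most of them.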

Alternative

def find_related_event(evt):
    out = np.abs(evt.values[:, None] - evt.values) <= pd.Timedelta('1m')
    out[np.diag_indices(out.shape[0])] = False
    return out.any(axis=1)

m = df.groupby('Address')['event_time'].transform(find_related_event)
out = df.loc[m].groupby('Address', as_index=False).agg(
            event_time_start=('event_time', 'first'),
            event_time_stop=('event_time', 'last'),
            events_and_related_event_id_list=('event_id', list)
      )
Answered By: Corralien
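
A further sketch, not taken from either answer: since the question asks about 300K rows and about continuous outbreaks, one option is to sort by address and time and start a new cluster whenever the gap to the previous event at the same address exceeds the window. This assumes an outbreak means a chain of events where each one is within td of the previous one. The names s, gap and cluster are just illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['2020-01-01 10:00', '1', 'A'],
        ['2020-01-01 10:01', '2', 'A'],
        ['2020-01-01 10:02', '3a', 'A'],
        ['2020-01-01 10:02', '3b', 'B'],
        ['2020-01-01 10:30', '4', 'B'],
        ['2020-01-01 10:50', '5', 'B'],
        ['2020-01-01 10:54', '6', 'B'],
        ['2020-01-01 10:55', '7', 'B'],
    ], columns=['event_time', 'event_id', 'Address']
)
df['event_time'] = pd.to_datetime(df['event_time'])

td = pd.Timedelta(minutes=1)

# Sort so each address's events are contiguous and in time order.
s = df.sort_values(['Address', 'event_time'])

# Gap to the previous event at the same address (NaT for the first one).
gap = s.groupby('Address')['event_time'].diff()

# A new cluster starts at the first event of an address or after a gap > td.
cluster = (gap.isna() | (gap > td)).cumsum().rename('cluster')

out = (s.groupby(cluster)
        .agg(Address=('Address', 'first'),
             event_time_start=('event_time', 'min'),
             event_time_end=('event_time', 'max'),
             events_and_related_event_id_list=('event_id', list)))

# Keep only clusters that contain more than one event.
keep = out['events_and_related_event_id_list'].map(len) > 1
out = out[keep].reset_index(drop=True)
print(out)
```

On the sample data this gives one row for events 1, 2, 3a at address A and one for events 6, 7 at address B. It avoids the per-row lookups and the pairwise matrix, so it should cope with large frames more comfortably, though I have not timed it on 300K rows.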