Pandas group check if timestamp for certain value is earlier than timestamp for other certain value

Question:

I’m looking at a dataframe full of orders, which have timestamps, order is and an operation status for inserting/updating/canceling the order. The dataframe looks as follows:

Index TIMESTAMP ORDER_ID OPERATION
0 2022-09-01 13:49:00.000 123 INSERT
1 2022-09-01 13:49:25.000 123 UPDATE
2 2022-09-01 13:49:37.000 123 CANCEL
3 2022-09-01 13:49:48.000 456 CANCEL
4 2022-09-01 13:49:49.000 456 UPDATE
5 2022-09-01 13:49:50.000 456 INSERT

As part of my data analysis, I would like to clean up the data to filter out "flawed" orders.
I was able to filter out operations that are missing an insert/update/cancel using df.groupby('ORDER_ID').filter(lambda x: (x.OPERATION == "CANCEL").any()) (repeating the code with the different operations, not efficient).

Next, I would like to filter out those orders in which the "ordering" (i.e., time when they occur) of the operations is wrong, e.g., when a CANCEL (or UPDATE) occurs before an INSERT.

How could I filter out those cases? It should check the TIMESTAMP of an CANCEL operation for that order id group, and check if that timestamp is earlier than the TIMESTAMP of an INSERT operation. Ideally, the solution should be handle cases when there are multiple of the same OPERATION for the same order ID, e.g., in some cases the OPERATIONS are occuring more than once (at different timestamps).

My idea so far:
df.groupby(["ORDER_ID", "OPERATION"])['TIMESTAMP'].first() gives me a table that shows the first occurences of each different OPERATION for the ORDER_IDs. Now I would just need some way to compare those returned values inside.

Asked By: Olorun

||

Answers:

Convert column to datetimes, replace not matched datimes to missing values in Series.where and get minimal datetime per groups, last compare both columns:

df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'])

df1 = (df.assign(s1 = df['TIMESTAMP'].where(df['OPERATION'].eq('INSERT')),
                 s2 = df['TIMESTAMP'].where(df['OPERATION'].isin(['UPDATE','CANCEL'])))
          .groupby('ORDER_ID')[['s1','s2']]
          .transform('min'))


df = df[df1.s1.lt(df1.s2)]
print (df)
            TIMESTAMP  ORDER_ID OPERATION
0 2022-09-01 13:49:00       123    INSERT
1 2022-09-01 13:49:25       123    UPDATE
2 2022-09-01 13:49:37       123    CANCEL
Answered By: jezrael
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.