Python inequality join with group by

Question:

I have the following two dataframes

import pandas as pd

dates = ['31-12-2015', '31-12-2016', '31-12-2017', '31-12-2018']
df1 = pd.DataFrame({'id': [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
                    't': dates*4,
                    'stage': [1,2,2,3,1,1,2,3,1,1,1,3,2,1,1,3]})

df2 = df1.loc[df1['stage'] == 1]

What is the most efficient way of doing the operation below in python?

Select a.id
      ,a.t
      ,max(b.stage = 2) as flag 
From df2 as a
Left join df1 as b
  On a.id = b.id and a.t < b.t
Group by 1,2
Asked By: Serge Kashlik

||

Answers:

This should help you out:

import pandas as pd

# Your data

dates = ['31-12-2015', '31-12-2016', '31-12-2017', '31-12-2018']
df1 = pd.DataFrame({'id': [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
                    't': dates*4,
                    'stage': [1,2,2,3,1,1,2,3,1,1,1,3,2,1,1,3]})

# Filter stage 1

df2 = df1.loc[df1['stage'] == 1]

# Merge df2 and df1 on 'id' with a left join

merged_df = df2.merge(df1, on='id', how='left', suffixes=('', '_b'))

# Filter rows where a.t < b.t

filtered_df = merged_df[merged_df['t'] < merged_df['t_b']]

# Group by 'id' and 't', and create a flag for max(b.stage = 2)

result = filtered_df.groupby(['id', 't']).agg(flag=('stage_b', lambda x: (x == 2).max())).reset_index()

print(result)

Answered By: cconsta1
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.