# Python inequality join with group by

## Question:

I have the following two dataframes

``````import pandas as pd

dates = ['31-12-2015', '31-12-2016', '31-12-2017', '31-12-2018']
df1 = pd.DataFrame({'id': [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
't': dates*4,
'stage': [1,2,2,3,1,1,2,3,1,1,1,3,2,1,1,3]})

df2 = df1.loc[df1['stage'] == 1]
``````

What is the most efficient way of doing the operation below in python?

``````Select a.id
,a.t
,max(b.stage = 2) as flag
From df2 as a
Left join df1 as b
On a.id = b.id and a.t < b.t
Group by 1,2
``````

``````import pandas as pd

dates = ['31-12-2015', '31-12-2016', '31-12-2017', '31-12-2018']
df1 = pd.DataFrame({'id': [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
't': dates*4,
'stage': [1,2,2,3,1,1,2,3,1,1,1,3,2,1,1,3]})

# Filter stage 1

df2 = df1.loc[df1['stage'] == 1]

# Merge df2 and df1 on 'id' with a left join

merged_df = df2.merge(df1, on='id', how='left', suffixes=('', '_b'))

# Filter rows where a.t < b.t

filtered_df = merged_df[merged_df['t'] < merged_df['t_b']]

# Group by 'id' and 't', and create a flag for max(b.stage = 2)

result = filtered_df.groupby(['id', 't']).agg(flag=('stage_b', lambda x: (x == 2).max())).reset_index()

print(result)

``````
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.