DF1 is a subset of DF2: get DF3 = DF2 - DF1, the rows of DF2 that do not match DF1
Question:
DataFrame 1 (df1):
         date  L120_active_cohort_logins  L120_active_cohort  percentage_L120_active_cohort_logins
0  2022-09-01                      32679              195345                             16.728865
1  2022-09-02                      32938              196457                             16.766010
2  2022-09-03                      40746              197586                             20.621906
3  2022-09-04                      33979              198799                             17.092138
DataFrame 2 (df2):
         date  L120_active_cohort_logins  L120_active_cohort  percentage_L120_active_cohort_logins
0  2022-09-01                      32677              195345                             16.728864
1  2022-09-02                      32938              196457                             16.766010
2  2022-09-03                      40746              197586                             20.621906
3  2022-09-04                      33979              198799                             17.092138
Desired result: df3 = df2 - df1
I want every row of df2 that has no matching row in df1 to be stored in df3.
Expected output:
         date  L120_active_cohort_logins  L120_active_cohort  percentage_L120_active_cohort_logins
0  2022-09-01                      32677              195345                             16.728864
Answers:
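See the related question "Comparing two dataframes and getting the differences":
# Stack both frames and drop every row that occurs in both
new_df = pd.concat([df1, df2]).drop_duplicates(keep=False)
# Where two leftover rows share an index label, keep the later one (the df2 row)
df3 = new_df[~new_df.index.duplicated(keep='last')]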
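To see why those two steps isolate the df2 row, here is a minimal runnable sketch on the question's data. It assumes df1 and df2 share the same index labels, because the second step relies on the differing rows colliding on one label; the way df2 is built here (copying df1 and editing row 0) is only for brevity.

```python
import pandas as pd

cols = ["date", "L120_active_cohort_logins", "L120_active_cohort",
        "percentage_L120_active_cohort_logins"]
df1 = pd.DataFrame([
    ["2022-09-01", 32679, 195345, 16.728865],
    ["2022-09-02", 32938, 196457, 16.766010],
    ["2022-09-03", 40746, 197586, 20.621906],
    ["2022-09-04", 33979, 198799, 17.092138],
], columns=cols)

# df2 equals df1 except for row 0 (values taken from the question).
df2 = df1.copy()
df2.loc[0, "L120_active_cohort_logins"] = 32677
df2.loc[0, "percentage_L120_active_cohort_logins"] = 16.728864

# keep=False drops every row that appears more than once in the stacked frame,
# so only rows unique to one of the two frames survive (two rows here).
new_df = pd.concat([df1, df2]).drop_duplicates(keep=False)

# Both survivors carry index label 0; keep='last' retains the df2 copy.
df3 = new_df[~new_df.index.duplicated(keep="last")]
print(df3)
```

If df1 were not a subset of df2, rows that exist only in df1 under a label of their own would also survive, so this is best suited to the aligned case described in the question.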
Since you only need the difference, you can do it this way:
# Full example
import pandas as pd
df1 = pd.DataFrame({
    'date': ['2022-09-01', '2022-09-02', '2022-09-03', '2022-09-04'],
    'L120_active_cohort_logins': [32679, 32938, 40746, 33979],
    'L120_active_cohort': [195345, 196457, 197586, 198799],
    'percentage_L120_active_cohort_logins': [16.728865, 16.76601, 20.621906, 17.092138],
})
df2 = pd.DataFrame({
    'date': ['2022-09-01', '2022-09-02', '2022-09-03', '2022-09-04'],
    'L120_active_cohort_logins': [32677, 32938, 40746, 33979],
    'L120_active_cohort': [195345, 196457, 197586, 198799],
    'percentage_L120_active_cohort_logins': [16.728865, 16.76601, 20.621906, 17.092138],
})
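# Outer-merge on all columns, flag where each row came from, and keep the rows found only in df2.
# (Dropping rows by position after the outer merge relies on its row order, which can vary between pandas versions.)
merged = pd.merge(df1, df2, how='outer', indicator=True)
df3 = merged[merged['_merge'] == 'right_only'].drop(columns='_merge')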
print(df3)
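Note that pd.merge returns a freshly numbered index, so the row labels of df2 are not carried over. If those labels matter (the expected output in the question shows label 0), one alternative, which is not what the answer above does, is to compare whole rows as tuples. A rough sketch, assuming exact equality is the right notion of a match and that the frames contain no NaN values:

```python
import pandas as pd

cols = ["date", "L120_active_cohort_logins", "L120_active_cohort",
        "percentage_L120_active_cohort_logins"]
df1 = pd.DataFrame([
    ["2022-09-01", 32679, 195345, 16.728865],
    ["2022-09-02", 32938, 196457, 16.766010],
    ["2022-09-03", 40746, 197586, 20.621906],
    ["2022-09-04", 33979, 198799, 17.092138],
], columns=cols)
df2 = pd.DataFrame([
    ["2022-09-01", 32677, 195345, 16.728864],
    ["2022-09-02", 32938, 196457, 16.766010],
    ["2022-09-03", 40746, 197586, 20.621906],
    ["2022-09-04", 33979, 198799, 17.092138],
], columns=cols)

# Every df1 row as a plain tuple, collected in a set for fast membership checks.
df1_rows = set(df1.itertuples(index=False, name=None))

# True for each df2 row that has no identical row in df1.
mask = [row not in df1_rows for row in df2.itertuples(index=False, name=None)]

# Boolean filtering keeps df2's original index labels.
df3 = df2[mask]
print(df3)
```

This loops over the rows in Python, so it is likely to be slower than a merge on large frames; it trades speed for keeping the original labels and column names untouched.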
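This worked for me:
# Outer-merge on the login count; the 'Exist' column records which frame each key came from
pd_df1 = pd.merge(click_df1, click_df2, on="L120_active_cohort_logins", how='outer', indicator='Exist')
# Discard keys present in both frames
pd_df1 = pd_df1.loc[pd_df1['Exist'] != 'both']
# Keep keys found only in the second frame, taking its (_y suffixed) columns
final_df = pd_df1[pd_df1['Exist'] == 'right_only'][['date_y','L120_active_cohort_logins','L120_active_cohort_y','percentage_L120_active_cohort_logins_y']]
# Restore the original column names
columns = ['date','L120_active_cohort_logins','L120_active_cohort','percentage_L120_active_cohort_logins']
final_df.columns = columns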
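Merging on a single column produces the _x/_y suffixed columns that then have to be selected and renamed. Merging on all shared columns avoids that step. The sketch below wraps the idea in a helper; the name rows_not_in is hypothetical, and it assumes both frames have identical column names and that a "match" means equality in every column.

```python
import pandas as pd

def rows_not_in(df: pd.DataFrame, other: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of `df` that have no identical row in `other`."""
    # With no `on=` argument, merge joins on all columns the frames share.
    # De-duplicating `other` keeps matched rows of `df` from being repeated.
    merged = df.merge(other.drop_duplicates(), how="left", indicator=True)
    # 'left_only' marks rows of `df` that found no partner in `other`.
    unmatched = merged[merged["_merge"] == "left_only"]
    return unmatched.drop(columns="_merge").reset_index(drop=True)

cols = ["date", "L120_active_cohort_logins", "L120_active_cohort",
        "percentage_L120_active_cohort_logins"]
df1 = pd.DataFrame([
    ["2022-09-01", 32679, 195345, 16.728865],
    ["2022-09-02", 32938, 196457, 16.766010],
], columns=cols)
df2 = pd.DataFrame([
    ["2022-09-01", 32677, 195345, 16.728864],
    ["2022-09-02", 32938, 196457, 16.766010],
], columns=cols)

df3 = rows_not_in(df2, df1)  # df2 - df1
print(df3)
```

A left merge from df2 is used here instead of an outer merge so that the result follows the row order of df2; filtering an outer merge on 'right_only' should select the same rows.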