Merge two dataframes based on grouped column values of one dataframe
Question:
I have the following dataframe:
df1 = pd.DataFrame({'Parent': ['Stay home', "Stay home","Stay home", 'Go outside', "Go Outside","Go outside"],
'Child' : ['Severe weather', "Severe weather", "Severe weather", 'Sunny', "Sunny", "sunny"]})
Parent Child
0 Stay home Severe weather
1 Stay home Severe weather
2 Stay home Severe weather
3 Go outside Sunny
4 Go Outside Sunny
5 Go outside sunny
And a second one:
df2 = pd.DataFrame({'Similarity_Score': ['SimilarityScore:0.43693185876069784', 'SimilarityScore:0.299807821163373']})
Similarity_Score
0 SimilarityScore:0.43693185876069784
1 SimilarityScore:0.299807821163373
I want to merge the two dataframes based on the Child values of df1.
Expected outcome:
Parent Child Similarity_Score
0 Stay home Severe weather 0.43693185876069784
1 Stay home Severe weather 0.43693185876069784
2 Stay home Severe weather 0.43693185876069784
3 Go outside Sunny 0.299807821163373
4 Go Outside Sunny 0.299807821163373
5 Go outside sunny 0.299807821163373
I tried the usual merge and concat methods but could not find the solution. Any ideas?
Answers:
Based on your responses, the merge is done on the index after identifying the unique (Parent, Child) groups in df1:
# identify the group of each row
# (note: this assumes consistent casing; in the sample data, 'Go Outside' and 'sunny' would otherwise form separate groups)
df1['key'] = df1.groupby(['Parent', 'Child']).ngroup(ascending=False)
df1
# merge the two DataFrames; while merging, split the similarity score to keep only the numeric part
(df1.merge(df2['Similarity_Score'].str.split(':', expand=True)[1],
           left_on='key',
           right_index=True)
 .drop(columns='key'))
Parent Child 1
0 Stay home Severe weather 0.43693185876069784
1 Stay home Severe weather 0.43693185876069784
2 Stay home Severe weather 0.43693185876069784
3 Go outside Sunny 0.299807821163373
4 Go outside Sunny 0.299807821163373
5 Go outside Sunny 0.299807821163373
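Note that the merged column inherits the integer name 1 from the split. A self-contained sketch of the same pipeline that also renames that column (assuming consistent casing in the sample data, which the original question mixes):

```python
import pandas as pd

# sample data with consistent casing; the original question mixes
# 'Go Outside'/'Go outside' and 'Sunny'/'sunny', which would create
# extra groups and break the index-based match (an assumption here)
df1 = pd.DataFrame({'Parent': ['Stay home'] * 3 + ['Go outside'] * 3,
                    'Child': ['Severe weather'] * 3 + ['Sunny'] * 3})
df2 = pd.DataFrame({'Similarity_Score': ['SimilarityScore:0.43693185876069784',
                                         'SimilarityScore:0.299807821163373']})

# number the (Parent, Child) groups so they line up with df2's index
df1['key'] = df1.groupby(['Parent', 'Child']).ngroup(ascending=False)

# merge, keep only the numeric part of the score, and rename the
# split's integer column name 1 to something readable
out = (df1.merge(df2['Similarity_Score'].str.split(':', expand=True)[1],
                 left_on='key', right_index=True)
          .drop(columns='key')
          .rename(columns={1: 'Similarity_Score'}))
```

The resulting column still holds strings; chain `.astype(float)` on it if a numeric score is needed.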
Is there a specific parameter you want to join based on? Are the scores constant for Sunny (~0.2998) and Severe weather (~0.4369)? If that's the case, then make a binary column and merge based on that.
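If the scores really are constant per Child value, a merge may not even be needed: a hedged alternative (assuming df2's rows are ordered like the first appearances of the Child values) is a plain `Series.map` lookup:

```python
import pandas as pd

df1 = pd.DataFrame({'Parent': ['Stay home'] * 3 + ['Go outside'] * 3,
                    'Child': ['Severe weather'] * 3 + ['Sunny'] * 3})
df2 = pd.DataFrame({'Similarity_Score': ['SimilarityScore:0.43693185876069784',
                                         'SimilarityScore:0.299807821163373']})

# keep only the numeric part of the score and convert to float
scores = df2['Similarity_Score'].str.split(':').str[1].astype(float)

# assumed: df2's rows line up with the order in which Child values first appear
score_map = dict(zip(df1['Child'].unique(), scores))
df1['Similarity_Score'] = df1['Child'].map(score_map)
```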
If you want to assign the scores based on the Child value, you can do it like this:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'Parent': ['Stay home', "Stay home", "Stay home", 'Go outside', "Go Outside", "Go outside"],
'Child': ['Severe weather', "Severe weather", "Severe weather", 'Sunny', "Sunny", "Sunny"]})
df2 = pd.DataFrame({'Similarity_Score': ['SimilarityScore:0.43693185876069784', 'SimilarityScore:0.299807821163373']})
# Split the string at : and convert to float
df2['Similarity_Score'] = df2['Similarity_Score'].str.split(':').str[1].astype(float)
# calculate auxiliary column position to base the matching on
df1['position'] = df1['Child'].apply(lambda value: np.where(df1['Child'].unique() == value)[0][0])
# merge both dataframes and drop auxiliary column position
df = df1.merge(df2, left_on='position', right_index=True).drop(columns=["position"])
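As a side note, the `np.where`/`unique` lookup above can be replaced by `pd.factorize`, which numbers values by order of first appearance in a single vectorized call (a sketch under the same ordering assumption):

```python
import pandas as pd

df1 = pd.DataFrame({'Child': ['Severe weather'] * 3 + ['Sunny'] * 3})

# factorize assigns each distinct value an integer code by order of
# first appearance, matching the np.where(...unique()...) lookup
codes, uniques = pd.factorize(df1['Child'])
df1['position'] = codes
```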