Merge two Dataframe without duplicate rows pandas
Question:
We have problem in merge two different Pandas data frame, using the merge method it duplicate the rows with the same ID
We have two different DF in Pandas:
DF1:
ID Gender
0 1 Male
1 2 Female
2 3 Female
DF2
ID Vote
0 1 30
1 2 27
2 2 22
We want this result as output:
ID Gender Vote 1 Vote 2
0 1 Male 30 <NA>
1 2 Female 27 22
2 3 Female <NA> <NA>
Answers:
You have to reshape your second dataframe before merging to the first one:
df2a = df2.pivot_table(index='ID', columns=df2.groupby('ID').cumcount().add(1),
values='Vote').add_prefix('Vote ')
out = df1.merge(df2a, on='ID', how='left')
print(out)
# Output
ID Gender Vote 1 Vote 2
0 1 Male 30.0 NaN
1 2 Female 27.0 22.0
2 3 Female NaN NaN
Another way to do this is as follows:
We have first dataframe:
data1 = [[1, 'Male'], [2, 'Female'], [3, 'Female']]
df1 = pd.DataFrame(data1, columns=['ID', 'Gender'])
# Output
ID Gender
1 Male
2 Female
3 Female
And the second dataframe:
data2 = [[1, 30], [2, 27], [2, 22]]
df2 = pd.DataFrame(data2, columns=['ID', 'Vote'])
# Output
ID Vote
1 30
2 27
2 22
We first merge:
merged_df = df1.merge(df2, on='ID', how='outer')
#Output:
ID Gender Vote
0 1 Male 30.0
1 2 Female 27.0
2 2 Female 22.0
3 3 Female NaN
Assuming that df1 have all unique rows, you can create your required dataframe as follows:
final_df_list = []
for id in merged_df.ID.unique():
prefix = 'Vote'
suffix = 1
current_df = merged_df.loc[merged_df.ID == id]
gender = merged_df.loc[merged_df.ID == id]['Gender'].values[0]
data = [[id, gender]]
columns = ['ID', 'Gender']
for vote in current_df.Vote.values:
data[0].append(vote)
columns.append(f'{prefix}{suffix}')
suffix += 1
final_df_list.append(pd.DataFrame(data=data, columns=columns))
output_df = pd.concat(final_df_list, sort = False)
# output_df:
ID Gender Vote1 Vote2
0 1 Male 30.0 NaN
0 2 Female 27.0 22.0
0 3 Female NaN NaN
We have problem in merge two different Pandas data frame, using the merge method it duplicate the rows with the same ID
We have two different DF in Pandas:
DF1:
ID Gender
0 1 Male
1 2 Female
2 3 Female
DF2
ID Vote
0 1 30
1 2 27
2 2 22
We want this result as output:
ID Gender Vote 1 Vote 2
0 1 Male 30 <NA>
1 2 Female 27 22
2 3 Female <NA> <NA>
You have to reshape your second dataframe before merging to the first one:
df2a = df2.pivot_table(index='ID', columns=df2.groupby('ID').cumcount().add(1),
values='Vote').add_prefix('Vote ')
out = df1.merge(df2a, on='ID', how='left')
print(out)
# Output
ID Gender Vote 1 Vote 2
0 1 Male 30.0 NaN
1 2 Female 27.0 22.0
2 3 Female NaN NaN
Another way to do this is as follows:
We have first dataframe:
data1 = [[1, 'Male'], [2, 'Female'], [3, 'Female']]
df1 = pd.DataFrame(data1, columns=['ID', 'Gender'])
# Output
ID Gender
1 Male
2 Female
3 Female
And the second dataframe:
data2 = [[1, 30], [2, 27], [2, 22]]
df2 = pd.DataFrame(data2, columns=['ID', 'Vote'])
# Output
ID Vote
1 30
2 27
2 22
We first merge:
merged_df = df1.merge(df2, on='ID', how='outer')
#Output:
ID Gender Vote
0 1 Male 30.0
1 2 Female 27.0
2 2 Female 22.0
3 3 Female NaN
Assuming that df1 have all unique rows, you can create your required dataframe as follows:
final_df_list = []
for id in merged_df.ID.unique():
prefix = 'Vote'
suffix = 1
current_df = merged_df.loc[merged_df.ID == id]
gender = merged_df.loc[merged_df.ID == id]['Gender'].values[0]
data = [[id, gender]]
columns = ['ID', 'Gender']
for vote in current_df.Vote.values:
data[0].append(vote)
columns.append(f'{prefix}{suffix}')
suffix += 1
final_df_list.append(pd.DataFrame(data=data, columns=columns))
output_df = pd.concat(final_df_list, sort = False)
# output_df:
ID Gender Vote1 Vote2
0 1 Male 30.0 NaN
0 2 Female 27.0 22.0
0 3 Female NaN NaN