Merge two Dataframe without duplicate rows pandas

Question:

We have problem in merge two different Pandas data frame, using the merge method it duplicate the rows with the same ID

We have two different DF in Pandas:

DF1:

   ID  Gender
0   1    Male
1   2  Female
2   3  Female

DF2

   ID  Vote
0   1    30
1   2    27
2   2    22

We want this result as output:

   ID  Gender  Vote 1  Vote 2
0   1    Male      30    <NA>
1   2  Female      27      22
2   3  Female    <NA>    <NA>
Asked By: Acco

||

Answers:

You have to reshape your second dataframe before merging to the first one:

df2a = df2.pivot_table(index='ID', columns=df2.groupby('ID').cumcount().add(1), 
                       values='Vote').add_prefix('Vote ')

out = df1.merge(df2a, on='ID', how='left')
print(out)

# Output
   ID  Gender  Vote 1  Vote 2
0   1    Male    30.0     NaN
1   2  Female    27.0    22.0
2   3  Female     NaN     NaN
Answered By: Corralien

Another way to do this is as follows:

We have first dataframe:

data1 = [[1, 'Male'], [2, 'Female'], [3, 'Female']]
df1 = pd.DataFrame(data1, columns=['ID', 'Gender'])

# Output
ID  Gender
1   Male
2   Female
3   Female

And the second dataframe:

data2 = [[1, 30], [2, 27], [2, 22]]
df2 = pd.DataFrame(data2, columns=['ID', 'Vote'])

# Output
ID  Vote
1   30
2   27
2   22

We first merge:

merged_df = df1.merge(df2, on='ID', how='outer')

#Output:

   ID   Gender  Vote
0   1   Male    30.0
1   2   Female  27.0
2   2   Female  22.0
3   3   Female  NaN

Assuming that df1 have all unique rows, you can create your required dataframe as follows:

final_df_list = []
for id in merged_df.ID.unique():
    prefix = 'Vote'
    suffix = 1
    current_df = merged_df.loc[merged_df.ID == id]
    gender = merged_df.loc[merged_df.ID == id]['Gender'].values[0]
    data = [[id, gender]]
    columns = ['ID', 'Gender']
    for vote in current_df.Vote.values:
        data[0].append(vote)
        columns.append(f'{prefix}{suffix}')
        suffix += 1
    final_df_list.append(pd.DataFrame(data=data, columns=columns))
output_df = pd.concat(final_df_list, sort = False)

# output_df:

   ID   Gender  Vote1   Vote2
0   1   Male    30.0    NaN
0   2   Female  27.0    22.0
0   3   Female  NaN     NaN
Answered By: Akshay Rajeev