Pandas Merge Many Rows of One Dataframe into Fewer Rows of Second Dataframe
Question:
Is there an easy way to merge two dataframes such that df2 adds all its rows with matching ‘on’ values as new columns in df1? Open to other methods of joining the data as well.
e.g.
Matching on Course Offering Code and User Id
df1:
Course Offering Code
User Id
001
1
001
2
df2:
Course Offering Code
User Id
Assignment
grade%
001
1
A01
65
001
1
A02
85
001
1
A03
95
001
1
A04
64
001
2
A01
87
001
2
A02
86
001
2
A03
82
001
2
A04
90
I had tried pd.merge(df1, df2, on=['User Id', 'Course Offering Code'])
and was hoping for the following:
desired_df
Course Offering Code
User Id
Assignment_x
grade%_x
Assignment_y
grade%_y
Assignment_z
grade%_z
Assignment_a
grade%_a
001
1
A01
65
A02
85
A03
95
A04
64
001
2
A01
87
A02
86
A03
82
A04
90
Answers:
You can do this one with pandas.DataFrame.pivot
:
def flatten_cols(df):
df.columns = ['_'.join(map(str, x)) for x in df.columns]
df = df[sorted(df.columns, key=lambda x: int(x.split("_")[-1]))]
return df
out = (
df1
.merge(df2, on=["Course Offering Code", "User Id"], how="left")
.assign(idx=lambda x: x.groupby(['Course Offering Code', 'User Id']).cumcount()+1)
.pivot(index= ["Course Offering Code", "User Id"],
columns= "idx", values=["Assignment", "grade%"])
.pipe(flatten_cols)
.reset_index()
)
# Output :
print(out.to_string())
Course Offering Code User Id Assignment_1 grade%_1 Assignment_2 grade%_2 Assignment_3 grade%_3 Assignment_4 grade%_4
0 001 1 A01 65 A02 85 A03 95 A04 64
1 001 2 A01 87 A02 86 A03 82 A04 90
Is there an easy way to merge two dataframes such that df2 adds all its rows with matching ‘on’ values as new columns in df1? Open to other methods of joining the data as well.
e.g.
Matching on Course Offering Code and User Id
df1:
Course Offering Code | User Id |
---|---|
001 | 1 |
001 | 2 |
df2:
Course Offering Code | User Id | Assignment | grade% |
---|---|---|---|
001 | 1 | A01 | 65 |
001 | 1 | A02 | 85 |
001 | 1 | A03 | 95 |
001 | 1 | A04 | 64 |
001 | 2 | A01 | 87 |
001 | 2 | A02 | 86 |
001 | 2 | A03 | 82 |
001 | 2 | A04 | 90 |
I had tried pd.merge(df1, df2, on=['User Id', 'Course Offering Code'])
and was hoping for the following:
desired_df
Course Offering Code | User Id | Assignment_x | grade%_x | Assignment_y | grade%_y | Assignment_z | grade%_z | Assignment_a | grade%_a |
---|---|---|---|---|---|---|---|---|---|
001 | 1 | A01 | 65 | A02 | 85 | A03 | 95 | A04 | 64 |
001 | 2 | A01 | 87 | A02 | 86 | A03 | 82 | A04 | 90 |
You can do this one with pandas.DataFrame.pivot
:
def flatten_cols(df):
df.columns = ['_'.join(map(str, x)) for x in df.columns]
df = df[sorted(df.columns, key=lambda x: int(x.split("_")[-1]))]
return df
out = (
df1
.merge(df2, on=["Course Offering Code", "User Id"], how="left")
.assign(idx=lambda x: x.groupby(['Course Offering Code', 'User Id']).cumcount()+1)
.pivot(index= ["Course Offering Code", "User Id"],
columns= "idx", values=["Assignment", "grade%"])
.pipe(flatten_cols)
.reset_index()
)
# Output :
print(out.to_string())
Course Offering Code User Id Assignment_1 grade%_1 Assignment_2 grade%_2 Assignment_3 grade%_3 Assignment_4 grade%_4
0 001 1 A01 65 A02 85 A03 95 A04 64
1 001 2 A01 87 A02 86 A03 82 A04 90