Join two Pandas dataframes with new column containing combined matching results
Question:
Apologies if this has been answered already, but I wasn’t able to find a similar post.
I’ve got two Pandas dataframes that I’d like to merge. Dataframe1 contains data which has failed validation. Dataframe2 contains the detail for each row where the errors have occurred (ErrorColumn).
As you can see in Dataframe2, there can be multiple errors for a single row. I need to consolidate the errors, then append them as a new column (ErrorColumn) in Dataframe1.
Example below:
Dataframe 1:
ErrorRow | MaterialID | Description | UnitCost | Quantity | Critical | Location |
---|---|---|---|---|---|---|
3 | nan | Part 1 | nan | 100 | false | West |
4 | nan | Part 2 | 12 | nan | true | East |
7 | 56779 | Part 3 | 25 | nan | false | West |
Dataframe 2:
ErrorRow | ErrorColumn |
---|---|
3 | MaterialID |
3 | UnitCost |
4 | MaterialID |
4 | Quantity |
7 | Quantity |
Result:
ErrorRow | MaterialID | Description | UnitCost | Quantity | Critical | Location | ErrorColumn |
---|---|---|---|---|---|---|---|
3 | nan | Part 1 | nan | 100 | false | West | MaterialID, UnitCost |
4 | nan | Part 2 | 12 | nan | true | East | MaterialID, Quantity |
7 | 56779 | Part 3 | 25 | nan | false | West | Quantity |
Any assistance is appreciated. I’m new to Python, so there’s likely a simple solution that I have yet to find or learn.
Answers:
You can use pandas.DataFrame.merge with GroupBy.agg:
out = df1.merge(df2.groupby("ErrorRow", as_index=False).agg(", ".join), on="ErrorRow")
# or, if you need the unique values as a set instead of a string, use GroupBy.agg(set)
# Output:
print(out.to_string())
ErrorRow MaterialID Description UnitCost Quantity Critical Location ErrorColumn
0 3 NaN Part 1 NaN 100.0 False West MaterialID, UnitCost
1 4 NaN Part 2 12.0 NaN True East MaterialID, Quantity
2 7 56779.0 Part 3 25.0 NaN False West Quantity
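For anyone who wants to run this end to end, here is a minimal, self-contained sketch. The sample frames are rebuilt by hand from the tables in the question, and the intermediate name `errors` is only illustrative. It also passes `how="left"`, which is an assumption on my part and not part of the one-liner above: the default inner join drops any row of `df1` that has no match in `df2`, whereas a left join keeps it with `NaN` in `ErrorColumn`. On the sample data both give the same result.

```python
import pandas as pd

nan = float("nan")

# Sample data rebuilt from the question's tables.
df1 = pd.DataFrame({
    "ErrorRow": [3, 4, 7],
    "MaterialID": [nan, nan, 56779],
    "Description": ["Part 1", "Part 2", "Part 3"],
    "UnitCost": [nan, 12, 25],
    "Quantity": [100, nan, nan],
    "Critical": [False, True, False],
    "Location": ["West", "East", "West"],
})
df2 = pd.DataFrame({
    "ErrorRow": [3, 3, 4, 4, 7],
    "ErrorColumn": ["MaterialID", "UnitCost", "MaterialID", "Quantity", "Quantity"],
})

# Step 1: collapse df2 to one row per ErrorRow, joining the error names
# into a single comma-separated string (original row order is kept).
errors = df2.groupby("ErrorRow")["ErrorColumn"].agg(", ".join).reset_index()

# Step 2: attach the consolidated errors to df1.
# how="left" (an assumption, see above) keeps every row of df1.
out = df1.merge(errors, on="ErrorRow", how="left")

print(out.to_string())
```

If `df2` can contain the same error twice for one row, something like `.agg(lambda s: ", ".join(sorted(set(s))))` in step 1 should remove the repeats, at the cost of sorting the names alphabetically.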