Pandas: Update master dataframe with additional columns, with growing number of common columns in subsequent merges
Question:
I’m trying to figure out how to achieve the following. Say that I have a master dataframe with an ID column and various other data:
ID A B C
01 1x 1y 1z
02 2x 2y 2z
03 3x 3y 3z
04 4x 4y 4z
...
01000 01000x 01000y 01000z
In addition, I have single-row data for each of these IDs in separate, very wide dataframes:
ID New1 New2 New3 ... New50
01 01val1 01val2 01val3 ... 01val50
My aim is to create a very sparse matrix that would add all of these columns to the master dataframe after merging on their common ID value.
The complicating factor is: the additional dataframes of each different ID will have multiple overlapping column names with the dataframes of the other IDs, and I want to keep the data in the same column when that is the case.
A desired example output would look something like this:
ID A B C New1 New2 New3 ... New50 ... New200
01 1x 1y 1z 01val1 01val2 01val3 ... 01val50 ... n/a
02 2x 2y 2z n/a n/a 02val3 ... n/a ... 02val200
03 3x 3y 3z 03val1 n/a n/a ... 03val50 ... n/a
04 4x 4y 4z n/a 04val2 n/a ... n/a ... n/a
...
01000 01000x 01000y 01000z n/a n/a 01000val3 ... n/a ... 01000val200
So if I merge ID 01’s dataframe with the master and end up adding 20 new columns, and then merge ID 02’s dataframe with the master and 10 of those columns have the same name, ID 02’s values for the common columns are simply inserted rather than, e.g., New1_x and New1_y showing up.
I’ve tried two strategies. By repeatedly merging on:
master.merge(new_data, on='ID', how='left')
as part of a for loop, all new dataframe columns get added as New1_x and New1_y, and I can’t simply drop all of the *_y columns because the values are split: ID 01 has its New1 value in New1_x, while ID 02 has its New1 value in New1_y.
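For illustration, a minimal sketch reproducing the suffix collision (the frame names df01/df02 here are hypothetical stand-ins for the per-ID dataframes):

```python
import pandas as pd

master = pd.DataFrame({'ID': ['01', '02'], 'A': ['1x', '2x']})
df01 = pd.DataFrame({'ID': ['01'], 'New1': ['01val1']})
df02 = pd.DataFrame({'ID': ['02'], 'New1': ['02val1']})

# Repeated left merges duplicate the shared column with _x/_y suffixes
out = master.merge(df01, on='ID', how='left').merge(df02, on='ID', how='left')
print(out.columns.tolist())  # ['ID', 'A', 'New1_x', 'New1_y']
```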
By identifying ALL common columns before each merge and doing:
common = list(master.columns.intersection(new_data.columns))
master = master.merge(new_data, on=common, how='left')
I eventually get a non-unique error, and the merges that do succeed add no values to the new columns, so only the first additional dataframe is accounted for. Essentially it looks like this:
ID A B C New1 New2 New3 New4
01 1x 1y 1z 01val1 01val2 01val3 n/a
02 2x 2y 2z - - - -
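A hedged sketch of why this second strategy adds nothing (again with hypothetical df01/df02 frames): once New1 becomes part of the merge key, ID 02's row only matches if its New1 value equals what master already holds (NaN), so nothing lines up.

```python
import pandas as pd

master = pd.DataFrame({'ID': ['01', '02'], 'A': ['1x', '2x']})
df01 = pd.DataFrame({'ID': ['01'], 'New1': ['01val1']})
master = master.merge(df01, on='ID', how='left')  # master now has a New1 column

df02 = pd.DataFrame({'ID': ['02'], 'New1': ['02val1']})
common = list(master.columns.intersection(df02.columns))  # ['ID', 'New1']

# ID 02's master row has New1 = NaN, which never equals '02val1',
# so the left merge keeps the row but fills in nothing
out = master.merge(df02, on=common, how='left')
print(out)
```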
Answers:
The concat function does exactly what you need:
import pandas as pd

master = pd.DataFrame([['1x', '1y'], ['2x', '2y']], index=['01', '02'], columns=['A', 'B'])
new1 = pd.DataFrame([['abc', 'def']], index=['01'], columns=['New1', 'New2'])
new2 = pd.DataFrame([['sdf', 'res']], index=['02'], columns=['New1', 'New3'])

# Stack the per-ID frames; shared column names line up, missing ones become NaN
newc = pd.concat([new1, new2], axis=0)
master.merge(newc, left_index=True, right_index=True)
The list in concat can be expanded at will.
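Scaled up to the question's setup, with ID as a column rather than the index, the same idea can be sketched as follows (df01/df02 are hypothetical stand-ins for the per-ID frames; note that `how='left'` keeps IDs that have no extra data at all, which the default inner merge would drop):

```python
import pandas as pd

master = pd.DataFrame({'ID': ['01', '02', '03'],
                       'A': ['1x', '2x', '3x']})
df01 = pd.DataFrame({'ID': ['01'], 'New1': ['01val1'], 'New2': ['01val2']})
df02 = pd.DataFrame({'ID': ['02'], 'New1': ['02val1'], 'New3': ['02val3']})

# Stack all per-ID frames first; overlapping columns share one slot,
# columns absent from a frame are filled with NaN
newc = pd.concat([df01, df02], axis=0, ignore_index=True)

# A left merge on ID keeps every master row, even ID 03 with no extra data
result = master.merge(newc, on='ID', how='left')
print(result)
```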