Pandas: Update master dataframe with additional columns, with growing number of common columns in subsequent merges

Question:

I’m trying to figure out how to achieve the following. Say that I have a master dataframe with an ID column and various other data:

ID     A       B       C 
01     1x      1y      1z 
02     2x      2y      2z 
03     3x      3y      3z 
04     4x      4y      4z 
... 
01000  01000x  01000y  01000z

I also have additional single-row data for each of these IDs in separate, very wide dataframes:

ID   New1    New2    New3    ...  New50 
01   01val1  01val2  01val3  ...  01val50

My aim is to create a very sparse matrix that would add all of these columns to the master dataframe after merging on their common ID value.

The complicating factor is that the additional dataframes for the different IDs share many column names with one another, and when that happens I want the values to land in the same column.

A desired example output would look something like this:

ID     A       B       C       New1     New2     New3      ...  New50    ...  New200
01     1x      1y      1z      01val1   01val2   01val3    ...  01val50  ...  n/a
02     2x      2y      2z      n/a      n/a      02val3    ...  n/a      ...  02val200
03     3x      3y      3z      03val1   n/a      n/a       ...  03val50  ...  n/a
04     4x      4y      4z      n/a      04val2   n/a       ...  n/a      ...  n/a
...
01000  01000x  01000y  01000z  n/a      n/a      01000val3 ...  n/a      ...  01000val200

So if I merge ID 01’s dataframe with the master and end up adding 20 new columns, and then merge ID 02’s dataframe with the master and 10 of those columns have the same names, ID 02’s values for the common columns should simply be inserted into the existing columns rather than, e.g., New1_x and New1_y showing up.

I’ve tried two strategies. First, by repeatedly calling:

master.merge(new_data, on='ID', how='left')

as part of a for loop, the overlapping columns get added as New1_x and New1_y, and I can’t simply drop all of the *_y columns because the actual values end up split: ID 01 has its New1 value in New1_x, while ID 02 has its New1 value in New1_y.
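
For illustration, here is a minimal, made-up reproduction of that behaviour (the frames are hypothetical, not my real data):

import pandas as pd

master = pd.DataFrame({'ID': ['01', '02'], 'A': ['1x', '2x']})
new_01 = pd.DataFrame({'ID': ['01'], 'New1': ['01val1']})
new_02 = pd.DataFrame({'ID': ['02'], 'New1': ['02val1']})

for new_data in [new_01, new_02]:
    master = master.merge(new_data, on='ID', how='left')

print(master.columns.tolist())
# ['ID', 'A', 'New1_x', 'New1_y'] -- the second merge sees New1 on both sides
# and splits it into suffixed columns instead of filling the existing one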

Second, by identifying all of the common columns before each merge and doing:

common = list(master.columns.intersection(new_data.columns))

master = master.merge(new_data, on=common, how='left')

I eventually get a non-unique error, and the merges that do succeed add no values to the existing columns (presumably because those columns are now part of the join keys, so the incoming rows never match), meaning only the first additional dataframe is accounted for. Essentially it looks like this (a small reproduction follows the table):

ID  A   B   C   New1    New2    New3    New4
01  1x  1y  1z  01val1  01val2  01val3  n/a
02  2x  2y  2z  -       -       -       -
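
A minimal sketch of why this happens (again with made-up frames): once New1 is part of the join keys, the master’s missing values never match the incoming ones, so the left merge brings nothing across:

import pandas as pd

master = pd.DataFrame({'ID': ['01', '02'], 'New1': ['01val1', None]})
new_02 = pd.DataFrame({'ID': ['02'], 'New1': ['02val1'], 'New2': ['02val2']})

common = list(master.columns.intersection(new_02.columns))  # ['ID', 'New1']
out = master.merge(new_02, on=common, how='left')
print(out)
#    ID    New1 New2
# 0  01  01val1  NaN
# 1  02    None  NaN   <- ID 02's values never arrive because New1 is a join key
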
Asked By: Zach Champion


Answers:

The concat function does exactly what you need:

import pandas as pd

# Toy frames indexed by ID
master = pd.DataFrame([['1x', '1y'], ['2x', '2y']], index=['01', '02'], columns=['A', 'B'])
new1 = pd.DataFrame([['abc', 'def']], index=['01'], columns=['New1', 'New2'])
new2 = pd.DataFrame([['sdf', 'res']], index=['02'], columns=['New1', 'New3'])

# Stack the per-ID frames; shared column names line up, gaps become NaN
newc = pd.concat([new1, new2], axis=0)

# Attach the combined columns to the master by index
# (pass how='left' to also keep IDs that have no extra data)
master.merge(newc, left_index=True, right_index=True)

The list in concat can be expanded at will.
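
If ID is a regular column rather than the index, as in the question, roughly the same idea works; here is a minimal sketch, assuming the per-ID frames have been collected into a list called frames (a name not in the original post):

import pandas as pd

master = pd.DataFrame({'ID': ['01', '02', '03'],
                       'A': ['1x', '2x', '3x']})
frames = [
    pd.DataFrame({'ID': ['01'], 'New1': ['01val1'], 'New2': ['01val2']}),
    pd.DataFrame({'ID': ['02'], 'New1': ['02val1'], 'New3': ['02val3']}),
]

# One concat instead of repeated merges: shared columns line up, gaps become NaN
newc = pd.concat(frames, ignore_index=True)

# Left merge on ID keeps every master row, even IDs with no extra data
result = master.merge(newc, on='ID', how='left')
print(result)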

Answered By: Arnau