pandas: merge (join) two data frames on multiple columns

Question:

I am trying to join two pandas dataframes using two columns:

new_df = pd.merge(A_df, B_df,  how='left', left_on='[A_c1,c2]', right_on = '[B_c1,c2]')

but got the following error:

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4164)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4028)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13166)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13120)()

KeyError: '[B_1, c2]'

Any idea what should be the right way to do this?

Asked By: Edamame

||

Answers:

Try this

new_df = pd.merge(A_df, B_df,  how='left', left_on=['A_c1','c2'], right_on = ['B_c1','c2'])

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

left_on : label or list, or array-like Field names to join on in left
DataFrame. Can be a vector or list of vectors of the length of the
DataFrame to use a particular vector as the join key instead of
columns

right_on : label or list, or array-like Field names to join on
in right DataFrame or vector/list of vectors per left_on docs

Answered By: Shijo

the problem here is that by using the apostrophes you are setting the value being passed to be a string, when in fact, as @Shijo stated from the documentation, the function is expecting a label or list, but not a string! If the list contains each of the name of the columns beings passed for both the left and right dataframe, then each column-name must individually be within apostrophes. With what has been stated, we can understand why this is inccorect:

new_df = pd.merge(A_df, B_df,  how='left', left_on='[A_c1,c2]', right_on = '[B_c1,c2]')

And this is the correct way of using the function:

new_df = pd.merge(A_df, B_df,  how='left', left_on=['A_c1','c2'], right_on = ['B_c1','c2'])
Answered By: Celius Stingher

Another way of doing this:

new_df = A_df.merge(B_df, left_on=['A_c1','c2'], right_on = ['B_c1','c2'], how='left')
Answered By: john ed

you can use below which is short and simple to understand:

merged_data= df1.merge(df2, on=["column1","column2"])
Answered By: Ali karimi

this work for me , for n files xls

# all_reports_paths contain one array with all paths per files
for a in all_reports_paths:
    
    df.append( pd.read_excel(a,skiprows=X,skipfooter=X))

df_glob = pd.DataFrame(columns=columns)

for dataframe in df:

    df_glob = pd.concat([df_glob,pd.DataFrame(dataframe)],axis=0)

# finally df_glob contain all data
  1. It merges according to the ordering of left_on and right_on, i.e., the i-th element of left_on will match with the i-th of right_on.

    In the example below, the code on the top matches A_col1 with B_col1 and A_col2 with B_col2, while the code on the bottom matches A_col1 with B_col2 and A_col2 with B_col1. Evidently, the results are different.

    res1

  2. As can be seen from the above example, if the merge keys have different names, all keys will show up as their individual columns in the merged dataframe. In the example above, in the top dataframe, A_col1 and B_col1 are identical and A_col2 and B_col2 are identical. In the bottom dataframe, A_col1 and B_col2 are identical and A_col2 and B_col1 are identical. Since these are duplicate columns, they are most likely not needed. One way to not have this problem from the beginning is to make the merge keys identical from the beginning. See bullet point #3 below.

  3. If left_on and right_on are the same col1 and col2, we can use on=['col1', 'col2']. In this case, no merge keys are duplicated.

    df1.merge(df2, on=['col1', 'col2'])
    

    res3

  4. You can also merge one side on column names and the other side on index too. For example, in the example below, df1‘s columns are matched with df2‘s indices. If the indices are named, as in the example below, you can reference them by name but if not, you can also use right_index=True (or left_index=True if the left dataframe is the one being merged on index).

    df1.merge(df2, left_on=['A_col1', 'A_col2'], right_index=True)
    # or
    df1.merge(df2, left_on=['A_col1', 'A_col2'], right_on=['B_col1', 'B_col2'])
    

    res3

  5. By using the how= parameter, you can perform LEFT JOIN (how='left'), FULL OUTER JOIN (how='outer') and RIGHT JOIN (how='right') as well. The default is INNER JOIN (how='inner') as in the examples above.

  6. If you have more than 2 dataframes to merge and the merge keys are the same across all of them, then join method is more efficient than merge because you can pass a list of dataframes and join on indices. Note that the index names are the same across all dataframes in the example below (col1 and col2). Note that the indices don’t have to have names; if the indices don’t have names, then the number of the multi-indices must match (in the case below there are 2 multi-indices). Again, as in bullet point #1, the match occurs according to the ordering of the indices.

    df1.join([df2, df3], how='inner').reset_index()
    

    res4

Answered By: cottontail
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.