Join pandas dataframes based on column values

Question:

I’m quite new to pandas dataframes, and I’m experiencing some troubles joining two tables.

The first df has just 3 columns:

DF1:

item_id    position    document_id
336        1           10
337        2           10
338        3           10
1001       1           11
1002       2           11
1003       3           11
38         10          146

And the second has exactly same two columns (and plenty of others):

DF2:

item_id    document_id    col1    col2   col3    ...
337        10             ...     ...    ...
1002       11             ...     ...    ...
1003       11             ...     ...    ...

What I need is to perform an operation which, in SQL, would look as follows:

DF1 join DF2 on 
DF1.document_id = DF2.document_id
and
DF1.item_id = DF2.item_id

And, as a result, I want to see DF2, complemented with column ‘position’:

item_id    document_id    position    col1   col2   col3   ...

What is a good way to do this using pandas?

Asked By: fremorie

||

Answers:

I think you need merge with default inner join, but is necessary no duplicated combinations of values in both columns:

print (df2)
   item_id  document_id col1  col2  col3
0      337           10    s     4     7
1     1002           11    d     5     8
2     1003           11    f     7     0

df = pd.merge(df1, df2, on=['document_id','item_id'])
print (df)
   item_id  position  document_id col1  col2  col3
0      337         2           10    s     4     7
1     1002         2           11    d     5     8
2     1003         3           11    f     7     0

But if necessary position column in position 3:

df = pd.merge(df2, df1, on=['document_id','item_id'])
cols = df.columns.tolist()
df = df[cols[:2] + cols[-1:] + cols[2:-1]]
print (df)
   item_id  document_id  position col1  col2  col3
0      337           10         2    s     4     7
1     1002           11         2    d     5     8
2     1003           11         3    f     7     0
Answered By: jezrael

If you’re merging on all common columns as in the OP, you don’t even need to pass on=, simply calling merge() will do the job.

merged_df = df1.merge(df2)

res1

The reason is that under the hood, if on= is not passed, pd.Index.intersection is called on the columns to determine the common columns and merge on all of them.

A special thing about merging on common columns is that it doesn’t matter which dataframe is on the right or the left, the rows filtered are the same because they are selected by looking up matching rows on the common columns. The only difference is where the columns are positioned; the columns in the right dataframe that are not in the left dataframe will be added to the right of the columns on the left dataframe. So unless the order of the columns matter (which can be very easily fixed using column selection or reindex()), it doesn’t really matter which dataframe is on the right and which is on the left. In other words,

df12 = df1.merge(df2, on=['document_id','item_id']).sort_index(axis=1)
df21 = df2.merge(df1, on=['document_id','item_id']).sort_index(axis=1)

# df12 and df21 are the same.
df12.equals(df21)     # True

This is not true if the columns to be merged on don’t have the same name and you have to pass left_on= and right_on= (see #1 in this answer).

Answered By: cottontail
Categories: questions Tags: , , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.