Pandas Left Outer Join results in table larger than left table

Question:

From what I understand about a left outer join, the resulting table should never have more rows than the left table… please let me know if this is wrong…

My left table has 192572 rows and 8 columns.

My right table has 42160 rows and 5 columns.

My left table has a field called ‘id’ which matches a column in my right table called ‘key’.

Therefore I merge them as such:

combined = pd.merge(a,b,how='left',left_on='id',right_on='key')

But then the combined table has 236569 rows.

What am I misunderstanding?

Asked By: Terence Chow


Answers:

You can expect the row count to increase if a key matches more than one row in the other DataFrame:

In [11]: df = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])

In [12]: df2 = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'C'])

In [13]: df.merge(df2, how='left')  # merges on the shared column 'A'
Out[13]: 
   A  B   C
0  1  3   5
1  1  3   6
2  2  4 NaN

To avoid this behaviour, drop the duplicates in df2:

In [21]: df2.drop_duplicates(subset=['A'])  # keeps the first row per key; pass keep='last' to keep the last instead
Out[21]: 
   A  C
0  1  5

In [22]: df.merge(df2.drop_duplicates(subset=['A']), how='left')
Out[22]: 
   A  B   C
0  1  3   5
1  2  4 NaN
Answered By: Andy Hayden

There are also strategies for avoiding this behavior that do not involve losing the duplicated data, for example if not all columns are duplicated. If you have

In [1]: df = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])

In [2]: df2 = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'C'])

One way would be to take the mean of the duplicates (you could also take the sum, etc.):

In [3]: df3 = df2.groupby('A').mean().reset_index()

In [4]: df3
Out[4]:
   A    C
0  1  5.5

In [5]: merged = pd.merge(df, df3, on=['A'], how='outer')

In [6]: merged
Out[6]:
   A  B    C
0  1  3  5.5
1  2  4  NaN

Alternatively, if you have non-numeric data that cannot be converted with pd.to_numeric(), or if you simply do not want to take the mean, you can alter the merge key by enumerating the duplicates. Note that this strategy applies when the duplicates exist in both datasets (which causes the same problematic behavior and is also a common problem):

In [7]: df = pd.DataFrame([['a', 3], ['b', 4], ['b', 0]], columns=['A', 'B'])

In [8]: df2 = pd.DataFrame([['a', 3], ['b', 8], ['b', 5]], columns=['A', 'C'])

In [9]: df['count'] = df.groupby('A')['B'].cumcount()

In [10]: df['A'] = np.where(df['count'] > 0, df['A'] + df['count'].astype(str), df['A'].astype(str))  # np is numpy (import numpy as np)

In [11]: df
Out[11]: 
    A  B  count
0   a  3      0
1   b  4      0
2  b1  0      1

Do the same for df2, drop the count variables in df and df2 and merge on ‘A’:
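The elided commands might look like this sketch (the same cumcount renaming applied to df2, then the cleanup and merge; assuming numpy is imported as np, as above):

df2['count'] = df2.groupby('A')['C'].cumcount()
df2['A'] = np.where(df2['count'] > 0, df2['A'] + df2['count'].astype(str), df2['A'].astype(str))

# drop the helper columns, then merge on the now-unique key
df = df.drop('count', axis=1)
df2 = df2.drop('count', axis=1)
merged = df.merge(df2, on='A', how='left')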

In [16]: merged
Out[16]: 
    A  B  C
0   a  3  3        
1   b  4  8        
2  b1  0  5        

A couple of notes. In this last case I use .cumcount() instead of .duplicated() because there may be more than one duplicate for a given observation. I also use .astype(str) to convert the count values to strings because np.where() concatenates them onto the string key; using pd.concat() or something else might allow for different applications.

Finally, if only one dataset has the duplicates but you still want to keep them, you can use the first half of the latter strategy to differentiate the duplicates in the resulting merge.

Answered By: seeiespi

A small addition to the given answers: pd.merge has a validate parameter that will raise an error if the merge keys are duplicated in the right table:

combined = pd.merge(a, b, how='left', left_on='id', right_on='key', validate='m:1')
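For example, with a toy right table that has a duplicated key, the merge fails fast instead of silently adding rows (a minimal sketch, frame contents made up):

import pandas as pd

a = pd.DataFrame({'id': [1, 2]})
b = pd.DataFrame({'key': [1, 1], 'val': [5, 6]})  # key 1 appears twice

# Raises pandas.errors.MergeError because the right-hand keys
# are not unique, so the merge is not many-to-one
pd.merge(a, b, how='left', left_on='id', right_on='key', validate='m:1')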
Answered By: Tobias Dekker

Use drop_duplicates; in your case that would be:

merged = pd.merge(df, df3, on=['A'], how='outer').drop_duplicates()
Answered By: Beknazar Osmonov

There could be multiple entries with the same key value(s). Make sure there are no duplicates with respect to key in the right table.

# One workaround: remove duplicates from the right table w.r.t. key
combined = pd.merge(a.reset_index(), b.drop_duplicates(['key']), how='left', left_on='id', right_on='key')
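Before merging, a quick way to check whether the right table's keys are unique (a small sketch using the question's frame names):

# True if every key in b is unique; if False, a left join on it can add rows
b['key'].is_unique

# How many rows carry a repeated key
b['key'].duplicated().sum()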


Answered By: Raushan Kumar

To fix this, create a unique INDEX column in the LEFT DataFrame, so you can check that "INDEX" column for duplicates after the merged DataFrame is ready (a runnable sketch follows the list):

  1. LEFT_df['INDEX'] = LEFT_df.index + 1
  2. LEFT_df.shape
  3. Merged_df = pd.merge(LEFT_df, Right_df, how='left', on='Common column')
  4. Merged_df['INDEX'].duplicated().sum()
  5. Merged_df = Merged_df.drop_duplicates(subset=['INDEX'], keep='first')
  6. Merged_df.shape (will now match LEFT_df.shape)
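Putting those steps together as a runnable sketch (toy frames stand in for the real LEFT_df and Right_df):

import pandas as pd

# Toy frames standing in for LEFT_df and Right_df
LEFT_df = pd.DataFrame({'Common column': [1, 2], 'B': [3, 4]})
Right_df = pd.DataFrame({'Common column': [1, 1], 'C': [5, 6]})

# 1-2. Tag each left-hand row with a unique INDEX and note the shape
LEFT_df['INDEX'] = LEFT_df.index + 1
print(LEFT_df.shape)

# 3. Merge as usual
Merged_df = pd.merge(LEFT_df, Right_df, how='left', on='Common column')

# 4. Any INDEX appearing more than once was multiplied by duplicate right-hand keys
print(Merged_df['INDEX'].duplicated().sum())

# 5-6. Keep the first match per left-hand row; the shape now matches LEFT_df
Merged_df = Merged_df.drop_duplicates(subset=['INDEX'], keep='first')
print(Merged_df.shape)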
Answered By: Aisha Khalid