Number of rows changes even after `pandas.merge` with `left` option

Question:

I am merging two data frames using pandas.merge. Even after specifying how = left option, I found the number of rows of merged data frame is larger than the original. Why does this happen?

panel = pd.read_csv(file1, encoding ='cp932')
before_len = len(panel)

prof_2000 = pd.read_csv(file2, encoding ='cp932').drop_duplicates()

temp_2000 = pd.merge(panel, prof_2000, left_on='Candidate_u', right_on="name2", how="left")

after_len =  len(temp_2000)
print(before_len, after_len)
> 12661 13915
Asked By: user51966

||

Answers:

This sounds like having more than one rows in right under 'name2' that match the key you have set for the left. Using option 'how='left' with pandas.DataFrame.merge() only means that:

  • left: use only keys from left frame

However, the actual number of rows in the result object is not necessarily going to be the same as the number of rows in the left object.

Example:

In [359]: df_1
Out[359]: 
   A    B
0  a  AAA
1  b  BBA
2  c  CCF

and then another DF that looks like this (notice that there are more than one entry for your desired key on the left):

In [360]: df_3
Out[360]: 
  key  value
0   a      1
1   a      2
2   b      3
3   a      4

If I merge these two on left.A, here’s what happens:

In [361]: df_1.merge(df_3, how='left', left_on='A', right_on='key')
Out[361]: 
   A    B  key  value
0  a  AAA    a    1.0
1  a  AAA    a    2.0
2  a  AAA    a    4.0
3  b  BBA    b    3.0
4  c  CCF  NaN    NaN

This happened even though I merged with how='left' as you can see above, there were simply more than one rows to merge and as shown here the result pd.DataFrame has in fact more rows than the pd.DataFrame on the left.

I hope this helps!

Answered By: Thanos

The problem of doubling of rows after each merge() (of any type, ‘both’ or ‘left’) is usually caused by duplicates in any of the keys, so we need to drop them first:

left_df.drop_duplicates(subset=left_key, inplace=True)
right_df.drop_duplicates(subset=right_key, inplace=True)
Answered By: mirekphd

If you do not have any duplication, as indicated in the above answer. You should double-check the names of removed entries. In my case, I discovered that the names of removed entries are inconsistent between the df1 and df2 and I solved the problem by:

df1["col1"] = df2["col2"]
Answered By: Hamzah

Do we have any another way to deal this situation instead of drop_duplicates?

How we get the first match of B dataframe values in dataframe A so there is no rows duplication?

Answered By: Madhur
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.