pandas: merged (inner join) data frame has more rows than the original ones
Question:
I am using Python 3.4 in a Jupyter Notebook and trying to merge two data frames as shown below:
df_A.shape
(204479, 2)
df_B.shape
(178, 3)
new_df = pd.merge(df_A, df_B, how='inner', on='my_icon_number')
new_df.shape
(266788, 4)
I thought new_df should have fewer rows than df_A, since the merge is an inner join. Why does new_df actually have more rows than df_A?
Here is what I actually want. My df_A looks like this:
id    my_icon_number
--------------------
A1    123
B1    234
C1    123
D1    235
E1    235
F1    400
and my df_B looks like this:
my_icon_number    color     size
----------------------------------
123               blue      small
234               red       large
235               yellow    medium
Then I want new_df to be:
id    my_icon_number    color     size
----------------------------------------
A1    123               blue      small
B1    234               red       large
C1    123               blue      small
D1    235               yellow    medium
E1    235               yellow    medium
I don’t really want to remove duplicates of my_icon_number in df_A. Any idea what I missed here?
Answers:
Because you have duplicates of the merge column in both data sets, you’ll get k * m rows with that merge column value, where k is the number of rows with that value in data set 1 and m is the number of rows with that value in data set 2.
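The k * m rule can be checked before merging by counting key occurrences on each side. The following is a minimal sketch on made-up data; the frame names left and right and their values are hypothetical, not taken from the question.

```python
import pandas as pd

# Hypothetical toy frames: key 123 appears twice on the left and three times on the right.
left = pd.DataFrame({"my_icon_number": [123, 123, 234], "id": ["A1", "C1", "B1"]})
right = pd.DataFrame({"my_icon_number": [123, 123, 123, 999], "color": ["blue"] * 4})

# k and m: how often each key value occurs on each side.
k = left["my_icon_number"].value_counts()
m = right["my_icon_number"].value_counts()

# Multiply the counts key by key; a key missing on one side counts as 0 there.
predicted_rows = int(k.mul(m, fill_value=0).sum())

merged = pd.merge(left, right, how="inner", on="my_icon_number")
print(predicted_rows, len(merged))
```

If the predicted count is larger than the number of rows in the left frame, at least one key is repeated in the right frame.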
Try drop_duplicates on the merge column before merging:
dfa = df_A.drop_duplicates(subset=['my_icon_number'])
dfb = df_B.drop_duplicates(subset=['my_icon_number'])
new_df = pd.merge(dfa, dfb, how='inner', on='my_icon_number')
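Since the question asks to keep the repeated my_icon_number values in df_A, one possible variation is to de-duplicate only df_B and let pandas check the key relationship through the validate argument of pd.merge. This is a sketch under assumptions: the sample rows come from the question, and the extra duplicate row for 123 in df_B is invented here to reproduce the problem.

```python
import pandas as pd

df_A = pd.DataFrame({
    "id": ["A1", "B1", "C1", "D1", "E1", "F1"],
    "my_icon_number": [123, 234, 123, 235, 235, 400],
})
# Assumption: df_B contains an accidental repeat of key 123 (invented for illustration).
df_B = pd.DataFrame({
    "my_icon_number": [123, 234, 235, 123],
    "color": ["blue", "red", "yellow", "blue"],
    "size": ["small", "large", "medium", "small"],
})

# De-duplicate only the lookup side; df_A keeps its repeated keys.
dfb = df_B.drop_duplicates(subset=["my_icon_number"])

# validate="many_to_one" raises pandas.errors.MergeError if dfb still has duplicate keys.
new_df = pd.merge(df_A, dfb, how="inner", on="my_icon_number", validate="many_to_one")
print(new_df)
```

By default drop_duplicates keeps the first row for each key, so if the repeated rows in df_B disagree on color or size, it is worth inspecting them before discarding any.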
Example
In this example, the only value in common is 4, but it appears 3 times in each data set. That means the merge should produce 3 * 3 = 9 rows, one for every combination.
import pandas as pd
df_A = pd.DataFrame(dict(my_icon_number=[1, 2, 3, 4, 4, 4], other_column1=range(6)))
df_B = pd.DataFrame(dict(my_icon_number=[4, 4, 4, 5, 6, 7], other_column2=range(6)))
pd.merge(df_A, df_B, how='inner', on='my_icon_number')
   my_icon_number  other_column1  other_column2
0               4              3              0
1               4              3              1
2               4              3              2
3               4              4              0
4               4              4              1
5               4              4              2
6               4              5              0
7               4              5              1
8               4              5              2