Compare two pandas dataframe with different size

Question:

I have one massive pandas dataframe with this structure:

df1:
    A   B
0   0  12
1   0  15
2   0  17
3   0  18
4   1  45
5   1  78
6   1  96
7   1  32
8   2  45
9   2  78
10  2  44
11  2  10

And a second one, smaller like this:

df2
   G   H
0  0  15
1  1  45
2  2  31

I want to add a column to my first dataframe following this rule: column df1.C = df2.H when df1.A == df2.G

I manage to do it with for loops, but the database is massive and the code run really slowly so I am looking for a Pandas-way or numpy to do it.

Many thanks,

Boris

Asked By: boris

||

Answers:

You probably want to use a merge:

df=df1.merge(df2,left_on="A",right_on="G")

will give you a dataframe with 3 columns, but the third one’s name will be H

df.columns=["A","B","C"]

will then give you the column names you want

Answered By: WNG

You can use map by Series created by set_index:

df1['C'] = df1['A'].map(df2.set_index('G')['H'])
print (df1)
    A   B   C
0   0  12  15
1   0  15  15
2   0  17  15
3   0  18  15
4   1  45  45
5   1  78  45
6   1  96  45
7   1  32  45
8   2  45  31
9   2  78  31
10  2  44  31
11  2  10  31

Or merge with drop and rename:

df = df1.merge(df2,left_on="A",right_on="G", how='left')
        .drop('G', axis=1)
        .rename(columns={'H':'C'})
print (df)
    A   B   C
0   0  12  15
1   0  15  15
2   0  17  15
3   0  18  15
4   1  45  45
5   1  78  45
6   1  96  45
7   1  32  45
8   2  45  31
9   2  78  31
10  2  44  31
11  2  10  31
Answered By: jezrael

Here’s one vectorized NumPy approach –

idx = np.searchsorted(df2.G.values, df1.A.values)
df1['C'] = df2.H.values[idx]

idx could be computed in a simpler way with : df2.G.searchsorted(df1.A), but don’t think that would be anymore efficient, because we want to use the underlying array with .values for performance as done earlier.

Answered By: Divakar

If you only want to match mutual rows in both dataframes:

import pandas as pd

df1 = pd.DataFrame({'Name':['Sara'],'Special ability':['Walk on water']})
df1    
   Name Special ability
0  Sara   Walk on water

df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
     Name  Age
0    Sara    4
1  Gustaf   12
2  Patrik   11

df = df2.merge(df1, left_on='Name', right_on='Name', how='left')
df
     Name  Age Special ability
0    Sara    4             NaN
1  Gustaf   12   Walk on water
2  Patrik   11             NaN

This Can allso be done with more than one matching argument: (In this example Patrik from df1 does not exist in df2 becuse they have different ages and therfore will not merge)

df1 = pd.DataFrame({'Name':['Sara','Patrik'],'Special ability':['Walk on water','FireBalls'],'Age':[12,83]})

df1
     Name Special ability  Age
0    Sara   Walk on water   12
1  Patrik       FireBalls   83

df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
     Name  Age
0    Sara    4
1  Gustaf   12
2  Patrik   11

df = df2.merge(df1,left_on=['Name','Age'],right_on=['Name','Age'],how='left')
df
     Name  Age Special ability
0    Sara   12   Walk on water
1  Gustaf   12             NaN
2  Patrik   11             NaN
Answered By: Frans Sjöström
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.