Intersection of two pandas dataframes based on column entries

Question

Suppose I have two DataFrames like so:

>>dfA
        S        T            prob
0   ! ! !    ! ! !   8.1623999e-05
1   ! ! !  ! ! ! "   0.00354090007
2   ! ! !  ! ! ! .   0.00210241997
3   ! ! !  ! ! ! ?  6.55684998e-05
4   ! ! !  ! ! !       0.203119993
5   ! ! !  ! ! ! ”  6.62070015e-05
6   ! ! !      ! !   0.00481862016
7   ! ! !        !    0.0274260994
8   ! ! !  " ! ! !  7.99940026e-05
9   ! ! !      " !  1.51188997e-05
10  ! ! !        "  8.50678989e-05

>>dfB
        S        T                              knstats
0   ! ! !  ! ! ! !              knstats=2,391,104,64,25
1   ! ! !  ! ! ! "                 knstats=4,391,6,64,2
2   ! ! !  ! ! ! .                 knstats=4,391,5,64,2
3   ! ! !  ! ! ! ?                 knstats=1,391,4,64,4
4   ! ! !    ! ! !            knstats=220,391,303,64,55
5   ! ! !      ! !            knstats=16,391,957,64,115
6   ! ! !        !           knstats=28,391,5659,64,932
7   ! ! !  " ! ! !                 knstats=2,391,2,64,1
8   ! ! !      " !               knstats=1,391,37,64,13
9   ! ! !        "  knstats=2,391,1.11721e+06,64,180642
10  ! !        . "        knstats=2,391,120527,64,20368

I want to create a new DataFrame which is composed of the rows which have matching "S" and "T" entries in both matrices, along with the prob column from dfA and the knstats column from dfB. The result should look something like the following, and it is important that the order is the same:

       S        T            prob                              knstats
0  ! ! !  ! ! ! !   8.1623999e-05              knstats=2,391,104,64,25
1  ! ! !  ! ! ! "   0.00354090007                 knstats=4,391,6,64,2
2  ! ! !  ! ! ! .   0.00210241997                 knstats=4,391,5,64,2
3  ! ! !  ! ! ! ?  6.55684998e-05                 knstats=1,391,4,64,4
4  ! ! !    ! ! !     0.203119993            knstats=220,391,303,64,55
5  ! ! !      ! !   0.00481862016            knstats=16,391,957,64,115
6  ! ! !        !    0.0274260994           knstats=28,391,5659,64,932
7  ! ! !  " ! ! !  7.99940026e-05                 knstats=2,391,2,64,1
8  ! ! !      " !  1.51188997e-05               knstats=1,391,37,64,13
9  ! ! !        "  8.50678989e-05  knstats=2,391,1.11721e+06,64,180642

Asked By: Bib

||

Source

Answer 1

You can merge them so:

s1 = pd.merge(dfA, dfB, how='inner', on=['S', 'T'])

To drop NA rows:

s1.dropna(inplace=True)

Answered By: user308827

Answer 2

To select the rows from the two dataframes where S and T values intersect, one could use Index.intersection(). The idea is to find the common rows (i.e. intersection) in S and T columns in the two dataframes and select only those rows.

# convert S and T columns into MultiIndex
dfA_idx = pd.MultiIndex.from_frame(dfA[['S', 'T']])
dfB_idx = pd.MultiIndex.from_frame(dfB[['S', 'T']])
# get intersecting rows in S and T
common_idx = dfA_idx.intersection(dfB_idx)
# filter rows where S and T intersect
dfA_common = dfA.set_index(['S', 'T']).reindex(common_idx).reset_index()
dfB_common = dfB.set_index(['S', 'T']).reindex(common_idx).reset_index()

The concatenation of these frames will produce the same result as merge(), i.e.

x = dfA_common.join(dfB_common[['knstats']])
y = dfA.merge(dfB, on=['S', 'T'])
x.equals(y)                                    # True

Answered By: cottontail

Intersection of two pandas dataframes based on column entries

Question:

Answers: