how to perform an inner or outer join of DataFrames with Pandas on non-simplistic criterion

Question:

Given two dataframes as below:

>>> import pandas as pd

>>> df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}])
>>> df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}])
>>> df_a
   a  b
0  1  4
1  2  5
2  3  6

>>> df_b
   c  d
0  2  7
1  3  8

we would like to produce a SQL-style join of both dataframes using a non-simplistic criteria, let’s say “df_b.c > df_a.a”. From what I can tell, while merge() is certainly part of the solution, I can’t use it directly since it doesn’t accept arbitrary expressions for “ON” criteria (unless I’m missing something?).

In SQL, the results look like this:

# inner join
sqlite> select * from df_a join df_b on c > a;
1|4|2|7
1|4|3|8
2|5|3|8

# outer join
sqlite> select * from df_a left outer join df_b on c > a;
1|4|2|7
1|4|3|8
2|5|3|8
3|6||

my current approach for inner join is to produce a cartesian product
of df_a and df_b, by adding a column of “1”s to both, then using
merge() on the “1”s column, then applying the “c > a” criteria.

>>> import numpy as np
>>> df_a['ones'] = np.ones(3)
>>> df_b['ones'] = np.ones(2)
>>> cartesian = pd.merge(df_a, df_b, left_on='ones', right_on='ones')
>>> cartesian
   a  b  ones  c  d
0  1  4     1  2  7
1  1  4     1  3  8
2  2  5     1  2  7
3  2  5     1  3  8
4  3  6     1  2  7
5  3  6     1  3  8
>>> cartesian[cartesian.c > cartesian.a]
   a  b  ones  c  d
0  1  4     1  2  7
1  1  4     1  3  8
3  2  5     1  3  8

for outer join, I’m not sure of the best way to go, so far
I’ve been playing with getting the inner join, then applying the negation
of the criteria to get all the other rows, then trying to edit that
“negation” set onto the original, but it doesn’t really work.

Edit. HYRY answered the specific question here but I needed something more generic and more within the Pandas API, as my join criterion could be anything, not just that one comparison. For outerjoin, first I’m adding an extra index to the “left” side that will maintain itself after I do the inner join:

df_a['_left_index'] = df_a.index

then we do the cartesian and get the inner join:

cartesian = pd.merge(df_a, df_b, left_on='ones', right_on='ones')
innerjoin = cartesian[cartesian.c > cartesian.a]

then I get the additional index ids in “df_a” that we’ll need, and get the rows from “df_a”:

remaining_left_ids = set(df_a['_left_index']).
                    difference(innerjoin['_left_index'])
remaining = df_a.ix[remaining_left_ids]

then we use a straight concat(), which replaces missing columns with “NaN” for left (I thought it wasn’t doing this earlier but I guess it does):

outerjoin = pd.concat([innerjoin, remaining]).reset_index()

HYRY’s idea to do the cartesian on just those cols that we need to compare on is basically the right answer, though in my specific case it might be a little tricky to implement (generalized and all).

questions:

  1. How would you produce a “join” of df_1 and df_2 on “c > a”? Would
    you do the same “cartesian product, filter” approach or is there some better
    way?

  2. How would you produce the “left outer join” of same?

Asked By: zzzeek

||

Answers:

I use the outer method of ufunc to calculate the result, here is the example:

First, some data:

import pandas as pd
import numpy as np
df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}, {"a": 4, "b": 8}, {"a": 1, "b": 7}])
df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}, {"c": 2, "d": 10}])
print "df_a"
print df_a
print "df_b"
print df_b

output:

df_a
   a  b
0  1  4
1  2  5
2  3  6
3  4  8
4  1  7
df_b
   c   d
0  2   7
1  3   8
2  2  10

Inner join, because this only calculate the cartesian product of c & a, memory useage is less than cartesian product of the whole DataFrame:

ia, ib = np.where(np.less.outer(df_a.a, df_b.c))
print pd.concat((df_a.take(ia).reset_index(drop=True), 
                 df_b.take(ib).reset_index(drop=True)), axis=1)

output:

   a  b  c   d
0  1  4  2   7
1  1  4  3   8
2  1  4  2  10
3  2  5  3   8
4  1  7  2   7
5  1  7  3   8
6  1  7  2  10

to calculate the left outer join, use numpy.setdiff1d() to find all the rows of df_a that not in the inner join:

na = np.setdiff1d(np.arange(len(df_a)), ia)
nb = -1 * np.ones_like(na)
oa = np.concatenate((ia, na))
ob = np.concatenate((ib, nb))
print pd.concat([df_a.take(oa).reset_index(drop=True), 
                 df_b.take(ob).reset_index(drop=True)], axis=1)

output:

   a  b   c   d
0  1  4   2   7
1  1  4   3   8
2  1  4   2  10
3  2  5   3   8
4  1  7   2   7
5  1  7   3   8
6  1  7   2  10
7  3  6 NaN NaN
8  4  8 NaN NaN
Answered By: HYRY

This can be done like this with broadcasting and np.where. Use whatever binary operator you want that evaluates to True/False:

import operator as op

df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}])
df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}])

binOp   = op.lt
matches = np.where(binOp(df_a.a[:,None],df_b.c.values))

print pd.concat([df.ix[idxs].reset_index(drop=True) 
                 for df,idxs in zip([df_a,df_b],matches)],
                axis=1).to_csv()

,a,b,c,d

0,1,4,2,7

1,1,4,3,8

2,2,5,3,8

Answered By: jharting

conditional_join from pyjanitor works great for non-equi joins:

# pip install pyjanitor
import pandas as pd
import janitor

inner join

 df_a.conditional_join(df_b, ('a', 'c', '<'))

  left    right
     a  b     c  d
0    1  4     2  7
1    1  4     3  8
2    2  5     3  8

left join

df_a.conditional_join(df_b, ('a', 'c', '<'), how = 'left')

  left    right
     a  b     c    d
0    1  4   2.0  7.0
1    1  4   3.0  8.0
2    2  5   3.0  8.0
3    3  6   NaN  NaN

The function takes a variable (*args) arguments of tuples for the conditions (col from left, col from_right, join operator)

Answered By: sammywemmy
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.