How to join two pandas dataframes where column values fall within a certain range, for multiple columns?

Question:

Referring to the question asked here about joining dataframes on a single column's range, I now want to extend this to two columns. For example:

df1:

price_start  price_end  year_start  year_end  score
         10         50        2001      2005     20
         60        100        2001      2005     50
         10         50        2006      2010     30

df2:

Price  year
   10  2001
   70  2002
   50  2010

Now I want to map the score from df1 onto the matching rows of df2.

expected output:

price  year  score
   10  2001     20
   70  2002     50
   50  2010     30
Asked By: Devaraj Nadiger


Answers:
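
Both solutions below assume the sample frames are built roughly as follows (a minimal sketch reconstructing the tables in the question):

import pandas as pd

df1 = pd.DataFrame({
    'price_start': [10, 60, 10],
    'price_end':   [50, 100, 50],
    'year_start':  [2001, 2001, 2006],
    'year_end':    [2005, 2005, 2010],
    'score':       [20, 50, 30],
})

df2 = pd.DataFrame({'Price': [10, 70, 50], 'year': [2001, 2002, 2010]})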

Solution 1: Simple solution for a small dataset

For a small dataset, you can cross join df1 and df2 with .merge(how='cross'), then use .query() to keep only the rows where Price and year fall within their respective ranges:

(df1.merge(df2, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)
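
Note that the cross join materializes len(df1) × len(df2) rows (here 3 × 3 = 9) before .query() discards the non-matches, which is why this approach only suits small inputs.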

If your pandas version is older than 1.2.0 (released December 2020) and therefore does not support merge with how='cross', you can emulate the cross join with a dummy key:

(df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)

Result:

   Price  year  score
0     10  2001     20
4     70  2002     50
8     50  2010     30
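
The surviving index labels (0, 4, 8) are row positions in the cross-joined frame; append .reset_index(drop=True) to the chain if you want a clean sequential index.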

Solution 2: NumPy solution for a large dataset

For a large dataset where performance is a concern, you can use NumPy broadcasting (instead of a cross join followed by filtering) to speed up execution.

We look for rows where Price in df2 falls within a price range in df1 and year in df2 falls within the corresponding year range:

import numpy as np
import pandas as pd

# Extract the underlying NumPy arrays for fast vectorized comparisons
d2_P = df2.Price.values
d2_Y = df2.year.values

d1_PS = df1.price_start.values
d1_PE = df1.price_end.values
d1_YS = df1.year_start.values
d1_YE = df1.year_end.values

# Broadcasting the df2 values as column vectors against the df1 bounds as row
# vectors yields (len(df2), len(df1)) boolean matrices; np.where returns the
# index pairs (i = row in df2, j = row in df1) where all four conditions hold
i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))

# Stitch the matched rows of both frames back into a single DataFrame
pd.DataFrame(
    np.column_stack([df1.values[j], df2.values[i]]),
    columns=df1.columns.append(df2.columns)
)[['Price', 'year', 'score']]
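
Note that going through .values and np.column_stack upcasts to a common dtype (object if the frames mix types); here all columns are integers, so the result stays numeric.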

Result:

   Price  year  score
0     10  2001     20
1     70  2002     50
2     50  2010     30

Performance Comparison

Part 1: Comparison on the original datasets of 3 rows each:

Solution 1:

%%timeit
(df1.merge(df2, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)

5.91 ms ± 87.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Solution 2:

%%timeit
d2_P = df2.Price.values
d2_Y = df2.year.values

d1_PS = df1.price_start.values
d1_PE = df1.price_end.values
d1_YS = df1.year_start.values
d1_YE = df1.year_end.values

i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))

pd.DataFrame(
    np.column_stack([df1.values[j], df2.values[i]]),
    columns=df1.columns.append(df2.columns)
)[['Price', 'year', 'score']]

703 µs ± 9.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Benchmark summary: 5.91 ms vs 703 µs, i.e. roughly 8.4× faster

Part 2: Comparison on datasets with 3,000 and 30,000 rows:

Data Setup:

df1a = pd.concat([df1] * 1000, ignore_index=True)
df2a = pd.concat([df2] * 10000, ignore_index=True)
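
The cross join in Solution 1 now has to materialize 3,000 × 30,000 = 90 million intermediate rows before filtering, which dominates its runtime.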

Solution 1:

%%timeit
(df1a.merge(df2a, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)

27.5 s ± 3.24 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution 2:

%%timeit
d2_P = df2a.Price.values
d2_Y = df2a.year.values

d1_PS = df1a.price_start.values
d1_PE = df1a.price_end.values
d1_YS = df1a.year_start.values
d1_YE = df1a.year_end.values

i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))

pd.DataFrame(
    np.column_stack([df1a.values[j], df2a.values[i]]),
    columns=df1a.columns.append(df2a.columns)
)[['Price', 'year', 'score']]

3.83 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Benchmark summary: 27.5 s vs 3.83 s, i.e. roughly 7.2× faster

Answered By: SeaBean

One option is conditional_join from pyjanitor, which is efficient for range joins and performs better than a naive cross join:

# pip install pyjanitor
# you can also install the dev version for the latest
# including the ability to use numba for faster performance
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git

import janitor
import pandas as pd

(df1
.conditional_join(
    df2, 
    ('price_start', 'Price', '<='), 
    ('price_end', 'Price', '>='), 
    ('year_start', 'year', '<='), 
    ('year_end', 'year', '>='))
.loc(axis=1)['Price','year','score']
)
   Price  year  score
0     10  2001     20
1     70  2002     50
2     50  2010     30

With the dev version, you can select the columns as well:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git

import janitor
import pandas as pd

(df1
.conditional_join(
    df2, 
    ('price_start', 'Price', '<='), 
    ('price_end', 'Price', '>='), 
    ('year_start', 'year', '<='), 
    ('year_end', 'year', '>='),
    use_numba = False,
    right_columns = ['Price', 'year'],
    df_columns = 'score')
)
   score  Price  year
0     20     10  2001
1     50     70  2002
2     30     50  2010

With the dev version, if you have numba installed, you can turn on use_numba for more performance.
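
For example, a sketch of that call (assuming the dev version and numba are installed; identical to the example above except for the flag):

(df1
.conditional_join(
    df2,
    ('price_start', 'Price', '<='),
    ('price_end', 'Price', '>='),
    ('year_start', 'year', '<='),
    ('year_end', 'year', '>='),
    use_numba = True,
    right_columns = ['Price', 'year'],
    df_columns = 'score')
)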

Answered By: sammywemmy