How to join two dataframes where column values fall within a certain range across multiple columns using pandas?
Question:
Referring to the same question asked here (join dataframes for single column), I now want to extend this to two more columns. For example:
df1:
price_start price_end year_start year_end score
10 50 2001 2005 20
60 100 2001 2005 50
10 50 2006 2010 30
df2:
Price year
10 2001
70 2002
50 2010
Now I want to map the score from df1 onto the matching rows of df2.
expected output:
price year score
10 2001 20
70 2002 50
50 2010 30
Answers:
Solution 1: Simple solution for small datasets
For a small dataset, you can cross join df1 and df2 with .merge(how='cross'), then keep only the rows where Price and year fall within their respective ranges using .query():
(df1.merge(df2, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)
If your Pandas version is older than 1.2.0 (released in December 2020) and does not support merge with how='cross', you can emulate the cross join with a dummy key:
(df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)
Result:
Price year score
0 10 2001 20
4 70 2002 50
8 50 2010 30
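For reference, Solution 1 can be reproduced end to end with the sample data from the question (the DataFrame construction below is inferred from the tables above):

```python
import pandas as pd

df1 = pd.DataFrame({
    'price_start': [10, 60, 10],
    'price_end':   [50, 100, 50],
    'year_start':  [2001, 2001, 2006],
    'year_end':    [2005, 2005, 2010],
    'score':       [20, 50, 30],
})
df2 = pd.DataFrame({'Price': [10, 70, 50],
                    'year':  [2001, 2002, 2010]})

# Cross join pairs every df1 row with every df2 row (9 rows here),
# then .query() keeps only the pairs where both ranges match.
result = (
    df1.merge(df2, how='cross')
       .query('(Price >= price_start) & (Price <= price_end) & '
              '(year >= year_start) & (year <= year_end)')
       [['Price', 'year', 'score']]
)
print(result)
```

The surviving index labels (0, 4, 8) come from the 9-row cross join; add .reset_index(drop=True) if you want a clean 0..n index.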
Solution 2: NumPy solution for large datasets
For a large dataset where performance is a concern, you can use NumPy broadcasting (instead of a cross join followed by filtering) to speed up execution. We look for rows where Price in df2 falls within the price range in df1, and year in df2 falls within the year range in df1:
import numpy as np

d2_P = df2.Price.values
d2_Y = df2.year.values
d1_PS = df1.price_start.values
d1_PE = df1.price_end.values
d1_YS = df1.year_start.values
d1_YE = df1.year_end.values
i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))
pd.DataFrame(
    np.column_stack([df1.values[j], df2.values[i]]),
    columns=df1.columns.append(df2.columns)
)[['Price', 'year', 'score']]
Result:
Price year score
0 10 2001 20
1 70 2002 50
2 50 2010 30
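To see what the broadcasting step does, here is a minimal sketch using only the sample values from the question; the comments note the array shapes:

```python
import numpy as np

d2_P = np.array([10, 70, 50])           # df2.Price, shape (3,)
d2_Y = np.array([2001, 2002, 2010])     # df2.year
d1_PS = np.array([10, 60, 10])          # df1.price_start, shape (3,)
d1_PE = np.array([50, 100, 50])         # df1.price_end
d1_YS = np.array([2001, 2001, 2006])    # df1.year_start
d1_YE = np.array([2005, 2005, 2010])    # df1.year_end

# d2_P[:, None] has shape (3, 1); comparing it with a (3,) array
# broadcasts to a (3, 3) boolean matrix whose entry [i, j] pairs
# df2 row i with df1 row j.
mask = ((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE)
        & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))

# i indexes matching df2 rows, j the corresponding df1 rows.
i, j = np.where(mask)
print(i, j)   # [0 1 2] [0 1 2]
```

The mask plays the role of the .query() filter in Solution 1, but it is built on raw arrays without ever materializing the cross-joined DataFrame.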
Performance Comparison
Part 1: Compare for original datasets of 3 rows each:
Solution 1:
%%timeit
(df1.merge(df2, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)
5.91 ms ± 87.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Solution 2:
%%timeit
d2_P = df2.Price.values
d2_Y = df2.year.values
d1_PS = df1.price_start.values
d1_PE = df1.price_end.values
d1_YS = df1.year_start.values
d1_YE = df1.year_end.values
i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))
pd.DataFrame(
    np.column_stack([df1.values[j], df2.values[i]]),
    columns=df1.columns.append(df2.columns)
)[['Price', 'year', 'score']]
703 µs ± 9.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Benchmark summary: 5.91 ms vs 703 µs, i.e. 8.4x faster
Part 2: Compare for datasets with 3,000 and 30,000 rows:
Data Setup:
df1a = pd.concat([df1] * 1000, ignore_index=True)
df2a = pd.concat([df2] * 10000, ignore_index=True)
Solution 1:
%%timeit
(df1a.merge(df2a, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)
27.5 s ± 3.24 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 2:
%%timeit
d2_P = df2a.Price.values
d2_Y = df2a.year.values
d1_PS = df1a.price_start.values
d1_PE = df1a.price_end.values
d1_YS = df1a.year_start.values
d1_YE = df1a.year_end.values
i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))
pd.DataFrame(
    np.column_stack([df1a.values[j], df2a.values[i]]),
    columns=df1a.columns.append(df2a.columns)
)[['Price', 'year', 'score']]
3.83 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Benchmark summary: 27.5 s vs 3.83 s, i.e. 7.2x faster
Another option is conditional_join from pyjanitor; it is efficient for range joins and avoids the memory cost of a naive cross join:
# pip install pyjanitor
# you can also install the dev version for the latest
# including the ability to use numba for faster performance
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd
(df1
.conditional_join(
df2,
('price_start', 'Price', '<='),
('price_end', 'Price', '>='),
('year_start', 'year', '<='),
('year_end', 'year', '>='))
.loc(axis=1)['Price','year','score']
)
Price year score
0 10 2001 20
1 70 2002 50
2 50 2010 30
With the dev version, you can also select which columns to keep:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd
(df1
.conditional_join(
df2,
('price_start', 'Price', '<='),
('price_end', 'Price', '>='),
('year_start', 'year', '<='),
('year_end', 'year', '>='),
use_numba = False,
right_columns = ['Price', 'year'],
df_columns = 'score')
)
score Price year
0 20 10 2001
1 50 70 2002
2 30 50 2010
With the dev version, if you have numba installed, you can set use_numba=True for more performance.