How to merge two pandas DataFrames of different sizes/indices on floor(value x, value y)
Question:
Given 2 pandas dataframes:
df1 = pd.DataFrame({'col1': [0.5, 0.75, 1.1, 1.6, 2, 3, 5.5, 10, 11.2]})
df2 = pd.DataFrame({'col2': [0, 3, 10, 15]})
Each df1['col1'] value falls within a range formed by two consecutive df2['col2'] values:
df2['col2'].iloc[y] <= df1['col1'].iloc[x] < df2['col2'].iloc[y+1]
How can I merge df1 and df2 so that each value from df1['col1'] is mapped to the lower bound of the df2['col2'] range it falls in? E.g. df1['col1'].iloc[1] = 0.75 lies between df2['col2'].iloc[0] and df2['col2'].iloc[1] (0.75 fits the range 0 to 3), so df1['result'].iloc[1] = df2['col2'].iloc[0].
Expected result:
df1['result'] = [0, 0, 0, 0, 0, 3, 3, 10, 10]
Answers:
Use a custom generator to produce the needed sequence:
def gen_range_bounds(s1, s2):
    ranges = list(zip(s2[:-1], s2[1:]))  # collect consecutive ranges
    for v in s1:
        for low, high in ranges:
            if low <= v < high:
                yield low  # yield min bound of the range
                break

df1['result'] = list(gen_range_bounds(df1['col1'], df2['col2']))
print(df1)
    col1  result
0   0.50       0
1   0.75       0
2   1.10       0
3   1.60       0
4   2.00       0
5   3.00       3
6   5.50       3
7  10.00      10
8  11.20      10
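The nested loop above checks every range for every value, and it yields nothing for a value that falls outside all ranges, which would leave the list shorter than df1. A minimal sketch of an alternative built on the standard-library bisect module, assuming df2['col2'] is sorted ascending; the name lower_bounds is made up for illustration:

```python
from bisect import bisect_right

import pandas as pd

df1 = pd.DataFrame({'col1': [0.5, 0.75, 1.1, 1.6, 2, 3, 5.5, 10, 11.2]})
df2 = pd.DataFrame({'col2': [0, 3, 10, 15]})


def lower_bounds(values, bounds):
    """Yield, for each value, the lower bound of the half-open range
    [bounds[i], bounds[i+1]) containing it, or None if no range fits."""
    bounds = list(bounds)  # assumed to be sorted ascending
    for v in values:
        i = bisect_right(bounds, v)  # number of bounds <= v
        # i == 0: v is below the first bound
        # i == len(bounds): v is at or above the last bound
        yield bounds[i - 1] if 0 < i < len(bounds) else None


df1['result'] = list(lower_bounds(df1['col1'], df2['col2']))
print(df1)
```

Yielding None for out-of-range values keeps the output the same length as the input, so the assignment to df1['result'] does not fail on unexpected data.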
Use pd.Series.values.searchsorted(), which returns the indices where elements should be inserted to maintain order. For example:
df1 = pd.DataFrame({'col1': [0.5, 0.75, 1.1, 1.6, 2, 3, 5.5, 10, 11.2] })
df2 = pd.DataFrame({'col2': [0, 3, 10,15] })
df2['col2'].values.searchsorted(0.5) # return 1
df2['col2'].values.searchsorted(5.5) # return 2
df2['col2'].values.searchsorted(10) # return 2
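You want the values rather than the indices, so look them up in df2['col2'] (side='right' makes a value equal to a bound, such as 3, map to that bound):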
# get indices, return: [1 1 1 1 1 2 2 3 3]
indices = df2['col2'].values.searchsorted(df1['col1'], side='right')
# get values, return: [0, 0, 0, 0, 0, 3, 3, 10, 10]
df1['result'] = [df2['col2'].iloc[i-1] for i in indices]
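The list comprehension can be replaced by NumPy fancy indexing. This is a sketch under two assumptions: df2['col2'] is sorted ascending, and no df1 value is below the smallest bound (the explicit check is an addition, not part of the answer above):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1': [0.5, 0.75, 1.1, 1.6, 2, 3, 5.5, 10, 11.2]})
df2 = pd.DataFrame({'col2': [0, 3, 10, 15]})

bounds = df2['col2'].to_numpy()  # must be sorted ascending
values = df1['col1'].to_numpy()

# Position of the largest bound <= value for each element
pos = np.searchsorted(bounds, values, side='right') - 1

# A value below bounds[0] gives pos == -1, which would silently
# wrap around to the last bound, so reject it explicitly.
if (pos < 0).any():
    raise ValueError("some values fall below the smallest bound")

df1['result'] = bounds[pos]
print(df1)
```

Note that a value at or above the last bound (15 here) maps to 15 with this approach, even though the question's condition would place it outside every range.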
This looks like a form of inequality join; if that is the case, you can use conditional_join from pyjanitor to get your results:
# pip install pyjanitor
import pandas as pd
import janitor

(df1
 .conditional_join(
     df2.astype(float).assign(col3 = lambda f: f.col2.shift(-1).fillna(f.col2)),
     ('col1', 'col2', '>='), ('col1', 'col3', '<'),
     right_columns='col2')
)
    col1  col2
0   0.50   0.0
1   0.75   0.0
2   1.10   0.0
3   1.60   0.0
4   2.00   0.0
5   3.00   3.0
6   5.50   3.0
7  10.00  10.0
8  11.20  10.0
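If an extra dependency is not wanted, pandas' own pd.merge_asof may cover the same case. This is a different technique from conditional_join, sketched here on the assumption that both key columns are sorted ascending; the cast to float is needed because merge_asof requires both keys to share a dtype:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [0.5, 0.75, 1.1, 1.6, 2, 3, 5.5, 10, 11.2]})
df2 = pd.DataFrame({'col2': [0, 3, 10, 15]})

# direction='backward' picks, for each col1, the last col2 that is <= col1.
# Both frames must already be sorted by their key column.
result = pd.merge_asof(
    df1,
    df2.astype({'col2': 'float64'}),  # match the float dtype of col1
    left_on='col1',
    right_on='col2',
    direction='backward',
).rename(columns={'col2': 'result'})

print(result)
```

Unlike the half-open ranges in the question, this maps any value at or above the last bound to that bound, and a value below the first bound gets NaN.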