Difference for a column between two dataframes with condition limit
Question:
Context: I have two different series of data saved in two dataframes:
dataframe1:
index  object       time
    0      45   12.56416
    1      30   10.61656
    2       5   10.74478
    3       8   56.14421
    4       1   13.23214
    5      45   58.56315

dataframe2:
index  object       time
    0      45   12.56491
    1      30   10.61656
    2      15  189.74478
    3       8   56.14421
    4      45   98.23214
    5      45   58.56410
    6       5   10.74992
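To make the example reproducible, the two tables can be built as below. This is only a sketch: it assumes that `index` is an ordinary column rather than the DataFrame index, and the names `df1` and `df2` are chosen here for illustration.

```python
import pandas as pd

# Assumption: "index" is a regular column, not the DataFrame's own index.
df1 = pd.DataFrame({
    "index": [0, 1, 2, 3, 4, 5],
    "object": [45, 30, 5, 8, 1, 45],
    "time": [12.56416, 10.61656, 10.74478, 56.14421, 13.23214, 58.56315],
})

df2 = pd.DataFrame({
    "index": [0, 1, 2, 3, 4, 5, 6],
    "object": [45, 30, 15, 8, 45, 45, 5],
    "time": [12.56491, 10.61656, 189.74478, 56.14421,
             98.23214, 58.56410, 10.74992],
})

print(df1)
print(df2)
```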
In each dataframe, the same object can appear several times with different times. The goal is to compare the two dataframes and show a result like this:
object  time_dataframe1  time_dataframe2  difference
    45         12.56416         12.56491     0.00075
    45         58.56315         58.56410     0.00095
    30         10.61656         10.61656     0.
     8         56.14421         56.14421     0.
     5         10.74478         10.74992     0.00514

where difference = |time_dataframe1 - time_dataframe2|.
The particularity here is that rows should be paired only when they have the same object in both dataframes and close times (within a precision to be fixed, here < 0.01); all other rows should be removed.
I could merge the two dataframes, but I don’t want to compare rows of dataframe1 with each other, for example. How can I solve this?
Thank you.
Answers:
Use merge_asof:
import pandas as pd

# merge_asof needs both inputs sorted on the matching key (the time columns)
(pd.merge_asof(df1.sort_values(by='time')
                  .rename(columns={'time': 'time_dataframe1'}),
               df2.drop(columns='index').sort_values(by='time')
                  .rename(columns={'time': 'time_dataframe2'}),
               by='object',  # only pair rows that share the same object
               left_on='time_dataframe1', right_on='time_dataframe2',
               direction='nearest', tolerance=0.01  # closest time within 0.01
               )
   .dropna(subset=['time_dataframe2'])  # drop df1 rows without a match in df2
   .assign(diff=lambda d: d['time_dataframe1'].sub(d['time_dataframe2']).abs())
   .sort_values(by='object', ascending=False)
)
Output:
   index  object  time_dataframe1  time_dataframe2     diff
2      0      45         12.56416         12.56491  0.00075
5      5      45         58.56315         58.56410  0.00095
0      1      30         10.61656         10.61656  0.00000
4      3       8         56.14421         56.14421  0.00000
1      2       5         10.74478         10.74992  0.00514
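To see why the `dropna` step is needed, it may help to look at what `merge_asof` returns before any filtering. The sketch below rebuilds the question's data (assuming `index` is a plain column; the names `raw` and `unmatched` are illustrative only) and isolates the rows of `df1` that found no partner within the tolerance.

```python
import pandas as pd

# Data from the question; "index" is assumed to be an ordinary column.
df1 = pd.DataFrame({
    "index": [0, 1, 2, 3, 4, 5],
    "object": [45, 30, 5, 8, 1, 45],
    "time": [12.56416, 10.61656, 10.74478, 56.14421, 13.23214, 58.56315],
})
df2 = pd.DataFrame({
    "index": [0, 1, 2, 3, 4, 5, 6],
    "object": [45, 30, 15, 8, 45, 45, 5],
    "time": [12.56491, 10.61656, 189.74478, 56.14421,
             98.23214, 58.56410, 10.74992],
})

# Same call as in the answer, but without the dropna/assign/sort steps.
raw = pd.merge_asof(
    df1.sort_values("time").rename(columns={"time": "time_dataframe1"}),
    df2.drop(columns="index").sort_values("time")
       .rename(columns={"time": "time_dataframe2"}),
    by="object",
    left_on="time_dataframe1",
    right_on="time_dataframe2",
    direction="nearest",
    tolerance=0.01,
)

# Rows of df1 with no df2 time within the tolerance keep NaN on the right.
unmatched = raw[raw["time_dataframe2"].isna()]
print(raw)
print(unmatched)
```

Because `merge_asof` behaves like a left join, every row of `df1` is kept and the unmatched ones carry `NaN` in `time_dataframe2`, while rows of `df2` that have no partner (object 15, and object 45 at 98.23214) never appear in the result at all.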
Another possible solution:
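tolerance = 0.01

# Keep each frame's row id under a distinct name; 'object' is the shared index level
aux1 = df1.rename({'index': 'index1'}, axis=1).set_index(['index1', 'object'])
aux2 = df2.rename({'index': 'index2'}, axis=1).set_index(['index2', 'object'])

# The subtraction aligns on 'object', pairing df1 rows with df2 rows of the same object
out = aux1['time'].sub(aux2['time']).abs().rename('diff')

# Keep only pairs within the tolerance, then bring back the original times
(out[out.le(tolerance)].reset_index()
 .merge(df1, left_on='index1', right_on='index').rename({'time': 'time1'}, axis=1)
 .merge(df2, left_on='index2', right_on='index').rename({'time': 'time2'}, axis=1)
 .loc[:, ['object', 'time1', 'time2', 'diff']])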
Output:
   object     time1     time2     diff
0       5  10.74478  10.74992  0.00514
1       8  56.14421  56.14421  0.00000
2      30  10.61656  10.61656  0.00000
3      45  12.56416  12.56491  0.00075
4      45  58.56315  58.56410  0.00095
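The same idea (pair every row of one frame with every row of the other that has the same object, then keep the close pairs) can also be written with an ordinary merge on `object`. This is a different formulation from the answer above, given only as a sketch; it assumes `index` is a plain column, and the names `pairs` and `result` are illustrative.

```python
import pandas as pd

# Data from the question; "index" is assumed to be an ordinary column.
df1 = pd.DataFrame({
    "index": [0, 1, 2, 3, 4, 5],
    "object": [45, 30, 5, 8, 1, 45],
    "time": [12.56416, 10.61656, 10.74478, 56.14421, 13.23214, 58.56315],
})
df2 = pd.DataFrame({
    "index": [0, 1, 2, 3, 4, 5, 6],
    "object": [45, 30, 15, 8, 45, 45, 5],
    "time": [12.56491, 10.61656, 189.74478, 56.14421,
             98.23214, 58.56410, 10.74992],
})

tolerance = 0.01

# Inner merge on "object": each df1 row is paired with every df2 row of the
# same object; the suffixes turn the two "time" columns into time1 and time2.
pairs = df1.drop(columns="index").merge(
    df2.drop(columns="index"), on="object", suffixes=("1", "2")
)
pairs["diff"] = (pairs["time1"] - pairs["time2"]).abs()

# Keep only the pairs whose times are within the tolerance.
result = (
    pairs[pairs["diff"] <= tolerance]
    .sort_values(["object", "time1"])
    .reset_index(drop=True)
)
print(result)
```

Unlike `merge_asof`, this keeps every pair within the tolerance rather than only the nearest one, and the intermediate `pairs` frame grows with the product of the per-object row counts, so it is best suited to modest data sizes.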