Difference for a column between two dataframes with condition limit

Question:

Context: I have two different series of data stored in two dataframes, dataframe1 and dataframe2 (shown in that order):

index object  time
0     45      12.56416
1     30      10.61656
2     5       10.74478
3     8       56.14421
4     1       13.23214
5     45      58.56315

index object  time
0     45      12.56491
1     30      10.61656
2     15      189.74478
3     8       56.14421
4     45      98.23214
5     45      58.56410
6     5       10.74992

In each dataframe, the same object can appear several times with different times. The goal is to compare the two dataframes with each other and produce a result like this:

object time_dataframe1  time_dataframe2  difference
45     12.56416         12.56491         |time_dataframe1-time_dataframe2|
45     58.56315         58.56410         0.00095
30     10.61656         10.61656         0.
8      56.14421         56.14421         0.
5      10.74478         10.74992         0.00514

The particularity here is that I want to compare the "same" object/time pairs between the two dataframes, where the times are close (within a precision to be fixed, here < 0.01), and to remove all other rows.

I could merge the two dataframes, but I don’t want to compare rows of dataframe1 with each other, for example. How can I solve this?

Thank you.

Asked By: Matthmatth03

||

Answers:

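Use pd.merge_asof, grouping by object and matching each dataframe1 row to the nearest dataframe2 time within the tolerance. Rows without a match get NaN and are dropped: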

(pd.merge_asof(df1.sort_values(by='time')
                  .rename(columns={'time': 'time_dataframe1'}),
               df2.drop(columns='index').sort_values(by='time')
                  .rename(columns={'time': 'time_dataframe2'}),
               by='object',
               left_on='time_dataframe1', right_on='time_dataframe2',
               direction='nearest', tolerance=0.01
              )
    .dropna(subset=['time_dataframe2'])
    .assign(diff=lambda d: d['time_dataframe1'].sub(d['time_dataframe2']).abs())
    .sort_values(by='object', ascending=False)
)

Output:

   index  object  time_dataframe1  time_dataframe2     diff
2      0      45         12.56416         12.56491  0.00075
5      5      45         58.56315         58.56410  0.00095
0      1      30         10.61656         10.61656  0.00000
4      3       8         56.14421         56.14421  0.00000
1      2       5         10.74478         10.74992  0.00514
Answered By: mozway
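
A self-contained sketch of the merge_asof approach above, for anyone who wants to run it end to end. It assumes that 'index' is an ordinary column in both frames (as the drop(columns='index') call implies). The function name match_nearest is hypothetical and used only for illustration.

```python
import pandas as pd

# Sample data from the question; 'index' is assumed to be a regular column.
df1 = pd.DataFrame({
    "index": [0, 1, 2, 3, 4, 5],
    "object": [45, 30, 5, 8, 1, 45],
    "time": [12.56416, 10.61656, 10.74478, 56.14421, 13.23214, 58.56315],
})
df2 = pd.DataFrame({
    "index": [0, 1, 2, 3, 4, 5, 6],
    "object": [45, 30, 15, 8, 45, 45, 5],
    "time": [12.56491, 10.61656, 189.74478, 56.14421, 98.23214, 58.56410, 10.74992],
})


def match_nearest(left, right, tolerance=0.01):
    """Hypothetical helper: pair each left row with the nearest right row
    of the same object, keeping only pairs within `tolerance`."""
    # merge_asof requires both frames to be sorted on their merge keys.
    left = left.sort_values("time").rename(columns={"time": "time_dataframe1"})
    right = (right.drop(columns="index")
                  .sort_values("time")
                  .rename(columns={"time": "time_dataframe2"}))
    merged = pd.merge_asof(left, right, by="object",
                           left_on="time_dataframe1",
                           right_on="time_dataframe2",
                           direction="nearest", tolerance=tolerance)
    # Left rows with no partner within the tolerance get NaN; drop them.
    merged = merged.dropna(subset=["time_dataframe2"])
    diff = (merged["time_dataframe1"] - merged["time_dataframe2"]).abs()
    return merged.assign(diff=diff)


result = match_nearest(df1, df2)
print(result)
```

Object 1 (present only in dataframe1) and object 15 (present only in dataframe2) do not appear in the result, and neither does the 98.23214 reading for object 45, since nothing in dataframe1 lies within 0.01 of it.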

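Another possible solution, which aligns the two time series on object, computes the absolute time difference for each dataframe1/dataframe2 pair of rows sharing the same object, and keeps the pairs within the tolerance: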

tolerance = 0.01
aux1 = df1.rename({'index': 'index1'}, axis=1).set_index(['index1', 'object'])
aux2 = df2.rename({'index': 'index2'}, axis=1).set_index(['index2', 'object'])
out = aux1['time'].sub(aux2['time']).abs().rename('diff')
(out[out.le(tolerance)].reset_index()
 .merge(df1, left_on='index1', right_on='index').rename({'time': 'time1'}, axis=1)
 .merge(df2, left_on='index2', right_on='index').rename({'time': 'time2'}, axis=1)
 .loc[:, ['object', 'time1', 'time2', 'diff']])

Output:

   object     time1     time2     diff
0       5  10.74478  10.74992  0.00514
1       8  56.14421  56.14421  0.00000
2      30  10.61656  10.61656  0.00000
3      45  12.56416  12.56491  0.00075
4      45  58.56315  58.56410  0.00095
Answered By: PaulS
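
For comparison, the plain merge mentioned in the question can also be sketched. Merging on object pairs dataframe1 rows only with dataframe2 rows, never with other rows of dataframe1. This is not the method of either answer, just a minimal illustration; the 'index' column is left out for brevity, and the names pairs and close_pairs are hypothetical.

```python
import pandas as pd

# Sample data from the question (the 'index' column is omitted here).
df1 = pd.DataFrame({
    "object": [45, 30, 5, 8, 1, 45],
    "time": [12.56416, 10.61656, 10.74478, 56.14421, 13.23214, 58.56315],
})
df2 = pd.DataFrame({
    "object": [45, 30, 15, 8, 45, 45, 5],
    "time": [12.56491, 10.61656, 189.74478, 56.14421, 98.23214, 58.56410, 10.74992],
})
tolerance = 0.01

# Inner merge on 'object': every dataframe1 row is paired with every
# dataframe2 row of the same object. The shared 'time' column gets suffixes.
pairs = df1.merge(df2, on="object", suffixes=("_dataframe1", "_dataframe2"))
pairs["difference"] = (pairs["time_dataframe1"] - pairs["time_dataframe2"]).abs()

# Keep only the pairs whose times are closer than the tolerance.
close_pairs = pairs.loc[
    pairs["difference"] < tolerance,
    ["object", "time_dataframe1", "time_dataframe2", "difference"],
].reset_index(drop=True)
print(close_pairs)
```

Unlike merge_asof, this keeps every pair within the tolerance, so a row can be matched more than once if several times of the same object are close together. The intermediate table holds one row per same-object pair, which can grow large when objects repeat often.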