Apply lambda involving 2 dataframes in pandas

Question:

I would like to select data in df2 based on time intervals in df1:

df1:

ID      Start                End                  col1   col2
23468   2011-01-03 01:01:03  2011-01-03 01:04:05  10     a
23468   2011-01-15 08:20:00  2011-01-18 01:01:01  50     b
23468   2011-02-03 01:07:20  2011-02-08 12:00:03  150    a
33525   2011-02-03 01:07:19  2011-02-06 12:00:03  10     a
                      ...

df2:

ID     Timestap             col3   col4
23468  2011-01-03 01:01:03  3     aa
23468  2011-01-03 01:02:00  4     bb
23468  2011-01-03 12:01:03  7     aa
33525  2011-02-03 02:31:03  10    aa
33525  2011-02-04 12:01:03  20    aa
33525  2011-02-05 14:00:01  30    aa
                      ...

I need to filter df2 where ID matches that in df1 and Timestamp is between Start and End in df1, then calculate the average of col3 for each group and store it in a new column in df1 called Average. Expected output:

ID      Start                End                  col1   col2   Average
23468   2011-01-03 01:01:03  2011-01-03 01:04:05  10     a      3.5
23468   2011-01-15 08:20:00  2011-01-18 01:01:01  50     b      nan
23468   2011-02-03 01:07:20  2011-02-08 12:00:03  150    a      nan
                      ...

33525   2011-02-03 01:07:19  2011-02-06 12:00:03  10     a      20
                      ...

I tried using a for-loop, but it takes ages because the dataframes are very large, and merging the two dataframes would take even longer. Can apply() with a lambda expression solve this issue? How do I refer to time intervals from another df?


Update:
What if I want to filter df2 based on the time intervals in df1, then find the difference between the first and last col3 (i.e. 6 - 3 = 3) and divide this value by the difference between the first and last Timestamp (i.e. 2011-01-03 01:03:03 - 2011-01-03 01:01:03 = 120 seconds)? The expected value is then 3/120 = 0.025.

df1:

ID      Start                End                  col1   col2
23468   2011-01-03 01:01:03  2011-01-03 01:04:05  10     a     (*)
23468   2011-01-15 08:20:00  2011-01-18 01:01:01  50     b
23468   2011-02-03 01:07:20  2011-02-08 12:00:03  150    a
33525   2011-02-03 01:07:19  2011-02-06 12:00:03  10     a
                      ...

df2:

ID     Timestap             col3   col4
23468  2011-01-03 01:01:03  3     aa   first row in time interval (*)
23468  2011-01-03 01:02:00  4     bb
23468  2011-01-03 01:03:03  6     aa   last row in time interval (*)   
23468  2011-01-03 12:01:03  7     aa
33525  2011-02-03 02:31:03  10    aa
33525  2011-02-04 12:01:03  20    aa
33525  2011-02-05 14:00:01  30    aa

So the expected output:

ID      Start                End                  col1   col2   Average
23468   2011-01-03 01:01:03  2011-01-03 01:04:05  10     a      0.025       (3/120=0.025)
23468   2011-01-15 08:20:00  2011-01-18 01:01:01  50     b      nan
23468   2011-02-03 01:07:20  2011-02-08 12:00:03  150    a      nan
                      ...

33525   2011-02-03 01:07:19  2011-02-06 12:00:03  10     a      0.000067032 (20/298364=0.000067032)
                      ...
Asked By: nilsinelabore


Answers:

You can use a list comprehension; it is several times faster than an explicit loop.
Row 2 comes out as 20 rather than nan because the filter below compares timestamps only and does not check ID.
In df1.iloc[i, 1], the first index is the row position and the second is the column position. For df2, label-based indexing with .loc is used: the first argument is a boolean mask (True or False per row) and the second is the column name.
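
As a small illustration of the two indexing styles just described, here is a sketch on a made-up frame (the name `toy` and its values are hypothetical, not taken from the question):

```python
import pandas as pd

# Hypothetical toy frame, used only to show the two indexing styles.
toy = pd.DataFrame({"ID": [1, 1, 2], "col3": [3, 4, 10]})

# Positional indexing: row position 0, column position 1 (i.e. 'col3').
first_value = toy.iloc[0, 1]

# Label-based indexing with a boolean mask: rows where ID == 1, column 'col3'.
mask = toy["ID"] == 1
selected_mean = toy.loc[mask, "col3"].mean()

print(first_value, selected_mean)
```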

import pandas as pd

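# df1 and df2 are the dataframes from the question.
# Convert the time columns to datetime so they can be compared.
df1[['Start', 'End']] = df1[['Start', 'End']].apply(pd.to_datetime)
df2['Timestap'] = pd.to_datetime(df2['Timestap'])

# For each row of df1, average col3 over the df2 rows whose Timestap lies
# in [Start, End]. Column positions 1 and 2 are Start and End.
# Note: only the timestamps are compared here; ID is not checked.
aaa = [df2.loc[(df2['Timestap'] >= df1.iloc[i, 1])
               & (df2['Timestap'] <= df1.iloc[i, 2]), 'col3'].mean() for i in range(len(df1))]

df1['Average'] = aaa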
print(df1)

Output

      ID               Start                 End  col1 col2  Average
0  23468 2011-01-03 01:01:03 2011-01-03 01:04:05    10    a      3.5
1  23468 2011-01-15 08:20:00 2011-01-18 01:01:01    50    b      NaN
2  23468 2011-02-03 01:07:20 2011-02-08 12:00:03   150    a     20.0
3  33525 2011-02-03 01:07:19 2011-02-06 12:00:03    10    a     20.0
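
The question also asks for ID to match, which the snippet above does not check. A minimal sketch of one way to add that condition follows, assuming the sample rows from the question (the frames are rebuilt here only so the example runs on its own, and the loop variable names are illustrative):

```python
import pandas as pd

# Sample rows reconstructed from the question.
df1 = pd.DataFrame({
    "ID": [23468, 23468, 23468, 33525],
    "Start": pd.to_datetime(["2011-01-03 01:01:03", "2011-01-15 08:20:00",
                             "2011-02-03 01:07:20", "2011-02-03 01:07:19"]),
    "End": pd.to_datetime(["2011-01-03 01:04:05", "2011-01-18 01:01:01",
                           "2011-02-08 12:00:03", "2011-02-06 12:00:03"]),
    "col1": [10, 50, 150, 10],
    "col2": ["a", "b", "a", "a"],
})
df2 = pd.DataFrame({
    "ID": [23468, 23468, 23468, 33525, 33525, 33525],
    "Timestap": pd.to_datetime(["2011-01-03 01:01:03", "2011-01-03 01:02:00",
                                "2011-01-03 12:01:03", "2011-02-03 02:31:03",
                                "2011-02-04 12:01:03", "2011-02-05 14:00:01"]),
    "col3": [3, 4, 7, 10, 20, 30],
    "col4": ["aa", "bb", "aa", "aa", "aa", "aa"],
})

# Same list comprehension idea, but the mask also requires a matching ID.
# Series.between is inclusive on both ends by default.
df1["Average"] = [
    df2.loc[(df2["ID"] == id_) & df2["Timestap"].between(start, end), "col3"].mean()
    for id_, start, end in zip(df1["ID"], df1["Start"], df1["End"])
]

print(df1)
```

With this mask, row 2 has no matching df2 rows and its Average is NaN, which agrees with the expected output in the question.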

Update 26.08.2022.
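The col3 values of the matching df2 rows are stored in the df1['123'] column.

'ind' holds the index labels of the df1 rows whose list has at least two elements.

'qqq' is the difference between the last and first element of each such list.

'ttt' is End minus Start in seconds. Note that this is the length of the df1 interval, not the span between the first and last matching Timestap, which is why row 0 gives 3/182 = 0.016484 rather than 0.025.

'tq' is the ratio qqq / ttt, which is stored in the df1['test'] column.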

import pandas as pd
import numpy as np

df1[['Start', 'End']] = df1[['Start', 'End']].apply(pd.to_datetime)
df2['Timestap'] = pd.to_datetime(df2['Timestap'])

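# col3 values of the df2 rows whose Timestap falls in [Start, End] (ID is not checked)
df1['123'] = [df2.loc[(df2['Timestap'] >= df1.iloc[i, 1])
                      & (df2['Timestap'] <= df1.iloc[i, 2]), 'col3'].values for i in range(len(df1))]

# Rows with at least two matching values
ind = df1[df1['123'].str.len().ge(2)].index
# Last minus first col3 value
qqq = df1.loc[ind, '123'].str[-1] - df1.loc[ind, '123'].str[0]
# Interval length (End - Start) in seconds
ttt = (df1.loc[ind, 'End'] - df1.loc[ind, 'Start']) / np.timedelta64(1, 's')
tq = qqq / ttt
df1['test'] = tq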

print(df1)

Output

      ID               Start                 End  ...  col2           123      test
0  23468 2011-01-03 01:01:03 2011-01-03 01:04:05  ...     a     [3, 4, 6]  0.016484
1  23468 2011-01-15 08:20:00 2011-01-18 01:01:01  ...     b            []       NaN
2  23468 2011-02-03 01:07:20 2011-02-08 12:00:03  ...     a  [10, 20, 30]  0.000042
3  33525 2011-02-03 01:07:19 2011-02-06 12:00:03  ...     a  [10, 20, 30]  0.000067
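
The update in the question describes the denominator as the time between the first and last matching Timestamp, and also asks for ID to match. A hedged sketch of that reading follows, assuming the sample rows from the update; the helper name `rate` is hypothetical. The question's second worked number (20/298364) corresponds to End - Start instead, so which denominator is intended is ambiguous, and this sketch follows the written description.

```python
import numpy as np
import pandas as pd

# Sample rows reconstructed from the update in the question.
df1 = pd.DataFrame({
    "ID": [23468, 23468, 23468, 33525],
    "Start": pd.to_datetime(["2011-01-03 01:01:03", "2011-01-15 08:20:00",
                             "2011-02-03 01:07:20", "2011-02-03 01:07:19"]),
    "End": pd.to_datetime(["2011-01-03 01:04:05", "2011-01-18 01:01:01",
                           "2011-02-08 12:00:03", "2011-02-06 12:00:03"]),
})
df2 = pd.DataFrame({
    "ID": [23468, 23468, 23468, 23468, 33525, 33525, 33525],
    "Timestap": pd.to_datetime(["2011-01-03 01:01:03", "2011-01-03 01:02:00",
                                "2011-01-03 01:03:03", "2011-01-03 12:01:03",
                                "2011-02-03 02:31:03", "2011-02-04 12:01:03",
                                "2011-02-05 14:00:01"]),
    "col3": [3, 4, 6, 7, 10, 20, 30],
})


def rate(id_, start, end):
    """(last col3 - first col3) / seconds between first and last matching Timestap."""
    rows = df2[(df2["ID"] == id_) & df2["Timestap"].between(start, end)]
    rows = rows.sort_values("Timestap")
    if len(rows) < 2:
        return np.nan  # fewer than two points: no difference to compute
    seconds = (rows["Timestap"].iloc[-1] - rows["Timestap"].iloc[0]).total_seconds()
    if seconds == 0:
        return np.nan  # avoid dividing by zero for identical timestamps
    return (rows["col3"].iloc[-1] - rows["col3"].iloc[0]) / seconds


df1["test"] = [rate(i, s, e) for i, s, e in zip(df1["ID"], df1["Start"], df1["End"])]
print(df1)
```

For row 0 this gives 3/120 = 0.025, matching the first worked example in the question.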
Answered By: inquirer