Apply lambda involving 2 dataframes in pandas
Question:
I would like to select data in df2 based on time intervals in df1:
df1:
ID Start End col1 col2
23468 2011-01-03 01:01:03 2011-01-03 01:04:05 10 a
23468 2011-01-15 08:20:00 2011-01-18 01:01:01 50 b
23468 2011-02-03 01:07:20 2011-02-08 12:00:03 150 a
33525 2011-02-03 01:07:19 2011-02-06 12:00:03 10 a
...
df2:
ID Timestap col3 col4
23468 2011-01-03 01:01:03 3 aa
23468 2011-01-03 01:02:00 4 bb
23468 2011-01-03 12:01:03 7 aa
33525 2011-02-03 02:31:03 10 aa
33525 2011-02-04 12:01:03 20 aa
33525 2011-02-05 14:00:01 30 aa
...
I need to filter df2: if ID matches that in df1 and Timestamp is between Start and End in df1, then calculate the average of col3 of the group and create a new column in df1 called Average. Expected output:
ID Start End col1 col2 Average
23468 2011-01-03 01:01:03 2011-01-03 01:04:05 10 a 3.5
23468 2011-01-15 08:20:00 2011-01-18 01:01:01 50 b nan
23468 2011-02-03 01:07:20 2011-02-08 12:00:03 150 a nan
...
33525 2011-02-03 01:07:19 2011-02-06 12:00:03 10 a 20
...
I tried using a for-loop, but it takes ages because the dataframes are large, and merging the two dataframes would take even longer. Can apply() with a lambda expression solve this issue? How do I refer to time intervals from another df?
Update:
What if I want to filter df2 based on time intervals in df1, then find the difference between the first and last col3 (i.e. 6 - 3 = 3) and divide this value by the difference between the first and last Timestamp (i.e. 2011-01-03 01:03:03 - 2011-01-03 01:01:03 = 120 seconds)? The expected value is then 3/120 = 0.025.
df1:
ID Start End col1 col2
23468 2011-01-03 01:01:03 2011-01-03 01:04:05 10 a (*)
23468 2011-01-15 08:20:00 2011-01-18 01:01:01 50 b
23468 2011-02-03 01:07:20 2011-02-08 12:00:03 150 a
33525 2011-02-03 01:07:19 2011-02-06 12:00:03 10 a
...
df2:
ID Timestap col3 col4
23468 2011-01-03 01:01:03 3 aa first row in time interval (*)
23468 2011-01-03 01:02:00 4 bb
23468 2011-01-03 01:03:03 6 aa last row in time interval (*)
23468 2011-01-03 12:01:03 7 aa
33525 2011-02-03 02:31:03 10 aa
33525 2011-02-04 12:01:03 20 aa
33525 2011-02-05 14:00:01 30 aa
So the expected output:
ID Start End col1 col2 Average
23468 2011-01-03 01:01:03 2011-01-03 01:04:05 10 a 0.025 (3/120=0.025)
23468 2011-01-15 08:20:00 2011-01-18 01:01:01 50 b nan
23468 2011-02-03 01:07:20 2011-02-08 12:00:03 150 a nan
...
33525 2011-02-03 01:07:19 2011-02-06 12:00:03 10 a 0.000067032 (20/298364 =0.000067032)
...
Answers:
You can use a list comprehension, which should be faster than an explicit for-loop over the rows.
In row 2 I get 20 rather than nan, because the code below matches on the time interval only and does not compare ID.
In df1.iloc[i, 1], the first argument is the row position and the second is the column position (1 is Start, 2 is End). For df2, label-based indexing with .loc is used: the first argument is a boolean mask (True or False per row), the second is the column name.
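import pandas as pd
# Parse the interval bounds and the timestamps as datetimes.
df1[['Start', 'End']] = df1[['Start', 'End']].apply(pd.to_datetime)
df2['Timestap'] = pd.to_datetime(df2['Timestap'])
# Mean of col3 over df2 rows with Start <= Timestap <= End, per df1 row.
# iloc columns 1 and 2 are Start and End; ID is not compared here.
aaa = [df2.loc[(df2['Timestap'] >= df1.iloc[i, 1])
               & (df2['Timestap'] <= df1.iloc[i, 2]), 'col3'].mean()
       for i in range(len(df1))]
df1['Average'] = aaa
print(df1)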
Output
ID Start End col1 col2 Average
0 23468 2011-01-03 01:01:03 2011-01-03 01:04:05 10 a 3.5
1 23468 2011-01-15 08:20:00 2011-01-18 01:01:01 50 b NaN
2 23468 2011-02-03 01:07:20 2011-02-08 12:00:03 150 a 20.0
3 33525 2011-02-03 01:07:19 2011-02-06 12:00:03 10 a 20.0
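Since the question also requires ID to match, here is a minimal sketch of the same idea with an ID condition added, written with apply() and a lambda as the question asks. The helper name interval_mean and the inline sample frames are assumptions made for illustration; the column names (including the Timestap spelling) follow the question. Note that apply() with axis=1 still visits df1 row by row, so it should not be expected to be fundamentally faster than the list comprehension.

```python
import pandas as pd

# Sample data copied from the question (column name 'Timestap' kept as-is).
df1 = pd.DataFrame({
    'ID': [23468, 23468, 23468, 33525],
    'Start': ['2011-01-03 01:01:03', '2011-01-15 08:20:00',
              '2011-02-03 01:07:20', '2011-02-03 01:07:19'],
    'End': ['2011-01-03 01:04:05', '2011-01-18 01:01:01',
            '2011-02-08 12:00:03', '2011-02-06 12:00:03'],
    'col1': [10, 50, 150, 10],
    'col2': ['a', 'b', 'a', 'a'],
})
df2 = pd.DataFrame({
    'ID': [23468, 23468, 23468, 33525, 33525, 33525],
    'Timestap': ['2011-01-03 01:01:03', '2011-01-03 01:02:00',
                 '2011-01-03 12:01:03', '2011-02-03 02:31:03',
                 '2011-02-04 12:01:03', '2011-02-05 14:00:01'],
    'col3': [3, 4, 7, 10, 20, 30],
    'col4': ['aa', 'bb', 'aa', 'aa', 'aa', 'aa'],
})

df1[['Start', 'End']] = df1[['Start', 'End']].apply(pd.to_datetime)
df2['Timestap'] = pd.to_datetime(df2['Timestap'])


def interval_mean(row, readings):
    """Hypothetical helper: mean of col3 over the rows of `readings`
    that share the row's ID and have Start <= Timestap <= End."""
    mask = (
        (readings['ID'] == row['ID'])
        & readings['Timestap'].between(row['Start'], row['End'])
    )
    # The mean of an empty selection is NaN, which marks "no match".
    return readings.loc[mask, 'col3'].mean()


df1['Average'] = df1.apply(lambda row: interval_mean(row, df2), axis=1)
print(df1)
```

With the ID condition in place, row 2 no longer picks up the readings that belong to ID 33525, so the Average column agrees with the expected output in the question.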
Update 26.08.2022.
The col3 values of the matching df2 rows are written into the df1['123'] column.
'ind' holds the index labels of the df1 rows that have two or more matching values.
'qqq' is the difference between the last and first value of each row.
'ttt' is the length of the interval, End - Start, in seconds.
'tq' is the ratio qqq / ttt, which is stored in the df1['test'] column.
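import pandas as pd
import numpy as np
df1[['Start', 'End']] = df1[['Start', 'End']].apply(pd.to_datetime)
df2['Timestap'] = pd.to_datetime(df2['Timestap'])
# Array of col3 values from df2 that fall inside each df1 interval.
df1['123'] = [df2.loc[(df2['Timestap'] >= df1.iloc[i, 1])
                      & (df2['Timestap'] <= df1.iloc[i, 2]), 'col3'].values
              for i in range(len(df1))]
# Rows of df1 with at least two matching values.
ind = df1[df1['123'].str.len().ge(2)].index
# Last value minus first value of each array.
qqq = df1.loc[ind, '123'].str[-1] - df1.loc[ind, '123'].str[0]
# Interval length (End - Start) in seconds.
ttt = (df1.loc[ind, 'End'] - df1.loc[ind, 'Start']) / np.timedelta64(1, 's')
tq = qqq / ttt
# Index alignment leaves NaN in the rows that are not in ind.
df1['test'] = tq
print(df1)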
Output
ID Start End ... col2 123 test
0 23468 2011-01-03 01:01:03 2011-01-03 01:04:05 ... a [3, 4, 6] 0.016484
1 23468 2011-01-15 08:20:00 2011-01-18 01:01:01 ... b [] NaN
2 23468 2011-02-03 01:07:20 2011-02-08 12:00:03 ... a [10, 20, 30] 0.000042
3 33525 2011-02-03 01:07:19 2011-02-06 12:00:03 ... a [10, 20, 30] 0.000067
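The worked example in the question divides by the time between the first and last matching Timestap (120 seconds), whereas the code above divides by End - Start (182 seconds for row 0, hence 0.016484). Below is a minimal sketch of the first reading, with an ID condition added. The helper name interval_rate, the column name Rate and the inline sample frames are assumptions made for illustration. Under this reading the last row becomes 20 / 214138 seconds rather than the 20 / 298364 quoted in the question, which corresponds to End - Start.

```python
import numpy as np
import pandas as pd

# Sample data copied from the update in the question.
df1 = pd.DataFrame({
    'ID': [23468, 23468, 23468, 33525],
    'Start': ['2011-01-03 01:01:03', '2011-01-15 08:20:00',
              '2011-02-03 01:07:20', '2011-02-03 01:07:19'],
    'End': ['2011-01-03 01:04:05', '2011-01-18 01:01:01',
            '2011-02-08 12:00:03', '2011-02-06 12:00:03'],
    'col1': [10, 50, 150, 10],
    'col2': ['a', 'b', 'a', 'a'],
})
df2 = pd.DataFrame({
    'ID': [23468, 23468, 23468, 23468, 33525, 33525, 33525],
    'Timestap': ['2011-01-03 01:01:03', '2011-01-03 01:02:00',
                 '2011-01-03 01:03:03', '2011-01-03 12:01:03',
                 '2011-02-03 02:31:03', '2011-02-04 12:01:03',
                 '2011-02-05 14:00:01'],
    'col3': [3, 4, 6, 7, 10, 20, 30],
    'col4': ['aa', 'bb', 'aa', 'aa', 'aa', 'aa', 'aa'],
})

df1[['Start', 'End']] = df1[['Start', 'End']].apply(pd.to_datetime)
df2['Timestap'] = pd.to_datetime(df2['Timestap'])


def interval_rate(row, readings):
    """Hypothetical helper: (last col3 - first col3) divided by the seconds
    between the first and last matching Timestap, for rows with the same ID
    inside [Start, End]. Returns NaN when fewer than two rows match."""
    mask = (
        (readings['ID'] == row['ID'])
        & readings['Timestap'].between(row['Start'], row['End'])
    )
    hits = readings.loc[mask].sort_values('Timestap')
    if len(hits) < 2:
        return np.nan
    seconds = (hits['Timestap'].iloc[-1]
               - hits['Timestap'].iloc[0]).total_seconds()
    if seconds == 0:
        return np.nan  # avoid dividing by zero for identical timestamps
    return (hits['col3'].iloc[-1] - hits['col3'].iloc[0]) / seconds


df1['Rate'] = df1.apply(lambda row: interval_rate(row, df2), axis=1)
print(df1[['ID', 'Start', 'End', 'Rate']])
```

For the first row this gives 3 / 120 = 0.025, as in the question's worked example.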