How to assign new column value based on max value in another column preceding date
Question:
I would like to create a new column called CDAT in the following dataframe, with CDAT equal to the "DATE" of the last "BRED" EVENT from the same ID, LACT and FDAT combination that preceded the "PREG" EVENT.
Effectively I need to group by ID, LACT and FDAT and then, when there is a "PREG" EVENT, fill the new CDAT column with the "DATE" from the most recent "BRED" EVENT that precedes the date of the "PREG" EVENT.
An example of the data is presented below
ID LACT FDAT EVENT DATE
0 46 1 2011-09-23 BRED 2012-03-02
1 46 1 2011-09-23 PREG 2012-04-03
2 46 1 2011-09-23 PREG 2012-05-22
3 46 1 2011-09-23 PREG 2012-10-09
4 46 2 2012-11-15 FRESH 2012-11-15
5 46 2 2012-11-15 LUT 2013-01-08
6 46 2 2012-11-15 OS 2013-01-15
7 46 2 2012-11-15 BRED 2013-01-01
8 46 2 2012-11-15 BRED 2013-01-24
9 46 2 2012-11-15 PREG 2013-02-26
10 46 2 2012-11-16 BRED 2013-03-10
The Output I would like to achieve is
ID LACT FDAT EVENT DATE CDAT
0 46 1 2011-09-23 BRED 2012-03-02
1 46 1 2011-09-23 PREG 2012-04-03 2012-03-02
2 46 1 2011-09-23 PREG 2012-05-22 2012-03-02
3 46 1 2011-09-23 PREG 2012-10-09 2012-03-02
4 46 2 2012-11-15 FRESH 2012-11-15
5 46 2 2012-11-15 LUT 2013-01-08
6 46 2 2012-11-15 OS 2013-01-15
7 46 2 2012-11-15 BRED 2013-01-01
8 46 2 2012-11-15 BRED 2013-01-24
9 46 2 2012-11-15 PREG 2013-02-26 2013-01-24
10 46 2 2012-11-16 BRED 2013-03-10
I cannot think of a way to incorporate the date and EVENT selection into a groupby statement that would achieve what I would like to do.
A list of the sample data is presented below
[[46,1,Timestamp('2011-09-23 00:00:00'),'BRED',Timestamp('2012-03-02 00:00:00')],
[46,1,Timestamp('2011-09-23 00:00:00'),'PREG',Timestamp('2012-04-03 00:00:00')],
[46,1,Timestamp('2011-09-23 00:00:00'),'PREG',Timestamp('2012-05-22 00:00:00')],
[46,1,Timestamp('2011-09-23 00:00:00'),'PREG',Timestamp('2012-10-09 00:00:00')],
[46,2,Timestamp('2012-11-15 00:00:00'),'FRESH',Timestamp('2012-11-15 00:00:00')],
[46,2,Timestamp('2012-11-15 00:00:00'),'LUT',Timestamp('2013-01-08 00:00:00')],
[46,2,Timestamp('2012-11-15 00:00:00'),'OS',Timestamp('2013-01-15 00:00:00')],
[46,2,Timestamp('2012-11-15 00:00:00'),'BRED',Timestamp('2013-01-01 00:00:00')],
[46,2,Timestamp('2012-11-15 00:00:00'),'BRED',Timestamp('2013-01-24 00:00:00')],
[46,2,Timestamp('2012-11-15 00:00:00'),'PREG',Timestamp('2013-02-26 00:00:00')],
[46,2,Timestamp('2012-11-16 00:00:00'),'BRED',Timestamp('2013-03-10 00:00:00')],
[46,2,Timestamp('2012-11-15 00:00:00'),'PREG',Timestamp('2013-04-16 00:00:00')],
[46,2,Timestamp('2001-11-15 00:00:00'),'PREG',Timestamp('2013-08-06 00:00:00')]]
Answers:
This should work:
import pandas as pd
import numpy as np

df = pd.DataFrame([[46,1,pd.Timestamp('2011-09-23'),'BRED',pd.Timestamp('2012-03-02')],
                   [46,1,pd.Timestamp('2011-09-23'),'PREG',pd.Timestamp('2012-04-03')],
                   [46,1,pd.Timestamp('2011-09-23'),'PREG',pd.Timestamp('2012-05-22')],
                   [46,1,pd.Timestamp('2011-09-23'),'PREG',pd.Timestamp('2012-10-09')],
                   [46,2,pd.Timestamp('2012-11-15'),'FRESH',pd.Timestamp('2012-11-15')],
                   [46,2,pd.Timestamp('2012-11-15'),'LUT',pd.Timestamp('2013-01-08')],
                   [46,2,pd.Timestamp('2012-11-15'),'OS',pd.Timestamp('2013-01-15')],
                   [46,2,pd.Timestamp('2012-11-15'),'BRED',pd.Timestamp('2013-01-01')],
                   [46,2,pd.Timestamp('2012-11-15'),'BRED',pd.Timestamp('2013-01-24')],
                   [46,2,pd.Timestamp('2012-11-15'),'PREG',pd.Timestamp('2013-02-26')],
                   [46,2,pd.Timestamp('2012-11-16'),'BRED',pd.Timestamp('2013-03-10')]],
                  columns=['ID', 'LACT', 'FDAT', 'EVENT', 'DATE'])

# Sort so that rows are in chronological order within each group
df = df.sort_values(['ID', 'LACT', 'FDAT', 'DATE'])

last_bred_dates = []
for name, group in df.groupby(['ID', 'LACT', 'FDAT']):
    last_bred_date = np.nan  # reset for every group
    for i, row in group.iterrows():
        if row['EVENT'] == 'BRED':
            last_bred_date = row['DATE']
            last_bred_dates.append(np.nan)
        elif row['EVENT'] == 'PREG':
            last_bred_dates.append(last_bred_date)
        else:
            last_bred_dates.append(np.nan)

# The list follows the sorted row order, so attach the sorted index;
# a default 0..n-1 index would be matched by label and could misplace values
df['CDAT'] = pd.Series(last_bred_dates, index=df.index)
Output:
| | ID | LACT | FDAT | EVENT | DATE | CDAT |
|---|---|---|---|---|---|---|
| 0 | 46 | 1 | 2011-09-23 00:00:00 | BRED | 2012-03-02 00:00:00 | NaT |
| 1 | 46 | 1 | 2011-09-23 00:00:00 | PREG | 2012-04-03 00:00:00 | 2012-03-02 00:00:00 |
| 2 | 46 | 1 | 2011-09-23 00:00:00 | PREG | 2012-05-22 00:00:00 | 2012-03-02 00:00:00 |
| 3 | 46 | 1 | 2011-09-23 00:00:00 | PREG | 2012-10-09 00:00:00 | 2012-03-02 00:00:00 |
| 4 | 46 | 2 | 2012-11-15 00:00:00 | FRESH | 2012-11-15 00:00:00 | NaT |
| 7 | 46 | 2 | 2012-11-15 00:00:00 | BRED | 2013-01-01 00:00:00 | NaT |
| 5 | 46 | 2 | 2012-11-15 00:00:00 | LUT | 2013-01-08 00:00:00 | NaT |
| 6 | 46 | 2 | 2012-11-15 00:00:00 | OS | 2013-01-15 00:00:00 | NaT |
| 8 | 46 | 2 | 2012-11-15 00:00:00 | BRED | 2013-01-24 00:00:00 | NaT |
| 9 | 46 | 2 | 2012-11-15 00:00:00 | PREG | 2013-02-26 00:00:00 | 2013-01-24 00:00:00 |
| 10 | 46 | 2 | 2012-11-16 00:00:00 | BRED | 2013-03-10 00:00:00 | NaT |
Explanation:
Group the df by ['ID', 'LACT', 'FDAT'] to get the desired groups. Then create an empty list and iterate over the rows of each group. If the EVENT of a row is BRED, save its DATE value and append a NaN to the list. If the EVENT of a row is PREG, append the saved value to the list. For any other event, append a NaN to the list. Finally, use that list to create the new CDAT column.
Note that before iterating over each group, the variable last_bred_date is reset to NaN, so that only dates from that same group are ever appended to the list.
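The same idea can probably be written without an explicit Python loop, which may matter on large frames. The following is only a sketch, run on a shortened version of the sample data: it blanks out DATE on every row that is not a BRED event, forward-fills that value within each ['ID', 'LACT', 'FDAT'] group, and then keeps the result only on PREG rows. The names keys, bred_date and last_bred are made up for this example.

```python
import pandas as pd

# Shortened sample based on the question's data (same column names).
df = pd.DataFrame(
    [
        [46, 1, "2011-09-23", "BRED", "2012-03-02"],
        [46, 1, "2011-09-23", "PREG", "2012-04-03"],
        [46, 2, "2012-11-15", "LUT", "2013-01-08"],
        [46, 2, "2012-11-15", "BRED", "2013-01-01"],
        [46, 2, "2012-11-15", "BRED", "2013-01-24"],
        [46, 2, "2012-11-15", "PREG", "2013-02-26"],
        [46, 2, "2012-11-16", "BRED", "2013-03-10"],
    ],
    columns=["ID", "LACT", "FDAT", "EVENT", "DATE"],
)
df["FDAT"] = pd.to_datetime(df["FDAT"])
df["DATE"] = pd.to_datetime(df["DATE"])

keys = ["ID", "LACT", "FDAT"]  # hypothetical helper name for the group columns

# Chronological order inside each group is required for the forward fill.
df = df.sort_values(keys + ["DATE"])

# Keep DATE only on BRED rows (NaT elsewhere) ...
bred_date = df["DATE"].where(df["EVENT"] == "BRED")

# ... and carry the latest BRED date forward, never across group boundaries.
last_bred = bred_date.groupby([df[k] for k in keys]).ffill()

# Only PREG rows receive a CDAT; every other row stays NaT.
df["CDAT"] = last_bred.where(df["EVENT"] == "PREG")

print(df)
```

Because every step works on index-aligned Series, the sorted index needs no special handling here. One caveat: if a BRED and a PREG event in the same group share the same DATE, whether that BRED counts as "preceding" depends on the row order after sorting, so ties may need an explicit rule.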