How to assign new column value based on max value in another column preceding date

Question:

I would like to create a new column called CDAT in the following dataframe, with CDAT equal to the "DATE" of the last "BRED" event from the same ID, LACT and FDAT combination that preceded the "PREG" event.

Effectively, I need to group by ID, LACT and FDAT and then, whenever there is a "PREG" event, fill the new CDAT column with the "DATE" of the most recent "BRED" event that precedes the date of that "PREG" event.

An example of the data is presented below

    ID  LACT    FDAT    EVENT   DATE
0   46  1   2011-09-23  BRED    2012-03-02
1   46  1   2011-09-23  PREG    2012-04-03
2   46  1   2011-09-23  PREG    2012-05-22
3   46  1   2011-09-23  PREG    2012-10-09
4   46  2   2012-11-15  FRESH   2012-11-15
5   46  2   2012-11-15  LUT     2013-01-08
6   46  2   2012-11-15  OS      2013-01-15
7   46  2   2012-11-15  BRED    2013-01-01
8   46  2   2012-11-15  BRED    2013-01-24
9   46  2   2012-11-15  PREG    2013-02-26
10  46  2   2012-11-16  BRED    2013-03-10

The output I would like to achieve is:

    ID  LACT    FDAT    EVENT   DATE         CDAT
0   46  1   2011-09-23  BRED    2012-03-02
1   46  1   2011-09-23  PREG    2012-04-03   2012-03-02
2   46  1   2011-09-23  PREG    2012-05-22   2012-03-02
3   46  1   2011-09-23  PREG    2012-10-09   2012-03-02
4   46  2   2012-11-15  FRESH   2012-11-15
5   46  2   2012-11-15  LUT     2013-01-08
6   46  2   2012-11-15  OS      2013-01-15
7   46  2   2012-11-15  BRED    2013-01-01
8   46  2   2012-11-15  BRED    2013-01-24
9   46  2   2012-11-15  PREG    2013-02-26   2013-01-24
10  46  2   2012-11-16  BRED    2013-03-10

I cannot think of a way to incorporate the date and EVENT selection into a groupby statement that would achieve what I would like to do.

A list of the sample data is presented below

[[46,1,Timestamp('2011-09-23 00:00:00'),'BRED',Timestamp('2012-03-02 00:00:00')],
 [46,1,Timestamp('2011-09-23 00:00:00'),'PREG',Timestamp('2012-04-03 00:00:00')],
 [46,1,Timestamp('2011-09-23 00:00:00'),'PREG',Timestamp('2012-05-22 00:00:00')],
 [46,1,Timestamp('2011-09-23 00:00:00'),'PREG',Timestamp('2012-10-09 00:00:00')],
 [46,2,Timestamp('2012-11-15 00:00:00'),'FRESH',Timestamp('2012-11-15 00:00:00')],
 [46,2,Timestamp('2012-11-15 00:00:00'),'LUT',Timestamp('2013-01-08 00:00:00')],
 [46,2,Timestamp('2012-11-15 00:00:00'),'OS',Timestamp('2013-01-15 00:00:00')],
 [46,2,Timestamp('2012-11-15 00:00:00'),'BRED',Timestamp('2013-01-01 00:00:00')],
 [46,2,Timestamp('2012-11-15 00:00:00'),'BRED',Timestamp('2013-01-24 00:00:00')],
 [46,2,Timestamp('2012-11-15 00:00:00'),'PREG',Timestamp('2013-02-26 00:00:00')],
 [46,2,Timestamp('2012-11-16 00:00:00'),'BRED',Timestamp('2013-03-10 00:00:00')],
 [46,2,Timestamp('2012-11-15 00:00:00'),'PREG',Timestamp('2013-04-16 00:00:00')],
 [46,2,Timestamp('2001-11-15 00:00:00'),'PREG',Timestamp('2013-08-06 00:00:00')]]
Asked By: JohnH

||

Answers:

This should work:

import pandas as pd
import numpy as np

df = pd.DataFrame([[46,1,pd.Timestamp('2011-09-23'),'BRED',pd.Timestamp('2012-03-02')],
                     [46,1,pd.Timestamp('2011-09-23'),'PREG',pd.Timestamp('2012-04-03')],
                     [46,1,pd.Timestamp('2011-09-23'),'PREG',pd.Timestamp('2012-05-22')],
                     [46,1,pd.Timestamp('2011-09-23'),'PREG',pd.Timestamp('2012-10-09')],
                     [46,2,pd.Timestamp('2012-11-15'),'FRESH',pd.Timestamp('2012-11-15')],
                     [46,2,pd.Timestamp('2012-11-15'),'LUT',pd.Timestamp('2013-01-08')],
                     [46,2,pd.Timestamp('2012-11-15'),'OS',pd.Timestamp('2013-01-15')],
                     [46,2,pd.Timestamp('2012-11-15'),'BRED',pd.Timestamp('2013-01-01')],
                     [46,2,pd.Timestamp('2012-11-15'),'BRED',pd.Timestamp('2013-01-24')],
                     [46,2,pd.Timestamp('2012-11-15'),'PREG',pd.Timestamp('2013-02-26')],
                     [46,2,pd.Timestamp('2012-11-16'),'BRED',pd.Timestamp('2013-03-10')]],
                    columns=['ID', 'LACT', 'FDAT', 'EVENT', 'DATE'])

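# Sort so that, within each group, rows are in chronological order
df = df.sort_values(['ID', 'LACT', 'FDAT', 'DATE'])

last_bred_dates = []
for name, group in df.groupby(['ID', 'LACT', 'FDAT']):
    # Reset for every group so dates never leak from one group into the next
    last_bred_date = pd.NaT
    for i, row in group.iterrows():
        if row['EVENT'] == 'BRED':
            # Remember this breeding date; the BRED row itself gets no CDAT
            last_bred_date = row['DATE']
            last_bred_dates.append(pd.NaT)
        elif row['EVENT'] == 'PREG':
            last_bred_dates.append(last_bred_date)
        else:
            last_bred_dates.append(pd.NaT)

# The list follows the sorted row order, so reuse df's index. A bare
# pd.Series(...) would get a fresh 0..n-1 index and could align to the wrong rows.
df['CDAT'] = pd.Series(last_bred_dates, index=df.index)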

Output:

    ID  LACT  FDAT                 EVENT  DATE                 CDAT
0   46  1     2011-09-23 00:00:00  BRED   2012-03-02 00:00:00  NaT
1   46  1     2011-09-23 00:00:00  PREG   2012-04-03 00:00:00  2012-03-02 00:00:00
2   46  1     2011-09-23 00:00:00  PREG   2012-05-22 00:00:00  2012-03-02 00:00:00
3   46  1     2011-09-23 00:00:00  PREG   2012-10-09 00:00:00  2012-03-02 00:00:00
4   46  2     2012-11-15 00:00:00  FRESH  2012-11-15 00:00:00  NaT
7   46  2     2012-11-15 00:00:00  BRED   2013-01-01 00:00:00  NaT
5   46  2     2012-11-15 00:00:00  LUT    2013-01-08 00:00:00  NaT
6   46  2     2012-11-15 00:00:00  OS     2013-01-15 00:00:00  NaT
8   46  2     2012-11-15 00:00:00  BRED   2013-01-24 00:00:00  NaT
9   46  2     2012-11-15 00:00:00  PREG   2013-02-26 00:00:00  2013-01-24 00:00:00
10  46  2     2012-11-16 00:00:00  BRED   2013-03-10 00:00:00  NaT

Explanation:

Sort the df by date and group it on ['ID', 'LACT', 'FDAT'] to get the desired groups. Then create an empty list and iterate over the rows of each group:

- if the EVENT of the row is BRED, save its DATE value and append a missing value to the list;
- if the EVENT of the row is PREG, append the saved value to the list;
- for any other event, append a missing value to the list.

Finally, use that list to create the new CDAT column.

Note that the variable last_bred_date is reset to a missing value before iterating over each group, so that only BRED dates from the same group can be appended for that group's PREG rows.
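
The same idea can probably be written without explicit Python loops. The following is only a sketch of an alternative, not the method used in the answer above: it masks DATE so that only BRED rows keep a value, forward-fills that value within each ['ID', 'LACT', 'FDAT'] group, and then keeps the result only on PREG rows. The helper names (rows, keys, is_bred, is_preg, last_bred) are made up for illustration, and the sketch assumes a BRED and a PREG event of the same group never share the same DATE.

```python
import pandas as pd

# Sample rows taken from the question (dates given as strings for brevity)
rows = [
    [46, 1, '2011-09-23', 'BRED',  '2012-03-02'],
    [46, 1, '2011-09-23', 'PREG',  '2012-04-03'],
    [46, 1, '2011-09-23', 'PREG',  '2012-05-22'],
    [46, 1, '2011-09-23', 'PREG',  '2012-10-09'],
    [46, 2, '2012-11-15', 'FRESH', '2012-11-15'],
    [46, 2, '2012-11-15', 'LUT',   '2013-01-08'],
    [46, 2, '2012-11-15', 'OS',    '2013-01-15'],
    [46, 2, '2012-11-15', 'BRED',  '2013-01-01'],
    [46, 2, '2012-11-15', 'BRED',  '2013-01-24'],
    [46, 2, '2012-11-15', 'PREG',  '2013-02-26'],
    [46, 2, '2012-11-16', 'BRED',  '2013-03-10'],
]
df = pd.DataFrame(rows, columns=['ID', 'LACT', 'FDAT', 'EVENT', 'DATE'])
df['FDAT'] = pd.to_datetime(df['FDAT'])
df['DATE'] = pd.to_datetime(df['DATE'])

# Chronological order within each group is required for the forward fill
df = df.sort_values(['ID', 'LACT', 'FDAT', 'DATE'])

keys = [df['ID'], df['LACT'], df['FDAT']]
is_bred = df['EVENT'] == 'BRED'
is_preg = df['EVENT'] == 'PREG'

# Keep DATE only on BRED rows, then carry it forward inside each group
last_bred = df['DATE'].where(is_bred).groupby(keys).ffill()

# Only PREG rows receive a CDAT; every other row stays NaT
df['CDAT'] = last_bred.where(is_preg)

print(df)
```

Because every intermediate Series keeps the index of df, the final assignment aligns by label and does not depend on the row order after sorting.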

Answered By: Pedro Rocha