Conditionally append rows from one dataframe to another based on key-id and date range

Question:

I have two large dataframes with the same columns. Example data of relevance is presented below.

import pandas as pd 
    
df1 = [[52, '1', 'Mast', '1/1/2022'], [54, '1', 'Mast', '1/1/2023'],
       [55, '2', 'Mast', '25/5/2022'], [52, '1', 'Mast', '14/7/2022']]
    
df = pd.DataFrame(df1, columns =['Cow','Lact', 'Event', 'Date'])
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')


df2 = [[60, '1', 'Other', '10/1/2022'], [54, '1', 'Other', '3/1/2023'],
       [55, '2', 'Other', '30/5/2022'], [52, '1', 'Other', '14/8/2022']]

    
df2 = pd.DataFrame(df2, columns =['Cow', 'Lact', 'Event', 'Date'])
df2['Date'] = pd.to_datetime(df2['Date'], format='%d/%m/%Y')

The structure of df is

Cow Lact    Event   Date
0   52  1   Mast    2022-01-01
1   54  1   Mast    2023-01-01
2   55  2   Mast    2022-05-25
3   52  1   Mast    2022-07-14

The structure of df2 is

    Cow Lact    Event   Date
0   60  1   Other   2022-01-10
1   54  1   Other   2023-01-03
2   55  2   Other   2022-05-30
3   52  1   Other   2022-08-14

I would like to append rows from df2 to df if the "Cow" and "Lact" from df2 matches df and the "Date" of df2 fits within the range of the "Date" in df and five days greater than the "Date" of df.

The desired output with this example data is

Cow Lact    Event   Date
0   52  1   Mast    2022-01-01
1   54  1   Mast    2023-01-01
2   54  1   Other   2023-01-03
3   55  2   Mast    2022-05-25
4   55  2   Other   2022-05-30
5   52  1   Mast    2022-07-14

Any ideas appreciated.

Asked By: JohnH

||

Answers:

Not a clean solution, but it works:

# create temp columns cow-Lact
df['cow_lact'] = df['Cow'].astype(str) + '_' +  df['Lact'].astype(str)
df2['cow_lact'] = df2['Cow'].astype(str) + '_' +  df2['Lact'].astype(str)

# select correct colunms from df2 
df_append = df2[
    df2['cow_lact'].isin(df['cow_lact'])
    & df2['Date'].between(df['Date'].min(), df['Date'].max() + pd.Timedelta(days=5))]

# concat
result = pd.concat([df,df_append], ignore_index=True)
# drop temp column
result.drop(columns=['cow_lact'])

Answered By: Artyom Akselrod

Using merge_asof:

df['Date'] = pd.to_datetime(df['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])

keep = pd.merge_asof(
              df[['Cow', 'Lact', 'Date']].sort_values(by='Date'),
              df2.reset_index().sort_values(by='Date'),
              on='Date', by=['Cow', 'Lact'],
              direction='forward',
              tolerance=pd.Timedelta('5D')
             )['index'].dropna()

out = pd.concat([df, df2.loc[keep]])

Output:

   Cow Lact  Event       Date
0   52    1   Mast 2022-01-01
1   54    1   Mast 2023-01-01
2   55    2   Mast 2022-05-25
3   52    1   Mast 2022-07-14
2   55    2  Other 2022-05-30
1   54    1  Other 2023-01-03
Answered By: mozway
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.