Get a list of items sent between two dates pandas

Question:

I am new to python and i’m not sure how to achive what i’m trying to do. I have two pandas dataframes and I need to combine them. One dataframe is called 'cust' and the other one is called 'items', Below is the code to create both the datasets:

import pandas as pd
  
cust = {'id': [212175, 286170, 361739, 297438, 415712],
        'date_start': ['20/05/2022', '18/05/2022', '10/08/2021', '20/01/2022', '30/07/2021']}

items = {'id': [212175, 212175, 212175, 212175, 212175, 286170, 286170, 286170, 286170, 286170],
        'item': ['PX002','PY005','PX003','PX003','NX002','PX002','PY005','PX003','PX003','NX002'], 
        'date_sent': ['21/05/2022','10/05/2022','01/06/2022','01/07/2021','15/08/2022','19/05/2022','10/07/2022','15/07/2022','10/04/2022','15/06/2022']}

# Create DataFrame
cust_df = pd.DataFrame(cust)
items_df = pd.DataFrame(items)

Cust_df:

id     date_start
212175  20/05/2022
286170  18/05/2022
361739  10/08/2021
297438  20/01/2022
415712  30/07/2021

items_df:

 id      item    date_sent
212175  PX002   21/05/2022
212175  PY005   10/05/2022
212175  PX003   01/06/2022
212175  PX003   01/07/2021
212175  NX002   15/08/2022
286170  PX002   19/05/2022
286170  PY005   10/07/2022
286170  PX003   15/07/2022
286170  PX003   10/04/2022
286170  NX002   15/06/2022

I need to find out all the items that were sent to every customer between two dates. The start date for each customer is different and should be taken from the date_start field present in the cust_df data frame. The end date for every customer is the same which is '12/08/2022'. So, the date range for customer 212175 will be 20/05/2022 - 12/08/2022, and i need to find out all the items sent to this customer between these dates. My desired output dataframe should look like the below:

Output_dataframe:

id       item   date_sent   date_start
212175  PX002   21/05/2022  20/05/2022
212175  PX003   01/06/2022  20/05/2022
212175  PX005   10/05/2022  20/05/2022
286170  PX002   19/05/2022  18/05/2022
286170  PY005   10/07/2022  18/05/2022
286170  PX003   15/07/2022  18/05/2022
286170  NX002   15/06/2022  18/05/2022

In the output table we can see that for ID = 212175, only 3 items are showing. That’s because only 3 items were sent during the window 20/05/2022 - 12/08/2022.

For ID = 286170, only 1 item 'PX003' isn’t showing in the desired output and that’s because the date this item was sent was on '10/04/2022', which is before the date_start (18/05/2022) for customer 286170.

Any help appreciated. Thanks

Asked By: SAJ

||

Answers:

IIUC, use a merge_asof:

cust_df['date_start'] = pd.to_datetime(cust_df['date_start'], dayfirst=True)
items_df['date_sent'] = pd.to_datetime(items_df['date_sent'], dayfirst=True)

out = (pd
   .merge_asof(items_df.sort_values(by='date_sent'),
               cust_df.sort_values(by='date_start'),
               by='id', left_on='date_sent', right_on='date_start')
   .dropna(subset='date_start')
   .loc[lambda d: d['date_sent'] < pd.Timestamp("12/08/2022")]
   .sort_values(by=['id', 'item']) # optional
)

output:

       id   item  date_sent date_start
9  212175  NX002 2022-08-15 2022-05-20
4  212175  PX002 2022-05-21 2022-05-20
5  212175  PX003 2022-06-01 2022-05-20
6  286170  NX002 2022-06-15 2022-05-18
3  286170  PX002 2022-05-19 2022-05-18
8  286170  PX003 2022-07-15 2022-05-18
7  286170  PY005 2022-07-10 2022-05-18
Answered By: mozway

Merge and filter

m = pd.merge(items_df, cust_df, how='inner', on='id')
m.date_start = pd.to_datetime(m.date_start, dayfirst=True)
m.date_sent = pd.to_datetime(m.date_sent, dayfirst=True)
m.loc[ (m.date_start < m.date_sent) * ( m.date_sent < pd.to_datetime("12/08/2022", dayfirst=False))].reset_index()

#       id   item  date_sent date_start
#0  212175  PX002 2022-05-21 2022-05-20
#1  212175  PX003 2022-06-01 2022-05-20
#2  212175  NX002 2022-08-15 2022-05-20
#3  286170  PX002 2022-05-19 2022-05-18
#4  286170  PY005 2022-07-10 2022-05-18
#5  286170  PX003 2022-07-15 2022-05-18
#6  286170  NX002 2022-06-15 2022-05-18

Answered By: dermen
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.