How to fillna limited by date in a groupby

Question:

I am working with the following DataFrame, which contains some NaN values.

import numpy as np
import pandas as pd

# pd.datetime was removed from pandas; pd.to_datetime builds the same dates.
df = pd.DataFrame({'day': pd.to_datetime(['2020-01-01', '2020-01-03', '2020-01-04',
                                          '2020-01-05', '2020-01-06', '2020-01-07',
                                          '2020-01-08', '2020-01-08', '2020-06-09']),
                   'TradeID': ['01', '02', '03', '04', '05', '06', '07', '08', '09'],
                   'Security': ['GOOGLE', 'GOOGLE', 'APPLE', 'GOOGLE', 'GOOGLE', 'GOOGLE', 'GOOGLE', 'GOOGLE', 'GOOGLE'],
                   'ID': ['ID001', 'ID001', 'ID001', 'ID001', 'ID001', 'ID001', 'ID001', 'ID001', 'ID001'],
                   'BSType': ['B', 'S', 'B', 'B', 'B', 'S', 'S', 'S', 'B'],
                   'Price': [105.901, 106.969, np.nan, 107.037, 107.038, 107.136, np.nan, 107.25, np.nan],
                   'Quantity': [1000000, -300000, np.nan, 7500000, 100000, -100000, np.nan, -7800000, np.nan]
                  })

Out[318]: 
         day TradeID Security     ID BSType    Price   Quantity
0 2020-01-01      01   GOOGLE  ID001      B  105.901  1000000.0
1 2020-01-03      02   GOOGLE  ID001      S  106.969  -300000.0
2 2020-01-04      03    APPLE  ID001      B      NaN        NaN
3 2020-01-05      04   GOOGLE  ID001      B  107.037  7500000.0
4 2020-01-06      05   GOOGLE  ID001      B  107.038   100000.0
5 2020-01-07      06   GOOGLE  ID001      S  107.136  -100000.0
6 2020-01-08      07   GOOGLE  ID001      S      NaN        NaN
7 2020-01-08      08   GOOGLE  ID001      S  107.250 -7800000.0
8 2020-06-09      09   GOOGLE  ID001      B      NaN        NaN

My goal is to fill the NaN values with the ffill method, but only within the same Security and the same ID, and only for up to 60 days after the last observation (not the next 60 observations, because there may be more than one observation per day).

Here is what I tried, but it does not work: it does not replace any of my NaN values.

df=df.groupby(['day',"Security","ID"], as_index=False).fillna(method='ffill',limit=60)

The expected output should look like this (note that only the second pair of NaN values has been filled):

  • The first pair of NaN values should not be filled because it does not belong to the same Security.
  • The second pair of NaN values should be filled with the previous observation.
  • The third pair of NaN values should not be filled because it is outside the 60-day window.

Out[320]: 
         day TradeID Security     ID BSType    Price   Quantity
0 2020-01-01      01   GOOGLE  ID001      B  105.901  1000000.0
1 2020-01-03      02   GOOGLE  ID001      S  106.969  -300000.0
2 2020-01-04      03    APPLE  ID001      B      NaN        NaN
3 2020-01-05      04   GOOGLE  ID001      B  107.037  7500000.0
4 2020-01-06      05   GOOGLE  ID001      B  107.038   100000.0
5 2020-01-07      06   GOOGLE  ID001      S  107.136  -100000.0
6 2020-01-08      07   GOOGLE  ID001      S  107.136  -100000.0
7 2020-01-08      08   GOOGLE  ID001      S  107.250 -7800000.0
8 2020-06-09      09   GOOGLE  ID001      B      NaN        NaN

So my question is: is there a way to fill NaN values while limiting ffill to a certain time period?

Thank you very much for your time.

Answers:

Here’s my attempt, not sure if this is particularly scalable though:

# Forward fill within each (Security, ID) group; the grouping columns are not returned.
filled_df = df.groupby(["Security", "ID"]).ffill()
# Days since the previous row of the same group (NaN for the first row of a group).
# Note: this is the gap to the previous row, not to the last non-missing value.
df["diffs"] = df.groupby(["Security", "ID"])["day"].diff().dt.days
# Remember which values were missing before filling.
df["price_isna"] = df["Price"].isna()
df["quantity_isna"] = df["Quantity"].isna()
df = df.drop(columns=["Price", "Quantity"]).merge(filled_df, on=["day", "TradeID", "BSType"])

def reverse_fillna(value, value_isna, diffs, time_limit=60):
    # Keep original values, and filled values whose gap is within the limit.
    if (value_isna and (diffs <= time_limit)) or (not value_isna):
        return value
    else:
        return np.nan

df['Price'] = df.apply(lambda row: reverse_fillna(row['Price'], row['price_isna'], row['diffs']), axis=1)
df['Quantity'] = df.apply(lambda row: reverse_fillna(row['Quantity'], row['quantity_isna'], row['diffs']), axis=1)

df = df.drop(columns=["price_isna", "quantity_isna", "diffs"])
Answered By: ap1997
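
The per-row check in the answer above can also be written with vectorized operations, measuring the gap to the last non-missing observation rather than to the previous row. The following is a minimal sketch under stated assumptions, not a definitive implementation: `ffill_within_days` and its parameters are hypothetical names introduced here, and the frame is a reduced copy of the question's data.

```python
import numpy as np
import pandas as pd


def ffill_within_days(df, group_cols, date_col, value_cols, days=60):
    """Forward fill value_cols per group, keeping a filled value only when
    the observation it came from is at most `days` days old.

    Hypothetical helper for illustration; not part of pandas.
    """
    # mergesort is stable, so rows sharing a date keep their original order.
    out = df.sort_values(date_col, kind="mergesort").copy()
    keys = [out[c] for c in group_cols]
    limit = pd.Timedelta(days=days)
    for col in value_cols:
        # Date of the latest non-missing observation in the same group.
        last_seen = out[date_col].where(out[col].notna()).groupby(keys).ffill()
        filled = out[col].groupby(keys).ffill()
        # Comparisons with NaT are False, so rows with no earlier value stay NaN.
        within_limit = (out[date_col] - last_seen) <= limit
        out[col] = filled.where(within_limit)
    # Restore the original row order.
    return out.sort_index()


df = pd.DataFrame({
    "day": pd.to_datetime(["2020-01-01", "2020-01-03", "2020-01-04",
                           "2020-01-05", "2020-01-06", "2020-01-07",
                           "2020-01-08", "2020-01-08", "2020-06-09"]),
    "Security": ["GOOGLE", "GOOGLE", "APPLE", "GOOGLE", "GOOGLE",
                 "GOOGLE", "GOOGLE", "GOOGLE", "GOOGLE"],
    "ID": ["ID001"] * 9,
    "Price": [105.901, 106.969, np.nan, 107.037, 107.038,
              107.136, np.nan, 107.25, np.nan],
    "Quantity": [1000000, -300000, np.nan, 7500000, 100000,
                 -100000, np.nan, -7800000, np.nan],
})

result = ffill_within_days(df, ["Security", "ID"], "day", ["Price", "Quantity"])
print(result)
```

On this data only row 6 is filled (from row 5, one day earlier); row 2 has no earlier APPLE observation, and row 8 is more than 60 days after the last GOOGLE observation.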

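You can group the DataFrame on the columns Security and ID, plus an additional pd.Grouper on the day column with the frequency set to 60 days, and then use ffill to forward fill within each group. Note that the grouper splits the dates into fixed 60-day bins, so a value is carried forward only within its own bin, not for a rolling 60 days after each observation: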

g = pd.Grouper(key='day', freq='60d')
df.assign(**df.groupby(["Security","ID", g]).ffill())

         day TradeID Security     ID BSType    Price   Quantity
0 2020-01-01      01   GOOGLE  ID001      B  105.901  1000000.0
1 2020-01-03      02   GOOGLE  ID001      S  106.969  -300000.0
2 2020-01-04      03    APPLE  ID001      B      NaN        NaN
3 2020-01-05      04   GOOGLE  ID001      B  107.037  7500000.0
4 2020-01-06      05   GOOGLE  ID001      B  107.038   100000.0
5 2020-01-07      06   GOOGLE  ID001      S  107.136  -100000.0
6 2020-01-08      07   GOOGLE  ID001      S  107.136  -100000.0
7 2020-01-08      08   GOOGLE  ID001      S  107.250 -7800000.0
8 2020-06-09      09   GOOGLE  ID001      B      NaN        NaN
Answered By: Shubham Sharma
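
One thing to check with a 60-day grouper is that the bins are fixed windows, so two dates only a few days apart can land in different bins. The sketch below illustrates this with plain date arithmetic only; it assumes the bins are anchored at 2020-01-01 (the first day in the question's data), and the two dates and the `bin_number` helper are made up for illustration.

```python
import pandas as pd

# Assumption: fixed 60-day bins anchored at the first day in the data.
origin = pd.Timestamp("2020-01-01")
width = pd.Timedelta(days=60)


def bin_number(day):
    """Index of the fixed 60-day bin that contains `day` (hypothetical helper)."""
    return (pd.Timestamp(day) - origin) // width


last_valid = "2020-02-25"  # hypothetical date of the last non-missing value
missing = "2020-03-05"     # hypothetical date of a row with NaN

gap_days = (pd.Timestamp(missing) - pd.Timestamp(last_valid)).days
print(gap_days)                 # 9
print(bin_number(last_valid))   # 0
print(bin_number(missing))      # 1
```

Under that assumption the two rows are 9 days apart but fall into different bins, so a fill restricted to bins would leave the second row as NaN. If a rolling 60-day limit is required, comparing each row's date with the date of the last non-missing observation is the safer check.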
id   date        amount
123  2020-01-01  10000
123  2020-02-21  NaN
123  2020-03-16  NaN
Answered By: Danny