How to fillna limited by date in a groupby
Question:
I am working with the following Dataframe that has some NaN values inside.
df = pd.DataFrame({'day':[pd.datetime(2020,1,1),pd.datetime(2020,1,3),pd.datetime(2020,1,4),pd.datetime(2020,1,5),pd.datetime(2020,1,6),pd.datetime(2020,1,7),pd.datetime(2020,1,8),pd.datetime(2020,1,8),pd.datetime(2020,6,9)],
'TradeID':['01','02','03','04','05','06','07','08','09'],
'Security': ['GOOGLE', 'GOOGLE', 'APPLE', 'GOOGLE', 'GOOGLE','GOOGLE','GOOGLE','GOOGLE','GOOGLE'],
'ID': ['ID001', 'ID001', 'ID001', 'ID001', 'ID001','ID001','ID001','ID001','ID001'],
'BSType': ['B', 'S', 'B', 'B', 'B','S','S','S','B'],
'Price':[105.901,106.969,np.nan,107.037,107.038,107.136,np.nan,107.25,np.nan],
'Quantity':[1000000,-300000,np.nan,7500000,100000,-100000,np.nan,-7800000,np.nan]
})
Out[318]:
day TradeID Security ID BSType Price Quantity
0 2020-01-01 01 GOOGLE ID001 B 105.901 1000000.0
1 2020-01-03 02 GOOGLE ID001 S 106.969 -300000.0
2 2020-01-04 03 APPLE ID001 B NaN NaN
3 2020-01-05 04 GOOGLE ID001 B 107.037 7500000.0
4 2020-01-06 05 GOOGLE ID001 B 107.038 100000.0
5 2020-01-07 06 GOOGLE ID001 S 107.136 -100000.0
6 2020-01-08 07 GOOGLE ID001 S NaN NaN
7 2020-01-08 08 GOOGLE ID001 S 107.250 -7800000.0
8 2020-06-09 09 GOOGLE ID001 B NaN NaN
My goal is to fillna with the method ffill only for the same Security, same ID and limited for the next 60 days (not the next 60 observations, because there may be more than one observation per day).
Here is what i tried but is not working, it does not replace any of my NaN values
df=df.groupby(['day',"Security","ID"], as_index=False).fillna(method='ffill',limit=60)
The expected output should look like this: (Note that only the second pair of NaN values have been filled)
- The first pair of NaN values should not be filled because is not the same Security.
- The second pair of NaN values should be filled with the previous observation.
- The third pair on NaN should not be filled because they are out of the 60 days scope.
Out[320]:
day TradeID Security ID BSType Price Quantity
0 2020-01-01 01 GOOGLE ID001 B 105.901 1000000.0
1 2020-01-03 02 GOOGLE ID001 S 106.969 -300000.0
2 2020-01-04 03 APPLE ID001 B NaN NaN
3 2020-01-05 04 GOOGLE ID001 B 107.037 7500000.0
4 2020-01-06 05 GOOGLE ID001 B 107.038 100000.0
5 2020-01-07 06 GOOGLE ID001 S 107.136 -100000.0
6 2020-01-08 07 GOOGLE ID001 S 107.136 -100000.0
7 2020-01-08 08 GOOGLE ID001 S 107.250 -7800000.0
8 2020-06-09 09 GOOGLE ID001 B NaN NaN
So, my question is, ¿is there a plausible way to fill NaN values limiting the ffill method on a certain period?
Thank you very much for you time.
Answers:
Here’s my attempt, not sure if this is particularly scalable though:
filled_df = df.groupby(["Security","ID"], as_index=False).fillna(method='ffill')
diffs = df.groupby(["Security","ID"])["day"].diff().dt.days
df["diffs"] = diffs
df["price_isna"] = df["Price"].isna()
df["quantity_isna"] = df["Quantity"].isna()
df = df.drop(columns=["Price", "Quantity"]).merge(filled_df, on=["day", "TradeID", "BSType"])
def reverse_fillna(value, value_isna, diffs, time_limit=60):
if (value_isna and (diffs <= time_limit)) or (not value_isna):
return value
else:
return np.nan
df['Price'] = df.apply(lambda row: reverse_fillna(row['Price'], row['price_isna'], row['diffs']), axis=1)
df['Quantity'] = df.apply(lambda row: reverse_fillna(row['Quantity'], row['quantity_isna'], row['diffs']), axis=1)
df.drop(columns=["price_isna", "quantity_isna", "diffs"], inplace=True)
You can group
the dataframe on columns Security
and ID
along with an additional grouper
for column day
with frequency set to 60 days
then use ffill
to forward fill the values for the next 60 days
:
g = pd.Grouper(key='day', freq='60d')
df.assign(**df.groupby(["Security","ID", g]).ffill())
day TradeID Security ID BSType Price Quantity
0 2020-01-01 01 GOOGLE ID001 B 105.901 1000000.0
1 2020-01-03 02 GOOGLE ID001 S 106.969 -300000.0
2 2020-01-04 03 APPLE ID001 B NaN NaN
3 2020-01-05 04 GOOGLE ID001 B 107.037 7500000.0
4 2020-01-06 05 GOOGLE ID001 B 107.038 100000.0
5 2020-01-07 06 GOOGLE ID001 S 107.136 -100000.0
6 2020-01-08 07 GOOGLE ID001 S 107.136 -100000.0
7 2020-01-08 08 GOOGLE ID001 S 107.250 -7800000.0
8 2020-06-09 09 GOOGLE ID001 B NaN NaN
id
date
amount
123
2020-01-01
10000
123
2020-02-21
NaN
123
2020-03-16
NaN
I am working with the following Dataframe that has some NaN values inside.
df = pd.DataFrame({'day':[pd.datetime(2020,1,1),pd.datetime(2020,1,3),pd.datetime(2020,1,4),pd.datetime(2020,1,5),pd.datetime(2020,1,6),pd.datetime(2020,1,7),pd.datetime(2020,1,8),pd.datetime(2020,1,8),pd.datetime(2020,6,9)],
'TradeID':['01','02','03','04','05','06','07','08','09'],
'Security': ['GOOGLE', 'GOOGLE', 'APPLE', 'GOOGLE', 'GOOGLE','GOOGLE','GOOGLE','GOOGLE','GOOGLE'],
'ID': ['ID001', 'ID001', 'ID001', 'ID001', 'ID001','ID001','ID001','ID001','ID001'],
'BSType': ['B', 'S', 'B', 'B', 'B','S','S','S','B'],
'Price':[105.901,106.969,np.nan,107.037,107.038,107.136,np.nan,107.25,np.nan],
'Quantity':[1000000,-300000,np.nan,7500000,100000,-100000,np.nan,-7800000,np.nan]
})
Out[318]:
day TradeID Security ID BSType Price Quantity
0 2020-01-01 01 GOOGLE ID001 B 105.901 1000000.0
1 2020-01-03 02 GOOGLE ID001 S 106.969 -300000.0
2 2020-01-04 03 APPLE ID001 B NaN NaN
3 2020-01-05 04 GOOGLE ID001 B 107.037 7500000.0
4 2020-01-06 05 GOOGLE ID001 B 107.038 100000.0
5 2020-01-07 06 GOOGLE ID001 S 107.136 -100000.0
6 2020-01-08 07 GOOGLE ID001 S NaN NaN
7 2020-01-08 08 GOOGLE ID001 S 107.250 -7800000.0
8 2020-06-09 09 GOOGLE ID001 B NaN NaN
My goal is to fillna with the method ffill only for the same Security, same ID and limited for the next 60 days (not the next 60 observations, because there may be more than one observation per day).
Here is what i tried but is not working, it does not replace any of my NaN values
df=df.groupby(['day',"Security","ID"], as_index=False).fillna(method='ffill',limit=60)
The expected output should look like this: (Note that only the second pair of NaN values have been filled)
- The first pair of NaN values should not be filled because is not the same Security.
- The second pair of NaN values should be filled with the previous observation.
- The third pair on NaN should not be filled because they are out of the 60 days scope.
Out[320]:
day TradeID Security ID BSType Price Quantity
0 2020-01-01 01 GOOGLE ID001 B 105.901 1000000.0
1 2020-01-03 02 GOOGLE ID001 S 106.969 -300000.0
2 2020-01-04 03 APPLE ID001 B NaN NaN
3 2020-01-05 04 GOOGLE ID001 B 107.037 7500000.0
4 2020-01-06 05 GOOGLE ID001 B 107.038 100000.0
5 2020-01-07 06 GOOGLE ID001 S 107.136 -100000.0
6 2020-01-08 07 GOOGLE ID001 S 107.136 -100000.0
7 2020-01-08 08 GOOGLE ID001 S 107.250 -7800000.0
8 2020-06-09 09 GOOGLE ID001 B NaN NaN
So, my question is, ¿is there a plausible way to fill NaN values limiting the ffill method on a certain period?
Thank you very much for you time.
Here’s my attempt, not sure if this is particularly scalable though:
filled_df = df.groupby(["Security","ID"], as_index=False).fillna(method='ffill')
diffs = df.groupby(["Security","ID"])["day"].diff().dt.days
df["diffs"] = diffs
df["price_isna"] = df["Price"].isna()
df["quantity_isna"] = df["Quantity"].isna()
df = df.drop(columns=["Price", "Quantity"]).merge(filled_df, on=["day", "TradeID", "BSType"])
def reverse_fillna(value, value_isna, diffs, time_limit=60):
if (value_isna and (diffs <= time_limit)) or (not value_isna):
return value
else:
return np.nan
df['Price'] = df.apply(lambda row: reverse_fillna(row['Price'], row['price_isna'], row['diffs']), axis=1)
df['Quantity'] = df.apply(lambda row: reverse_fillna(row['Quantity'], row['quantity_isna'], row['diffs']), axis=1)
df.drop(columns=["price_isna", "quantity_isna", "diffs"], inplace=True)
You can group
the dataframe on columns Security
and ID
along with an additional grouper
for column day
with frequency set to 60 days
then use ffill
to forward fill the values for the next 60 days
:
g = pd.Grouper(key='day', freq='60d')
df.assign(**df.groupby(["Security","ID", g]).ffill())
day TradeID Security ID BSType Price Quantity
0 2020-01-01 01 GOOGLE ID001 B 105.901 1000000.0
1 2020-01-03 02 GOOGLE ID001 S 106.969 -300000.0
2 2020-01-04 03 APPLE ID001 B NaN NaN
3 2020-01-05 04 GOOGLE ID001 B 107.037 7500000.0
4 2020-01-06 05 GOOGLE ID001 B 107.038 100000.0
5 2020-01-07 06 GOOGLE ID001 S 107.136 -100000.0
6 2020-01-08 07 GOOGLE ID001 S 107.136 -100000.0
7 2020-01-08 08 GOOGLE ID001 S 107.250 -7800000.0
8 2020-06-09 09 GOOGLE ID001 B NaN NaN
id | date | amount |
---|---|---|
123 | 2020-01-01 | 10000 |
123 | 2020-02-21 | NaN |
123 | 2020-03-16 | NaN |