Pandas rolling gives NaN
Question:
I’m looking at the tutorials on window functions, but I don’t quite understand why the following code produces NaNs.
If I understand correctly, the code creates a rolling window of size 2. Why do the first, fourth, and fifth rows have NaN? At first, I thought it was because adding NaN to another number would produce NaN, but then I’m not sure why the second row wouldn’t be NaN.
import numpy as np
import pandas as pd
dft = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
                   index=pd.date_range('20130101 09:00:00', periods=5, freq='s'))
In [58]: dft.rolling(2).sum()
Out[58]:
B
2013-01-01 09:00:00 NaN
2013-01-01 09:00:01 1.0
2013-01-01 09:00:02 3.0
2013-01-01 09:00:03 NaN
2013-01-01 09:00:04 NaN
Answers:
Indeed, adding NaN and anything else gives NaN. So:
input + rolled = sum
    0      nan   nan
    1        0     1
    2        1     3
  nan        2   nan
    4      nan   nan
There’s no reason for the second row to be NaN, because it’s the sum of the original first and second elements, neither of which is NaN.
Another way to do it is:
dft.B + dft.B.shift()
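As a quick sanity check, the shifted sum can be compared against the rolling result. This is only a minimal sketch, assuming the dft from the question; the names rolled, shifted and same are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

dft = pd.DataFrame(
    {'B': [0, 1, 2, np.nan, 4]},
    index=pd.date_range('20130101 09:00:00', periods=5, freq='s'),
)

# Rolling sum over a window of 2 rows (default min_periods for this window).
rolled = dft['B'].rolling(2).sum()

# Manual equivalent: each value plus the value one row earlier.
shifted = dft['B'] + dft['B'].shift()

# Compare element-wise, treating NaN as equal to NaN.
same = np.allclose(rolled, shifted, equal_nan=True)
print(same)
```

The match holds for a window of 2 with the default settings; a larger window would need one shifted term per extra row.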
The first thing to notice is that, by default, rolling looks for n-1 prior rows of data to aggregate, where n is the window size (for an integer window, min_periods defaults to the window size). If that condition is not met, it returns NaN for the window. This is what happens at the first row. In the fourth and fifth rows, one of the two values in the window is NaN, so the window again holds fewer valid observations than required.
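To make the rule concrete, here is a minimal sketch (assuming the dft from the question) that counts the non-NaN values in each 2-row window and compares that count with where rolling(2).sum() returns a number. The names valid and has_result are illustrative only.

```python
import numpy as np
import pandas as pd

dft = pd.DataFrame(
    {'B': [0, 1, 2, np.nan, 4]},
    index=pd.date_range('20130101 09:00:00', periods=5, freq='s'),
)

# Number of non-NaN values in each 2-row window (1.0 marks a valid value).
valid = dft['B'].notna().astype(float).rolling(2, min_periods=1).sum()

# With the default min_periods (2 for this window), a result appears only
# where the window holds 2 valid observations.
has_result = dft['B'].rolling(2).sum().notna()

print(valid.tolist())       # [1.0, 2.0, 2.0, 1.0, 1.0]
print(has_result.tolist())  # [False, True, True, False, False]
```

Rows with a count below 2 are exactly the rows that come back as NaN.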
If you would like to avoid returning NaN, you could pass min_periods=1 to the method, which reduces the minimum required number of valid observations in the window to 1 instead of 2:
>>> dft.rolling(2, min_periods=1).sum()
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:01 1.0
2013-01-01 09:00:02 3.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:04 4.0
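Instead of rolling(2), use rolling('2d'). With a time-based window, min_periods defaults to 1, so a window only needs one valid value to return a number: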
dft = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
index=pd.date_range('20130101 09:00:00', periods=5, freq='s'))
dft.rolling('2d').sum()
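For reference, a minimal sketch of what the offset-based call returns on this data, assuming the dft from the question (the name result is illustrative). NaN entries are skipped in the sum rather than propagated.

```python
import numpy as np
import pandas as pd

dft = pd.DataFrame(
    {'B': [0, 1, 2, np.nan, 4]},
    index=pd.date_range('20130101 09:00:00', periods=5, freq='s'),
)

# Time-offset window: each row aggregates every row in the preceding 2 days.
# All five rows are only seconds apart, so each window reaches back to the
# first row.
result = dft.rolling('2d').sum()
print(result['B'].tolist())  # [0.0, 1.0, 3.0, 3.0, 7.0]
```

Because the whole frame fits inside two days, this behaves like a running total here, not like a 2-row window, so it is only a drop-in replacement when the offset matches the intended window length.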
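Using min_periods=1 can lead to high variance in the rolling values, because the first windows aggregate fewer observations than the later ones. Another way to remove NaN values is to fill them on the rolling result, here with a backfill followed by a forward fill (bfill() and ffill() replace fillna(method=...), which is deprecated in pandas 2.x):
>>> dft.rolling(2).sum().bfill().ffill()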
B
2013-01-01 09:00:00 1.0
2013-01-01 09:00:01 1.0
2013-01-01 09:00:02 3.0
2013-01-01 09:00:03 3.0
2013-01-01 09:00:04 3.0
An example with a rolling window size of 6 illustrates the issue:
>>> dft = pd.DataFrame({'B': [10, 1, 10, 1, 10, 1, 10, 1, 10, 1]}, index=pd.date_range('20130101 09:00:00', periods=10, freq='s'))
>>> dft.rolling(6, min_periods=1).sum()
B
2013-01-01 09:00:00 10.0
2013-01-01 09:00:01 11.0
2013-01-01 09:00:02 21.0
2013-01-01 09:00:03 22.0
2013-01-01 09:00:04 32.0
2013-01-01 09:00:05 33.0
2013-01-01 09:00:06 33.0
2013-01-01 09:00:07 33.0
2013-01-01 09:00:08 33.0
2013-01-01 09:00:09 33.0
>>> dft.rolling(6).sum().bfill()  # equivalent to the deprecated fillna(method='bfill')
B
2013-01-01 09:00:00 33.0
2013-01-01 09:00:01 33.0
2013-01-01 09:00:02 33.0
2013-01-01 09:00:03 33.0
2013-01-01 09:00:04 33.0
2013-01-01 09:00:05 33.0
2013-01-01 09:00:06 33.0
2013-01-01 09:00:07 33.0
2013-01-01 09:00:08 33.0
2013-01-01 09:00:09 33.0
Whereas using min_periods=1 leads to values below 33.0 for the first 5 rows, backfilling (fillna(method='bfill'), or bfill() in current pandas) produces the expected 33.0 in every row. Depending on your use case, you might prefer filling over min_periods=1.