Pandas: flag consecutive values
Question:
I have a pandas series of the form [0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0 , 0 , 1].
0: indicates economic increase.
1: indicates economic decline.
A recession is signaled by two consecutive declines (1).
The end of the recession is signaled by two consecutive increases (0).
In the above dataset I have two recessions: one begins at index 3 and ends at index 5, the other begins at index 8 and ends at index 11.
I am at a loss for how to approach this with pandas. I would like to identify the indices for the start and end of each recession. Any assistance would be appreciated.
Here is my Python attempt at a solution.
import numpy as np

np_decline = np.array([0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1])
recession_start_flag = 0
recession_end_flag = 0
recession_start = []
recession_end = []
for i in range(len(np_decline) - 1):
    # Two consecutive declines while not in a recession: a recession starts here.
    if recession_start_flag == 0 and np_decline[i] == 1 and np_decline[i + 1] == 1:
        recession_start.append(i)
        recession_start_flag = 1
    # Two consecutive increases while in a recession: it ended on the previous row.
    if recession_start_flag == 1 and np_decline[i] == 0 and np_decline[i + 1] == 0:
        recession_end.append(i - 1)
        recession_start_flag = 0
print(recession_start)
print(recession_end)
Is there a more pandas-centric approach?
Leon
Answers:
You can use shift:
df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0 , 0 , 1], columns=['signal'])
df_prev = df.shift(1)['signal']
df_next = df.shift(-1)['signal']
df_next2 = df.shift(-2)['signal']
df.loc[(df_prev != 1) & (df['signal'] == 1) & (df_next == 1), 'start'] = 1
df.loc[(df['signal'] != 0) & (df_next == 0) & (df_next2 == 0), 'end'] = 1
df.fillna(0, inplace=True)
df = df.astype(int)
    signal  start  end
0        0      0    0
1        1      0    0
2        0      0    0
3        1      1    0
4        1      0    0
5        1      0    1
6        0      0    0
7        0      0    0
8        1      1    0
9        1      0    0
10       0      0    0
11       1      0    1
12       0      0    0
13       0      0    0
14       1      0    0
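If only the index positions are needed, the flag columns above can be reduced to two lists. A minimal sketch, assuming the same conditions as the shift approach; the names `sig`, `is_start`, `is_end`, `starts`, and `ends` are illustrative and not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1], columns=['signal'])
sig = df['signal']

# Same conditions as the shift-based answer, kept as Boolean masks.
is_start = (sig.shift(1) != 1) & (sig == 1) & (sig.shift(-1) == 1)
is_end = (sig != 0) & (sig.shift(-1) == 0) & (sig.shift(-2) == 0)

# Index labels where each mask is True.
starts = df.index[is_start.to_numpy()].tolist()
ends = df.index[is_end.to_numpy()].tolist()
print(starts, ends)
```

Pairing the two lists with `zip(starts, ends)` gives one (start, end) tuple per recession, provided every start has a matching end in the data.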
Use rolling(2):
s = pd.Series([0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0 , 0 , 1])
I subtract .5 so the rolling sum is 1 when a recession starts and -1 when it stops.
s2 = s.sub(.5).rolling(2).sum()
Since both 1 and -1 evaluate to True, I can mask the rolling signal down to just the starts and stops and ffill. Get truth values of when they are positive or negative with gt(0).
pd.concat([s, s2.mask(~s2.astype(bool)).ffill().gt(0)], axis=1, keys=['signal', 'isRec'])
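To see why the mask works, it can help to inspect the intermediate rolling sum. A small sketch on the same series; the names `pair_of_declines` and `pair_of_increases` are illustrative only:

```python
import pandas as pd

s = pd.Series([0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1])

# After subtracting 0.5 each value is +0.5 or -0.5, so a window of two sums to
# +1 (two declines), -1 (two increases), or 0 (mixed). The first window is NaN.
s2 = s.sub(.5).rolling(2).sum()

# Positions where the window held two declines or two increases.
pair_of_declines = s2.index[s2.eq(1).to_numpy()].tolist()
pair_of_increases = s2.index[s2.eq(-1).to_numpy()].tolist()
print(pair_of_declines)
print(pair_of_increases)
```

Note that `rolling(2)` looks backward, so each nonzero value sits on the second element of its pair. On this series the forward-filled flag therefore first becomes True at index 4, one position after the start at index 3 named in the question.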
Similar idea using shift, but writing the result as a single Boolean column:
# Boolean indexers for recession start and stops.
rec_start = (df['signal'] == 1) & (df['signal'].shift(-1) == 1)
rec_end = (df['signal'] == 0) & (df['signal'].shift(-1) == 0)
# Mark the recession start/stops as True/False.
df.loc[rec_start, 'recession'] = True
df.loc[rec_end, 'recession'] = False
# Forward fill the recession column with the last known Boolean.
# Fill any NaN's as False (i.e. locations before the first start/stop).
df['recession'] = df['recession'].ffill().fillna(False)
The resulting output:
    signal  recession
0        0      False
1        1      False
2        0      False
3        1       True
4        1       True
5        1       True
6        0      False
7        0      False
8        1       True
9        1       True
10       0       True
11       1       True
12       0      False
13       0      False
14       1      False
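The question asks for start and end indices, and these can be read off a Boolean column like this one by comparing each row with its neighbours. A hedged sketch, assuming the `recession` values shown in the output above; the names `rec`, `prev_rec`, `next_rec`, `starts`, and `ends` are illustrative:

```python
import pandas as pd

# The 'recession' column from the output above, as a Boolean Series.
rec = pd.Series([False, False, False, True, True, True, False, False,
                 True, True, True, True, False, False, False])

# A block starts where the row is True and the previous row is not,
# and ends where the row is True and the next row is not.
prev_rec = rec.shift(1, fill_value=False)
next_rec = rec.shift(-1, fill_value=False)
starts = rec.index[(rec & ~prev_rec).to_numpy()].tolist()
ends = rec.index[(rec & ~next_rec).to_numpy()].tolist()
print(list(zip(starts, ends)))
```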
The start of a run of 1’s satisfies the condition
x_prev = x.shift(1)
x_next = x.shift(-1)
((x_prev != 1) & (x == 1) & (x_next == 1))
That is to say, the value at the start of a run is 1 and the previous value is not 1 and the next value is 1. Similarly, the end of a run satisfies the condition
((x == 1) & (x_next == 0) & (x_next2 == 0))
since the value at the end of a run is 1 and the next two values are 0.
We can find indices where these conditions are true using np.flatnonzero:
import numpy as np
import pandas as pd
x = pd.Series([0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0 , 0 , 1])
x_prev = x.shift(1)
x_next = x.shift(-1)
x_next2 = x.shift(-2)
df = pd.DataFrame(
    dict(start = np.flatnonzero((x_prev != 1) & (x == 1) & (x_next == 1)),
         end = np.flatnonzero((x == 1) & (x_next == 0) & (x_next2 == 0))))
print(df[['start', 'end']])
yields
   start  end
0      3    5
1      8   11
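Note that `np.flatnonzero` returns integer positions, which coincide with the index labels only for a default integer index. A minimal sketch for a labelled index, assuming hypothetical single-letter labels in place of real dates; the names `start_pos`, `end_pos`, and `out` are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical labels 'a'..'o' stand in for a real (e.g. date) index.
x = pd.Series([0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1],
              index=list("abcdefghijklmno"))
x_prev, x_next, x_next2 = x.shift(1), x.shift(-1), x.shift(-2)

# Integer positions of run starts and run ends, using the same conditions as above.
start_pos = np.flatnonzero(((x_prev != 1) & (x == 1) & (x_next == 1)).to_numpy())
end_pos = np.flatnonzero(((x == 1) & (x_next == 0) & (x_next2 == 0)).to_numpy())

# Translate positions back to index labels.
out = pd.DataFrame({'start': x.index[start_pos], 'end': x.index[end_pos]})
print(out)
```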
You can use scipy.signal.find_peaks for this problem.
import numpy as np
from scipy.signal import find_peaks

np_decline = np.array([0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1])
# find_peaks returns a tuple: (peak indices, dict of peak properties).
peaks = find_peaks(np_decline, width=2)
recession_start_loc = peaks[1]['left_bases'][0]
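# Assumes df1 is a DataFrame whose only column, 'col1', holds the 0/1 signal.
def function2(dd: pd.DataFrame):
    # dd.iat[0, 1] is col3: the count of leading 1's computed per group below.
    if dd.iat[0, 1] >= 2:
        if dd.query("col1==0").pipe(len) == 1:
            return (dd.index.min(), dd.index.max() + 1)
        else:
            dd1 = dd.query("col1==1")
            return (dd1.index.min(), dd1.index.max())

# A new group starts at every 0 -> 1 transition.
col2 = df1.col1.diff().eq(1).cumsum()
# group_keys=False keeps the original row index in the intermediate result.
(df1.groupby(col2, group_keys=False)
    .apply(lambda dd: dd.assign(col3=dd.col1.cumprod().sum()))
    .groupby('col3', sort=False).apply(function2).dropna())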
out:
col3
3     (3, 5)
2    (8, 11)
dtype: object