Count consecutive repeated values in pandas
Question:
I’m trying to highlight areas in Matplotlib where the data in a pandas DataFrame stays the same over a number of consecutive rows. Given the DataFrame below and a threshold of 3:
In:
import datetime as dt
import numpy as np
import pandas as pd

days = pd.date_range(dt.datetime.now(), dt.datetime.now() + dt.timedelta(13), freq='D')
data = [2,3,3,3,2,2,3.4,3.1,2.7,np.nan,4,4,4,4.5]
df = pd.DataFrame({'col': data})  # column named 'col' to match the output below
df = df.set_index(days)
out:
col
2021-03-12 15:13:24.727074 2.0
2021-03-13 15:13:24.727074 3.0
2021-03-14 15:13:24.727074 3.0
2021-03-15 15:13:24.727074 3.0
2021-03-16 15:13:24.727074 2.0
2021-03-17 15:13:24.727074 2.0
2021-03-18 15:13:24.727074 3.4
2021-03-19 15:13:24.727074 3.1
2021-03-20 15:13:24.727074 2.7
2021-03-21 15:13:24.727074 NaN
2021-03-22 15:13:24.727074 4.0
2021-03-23 15:13:24.727074 4.0
2021-03-24 15:13:24.727074 4.0
2021-03-25 15:13:24.727074 4.5
The ultimate objective is to return the following DataFrame, where ‘result’ tests whether the data in ‘col’ stays unchanged over at least 3 consecutive rows. The two consecutive values of 2.0 aren’t flagged because there are only 2 of them, below our threshold of >= 3.
col result
2021-03-12 15:13:24.727074 2.0 False
2021-03-13 15:13:24.727074 3.0 True
2021-03-14 15:13:24.727074 3.0 True
2021-03-15 15:13:24.727074 3.0 True
2021-03-16 15:13:24.727074 2.0 False
2021-03-17 15:13:24.727074 2.0 False
2021-03-18 15:13:24.727074 3.4 False
2021-03-19 15:13:24.727074 3.1 False
2021-03-20 15:13:24.727074 2.7 False
2021-03-21 15:13:24.727074 NaN False
2021-03-22 15:13:24.727074 4.0 True
2021-03-23 15:13:24.727074 4.0 True
2021-03-24 15:13:24.727074 4.0 True
2021-03-25 15:13:24.727074 4.5 False
I tried using cumsum(), incrementing by 1 whenever the value changes, with the following code:
df['increment'] = (df['col'].diff(1) != 0).astype('int').cumsum()
This works to get the size of the consecutive blocks using
df.groupby('increment').size() >= threshold
This gets me close, but the problem is that it breaks the link with my original DataFrame’s datetime index, which means I can’t plot the boolean data together with the original df['col'].
Answers:
Use cumsum() on the comparison with shift to identify the blocks:
# groupby exact match of values
blocks = df['col'].ne(df['col'].shift()).cumsum()
df['result'] = blocks.groupby(blocks).transform('size') >= 3
Output:
col result
2021-03-12 15:13:24.727074 2.0 False
2021-03-13 15:13:24.727074 3.0 True
2021-03-14 15:13:24.727074 3.0 True
2021-03-15 15:13:24.727074 3.0 True
2021-03-16 15:13:24.727074 2.0 False
2021-03-17 15:13:24.727074 2.0 False
2021-03-18 15:13:24.727074 3.4 False
2021-03-19 15:13:24.727074 3.1 False
2021-03-20 15:13:24.727074 2.7 False
2021-03-21 15:13:24.727074 NaN False
2021-03-22 15:13:24.727074 4.0 True
2021-03-23 15:13:24.727074 4.0 True
2021-03-24 15:13:24.727074 4.0 True
2021-03-25 15:13:24.727074 4.5 False
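To see why this keeps the datetime index while the groupby(...).size() attempt in the question does not, here is a small self-contained sketch. The fixed start date and the names per_block and per_row are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same values as in the question, on a fixed (assumed) daily index
data = [2, 3, 3, 3, 2, 2, 3.4, 3.1, 2.7, np.nan, 4, 4, 4, 4.5]
df = pd.DataFrame({'col': data},
                  index=pd.date_range('2021-03-12', periods=len(data), freq='D'))

# A new block starts wherever a value differs from the previous row
blocks = df['col'].ne(df['col'].shift()).cumsum()

# size() aggregates: one row per block, indexed by block id
per_block = blocks.groupby(blocks).size()

# transform('size') broadcasts each block's size back onto its rows,
# so the result shares df's datetime index
per_row = blocks.groupby(blocks).transform('size')

df['result'] = per_row >= 3
```

Because per_row has one entry per original row, the comparison can be assigned straight back to df and plotted against df['col'].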
Note: it’s not ideal to compare floats for exact equality. Instead, we can use a tolerance, something like:
# groupby consecutive rows if the differences are not significant
# (le + ~ rather than gt, so that a NaN difference also starts a new block)
blocks = (~df['col'].diff().abs().le(1e-6)).cumsum()
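A self-contained sketch of the tolerance idea, wrapped in a hypothetical helper flag_runs (the function name, its parameters, and the sample series are assumptions made for illustration). One detail to watch: diff() yields NaN next to missing values and every comparison with NaN is False, so the sketch tests "within tolerance" and negates it, which starts a new block at each NaN.

```python
import numpy as np
import pandas as pd

def flag_runs(s: pd.Series, threshold: int = 3, tol: float = 1e-6) -> pd.Series:
    """Flag rows in a run of >= threshold near-equal consecutive values."""
    # True where a value is within tol of the previous row; NaN diffs give False
    close = s.diff().abs().le(tol)
    # Every row that is not close to its predecessor (including NaN rows and
    # the row right after a NaN) starts a new block
    blocks = (~close).cumsum()
    return blocks.groupby(blocks).transform('size').ge(threshold)

# Illustrative data: the second 3 is off by 1e-9, and a NaN splits the series
s = pd.Series([2, 3, 3 + 1e-9, 3, 2, 2, np.nan, 4, 4, 4, 4.5])
flags = flag_runs(s)
```

Since each row is compared only with the row before it, a slow drift in steps smaller than tol would still be chained into one block; whether that is acceptable depends on the data.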
Compare each row with the previous one using shift to get a boolean series, apply cumsum to turn it into block ids, group by those ids, and use transform to get the size of each block.
df = df.assign(result=df.groupby((~df['col'].eq(df['col'].shift())).cumsum())['col'].transform('size').ge(3))
col result
2021-03-13 05:32:30.309303 2.0 False
2021-03-14 05:32:30.309303 3.0 True
2021-03-15 05:32:30.309303 3.0 True
2021-03-16 05:32:30.309303 3.0 True
2021-03-17 05:32:30.309303 2.0 False
2021-03-18 05:32:30.309303 2.0 False
2021-03-19 05:32:30.309303 3.4 False
2021-03-20 05:32:30.309303 3.1 False
2021-03-21 05:32:30.309303 2.7 False
2021-03-22 05:32:30.309303 NaN False
2021-03-23 05:32:30.309303 4.0 True
2021-03-24 05:32:30.309303 4.0 True
2021-03-25 05:32:30.309303 4.0 True
2021-03-26 05:32:30.309303 4.5 False
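Once a boolean column like the one above is aligned with the index, the contiguous True runs can be turned into (start, end) pairs for shading. This is only a sketch: the fixed dates and the names result, run_id and spans are assumptions standing in for the real data.

```python
import pandas as pd

# Hypothetical flags on an assumed daily index, standing in for df['result']
idx = pd.date_range('2021-03-12', periods=14, freq='D')
result = pd.Series([False, True, True, True, False, False, False,
                    False, False, False, True, True, True, False], index=idx)

# Label each contiguous run of identical flags
run_id = result.ne(result.shift()).cumsum()

# Keep only the flagged rows and take the first and last timestamp of each run
flagged = result[result]
spans = [(g.index[0], g.index[-1])
         for _, g in flagged.groupby(run_id[result])]
```

Each pair could then be passed to Matplotlib, for example ax.axvspan(start, end, alpha=0.3), to shade the region where the value did not change.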
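# df1 is assumed to be the question's DataFrame (with column 'col')
def function1(ss: pd.Series):
    # ss is a 3-row window of "equals previous row" flags; if the last two
    # hold, the three underlying values are identical
    if ss.iloc[1:].all():
        df1.loc[ss.index, 'result'] = True
    return 0  # rolling().apply() expects a numeric return value

df1.col.diff().eq(0).rolling(3).apply(function1).pipe(lambda dd: df1.assign(result=df1.result.fillna(False)))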
Output:
col result
2021-03-12 15:13:24.727074 2.0 False
2021-03-13 15:13:24.727074 3.0 True
2021-03-14 15:13:24.727074 3.0 True
2021-03-15 15:13:24.727074 3.0 True
2021-03-16 15:13:24.727074 2.0 False
2021-03-17 15:13:24.727074 2.0 False
2021-03-18 15:13:24.727074 3.4 False
2021-03-19 15:13:24.727074 3.1 False
2021-03-20 15:13:24.727074 2.7 False
2021-03-21 15:13:24.727074 NaN False
2021-03-22 15:13:24.727074 4.0 True
2021-03-23 15:13:24.727074 4.0 True
2021-03-24 15:13:24.727074 4.0 True
2021-03-25 15:13:24.727074 4.5 False
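The same rolling-window idea can also be written without mutating a DataFrame from inside apply. The sketch below is an alternative, not the answerer's code: the helper name flag_runs_rolling and its parameter n are assumptions, and it swaps the apply callback for two rolling passes.

```python
import numpy as np
import pandas as pd

def flag_runs_rolling(s: pd.Series, n: int = 3) -> pd.Series:
    """Flag rows inside a run of at least n equal consecutive values (n >= 2)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    # 1 where a value equals the previous row, else 0 (NaN never compares equal)
    same = s.diff().eq(0).astype(int)
    # A row closes a run of n equal values when the last n-1 comparisons all held
    ends = same.rolling(n - 1).sum().eq(n - 1).astype(int)
    # Spread each run end back over the n rows of its window:
    # reverse, take a trailing max over n rows, reverse again
    return ends[::-1].rolling(n, min_periods=1).max()[::-1].astype(bool)

# The question's values on a fixed (assumed) daily index
idx = pd.date_range('2021-03-12', periods=14, freq='D')
s = pd.Series([2, 3, 3, 3, 2, 2, 3.4, 3.1, 2.7, np.nan, 4, 4, 4, 4.5], index=idx)
flags = flag_runs_rolling(s)
```

The returned series keeps the original index, so it can be assigned directly as a result column.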