Isolate sequence of positive numbers in a pandas dataframe
Question:
I would like to identify what I call "periods" of data stocked in a pandas dataframe.
Let’s say i have these values:
values
1 0
2 8
3 1
4 0
5 5
6 6
7 4
8 7
9 0
10 2
11 9
12 1
13 0
I would like to identify sequences of strictly positive numbers with length superior or equal to 3 numbers. Each non strictly positive numbers would end an ongoing sequence.
This would give :
values period
1 0 None
2 8 None
3 1 None
4 0 None
5 5 1
6 6 1
7 4 1
8 7 1
9 0 None
10 2 2
11 9 2
12 1 2
13 0 None
Answers:
You can try
sign = np.sign(df['values'])
m = sign.ne(sign.shift()).cumsum() # continuous same value group
df['period'] = (df[sign.eq(1)] # Exclude non-positive numbers
.groupby(m)
['values'].filter(lambda col: len(col) >= 3)
.groupby(m)
.ngroup() + 1
)
print(df)
values period
1 0 NaN
2 8 NaN
3 1 NaN
4 0 NaN
5 5 1.0
6 6 1.0
7 4 1.0
8 7 1.0
9 0 NaN
10 2 2.0
11 9 2.0
12 1 2.0
13 0 NaN
Using boolean arithmetics:
N = 3
m1 = df['values'].le(0)
m2 = df.groupby(m1.cumsum())['values'].transform('count').gt(N)
df['period'] = (m1&m2).cumsum().where((~m1)&m2)
output:
values period
1 0 NaN
2 8 NaN
3 1 NaN
4 0 NaN
5 5 1.0
6 6 1.0
7 4 1.0
8 7 1.0
9 0 NaN
10 2 2.0
11 9 2.0
12 1 2.0
13 0 NaN
intermediates:
values m1 m2 CS(m1) m1&m2 CS(m1&m2) (~m1)&m2 period
1 0 True False 1 False 0 False NaN
2 8 False False 1 False 0 False NaN
3 1 False False 1 False 0 False NaN
4 0 True True 2 True 1 False NaN
5 5 False True 2 False 1 True 1.0
6 6 False True 2 False 1 True 1.0
7 4 False True 2 False 1 True 1.0
8 7 False True 2 False 1 True 1.0
9 0 True True 3 True 2 False NaN
10 2 False True 3 False 2 True 2.0
11 9 False True 3 False 2 True 2.0
12 1 False True 3 False 2 True 2.0
13 0 True False 4 False 2 False NaN
One simple approach using find_peaks
to find the plateaus (positive consecutive integers) of at least size 3:
import numpy as np
import pandas as pd
from scipy.signal import find_peaks
df = pd.DataFrame.from_dict({'values': {0: 0, 1: 8, 2: 1, 3: 0, 4: 5, 5: 6, 6: 4, 7: 7, 8: 0, 9: 2, 10: 9, 11: 1, 12: 0}})
_, plateaus = find_peaks((df["values"] > 0).to_numpy(), plateau_size=3)
indices = np.arange(len(df["values"]))[:, None]
indices = (indices >= plateaus["left_edges"]) & (indices <= plateaus["right_edges"])
res = (indices * (np.arange(indices.shape[1]) + 1)).sum(axis=1)
df["periods"] = res
print(df)
Output
values periods
0 0 0
1 8 0
2 1 0
3 0 0
4 5 1
5 6 1
6 4 1
7 7 1
8 0 0
9 2 2
10 9 2
11 1 2
12 0 0
A simple solution:
count = 0
n_groups = 0
seq_idx = [None]*len(df)
for i in range(len(df)):
if df.iloc[i]['values'] > 0:
count += 1
else:
if count >= 3:
n_groups += 1
seq_idx[i-count: i] = [n_groups]*count
count = 0
df['period'] = seq_idx
Output:
values period
0 0 NaN
1 8 NaN
2 1 NaN
3 0 NaN
4 5 1.0
5 6 1.0
6 4 1.0
7 7 1.0
8 0 NaN
9 2 2.0
10 9 2.0
11 1 2.0
12 0 NaN
def function1(dd:pd.DataFrame):
dd.loc[:,'period']=None
if len(dd)>=4:
dd.iloc[1:,2]=dd.iloc[1:,1]
return dd
df1.assign(col1=df1.le(0).cumsum().sub(1)).groupby('col1').apply(function1)
out:
values col1 period
0 0 0 None
1 8 0 None
2 1 0 None
3 0 1 None
4 5 1 1
5 6 1 1
6 4 1 1
7 7 1 1
8 0 2 None
9 2 2 2
10 9 2 2
11 1 2 2
12 0 3 None
I would like to identify what I call "periods" of data stocked in a pandas dataframe.
Let’s say i have these values:
values
1 0
2 8
3 1
4 0
5 5
6 6
7 4
8 7
9 0
10 2
11 9
12 1
13 0
I would like to identify sequences of strictly positive numbers with length superior or equal to 3 numbers. Each non strictly positive numbers would end an ongoing sequence.
This would give :
values period
1 0 None
2 8 None
3 1 None
4 0 None
5 5 1
6 6 1
7 4 1
8 7 1
9 0 None
10 2 2
11 9 2
12 1 2
13 0 None
You can try
sign = np.sign(df['values'])
m = sign.ne(sign.shift()).cumsum() # continuous same value group
df['period'] = (df[sign.eq(1)] # Exclude non-positive numbers
.groupby(m)
['values'].filter(lambda col: len(col) >= 3)
.groupby(m)
.ngroup() + 1
)
print(df)
values period
1 0 NaN
2 8 NaN
3 1 NaN
4 0 NaN
5 5 1.0
6 6 1.0
7 4 1.0
8 7 1.0
9 0 NaN
10 2 2.0
11 9 2.0
12 1 2.0
13 0 NaN
Using boolean arithmetics:
N = 3
m1 = df['values'].le(0)
m2 = df.groupby(m1.cumsum())['values'].transform('count').gt(N)
df['period'] = (m1&m2).cumsum().where((~m1)&m2)
output:
values period
1 0 NaN
2 8 NaN
3 1 NaN
4 0 NaN
5 5 1.0
6 6 1.0
7 4 1.0
8 7 1.0
9 0 NaN
10 2 2.0
11 9 2.0
12 1 2.0
13 0 NaN
intermediates:
values m1 m2 CS(m1) m1&m2 CS(m1&m2) (~m1)&m2 period
1 0 True False 1 False 0 False NaN
2 8 False False 1 False 0 False NaN
3 1 False False 1 False 0 False NaN
4 0 True True 2 True 1 False NaN
5 5 False True 2 False 1 True 1.0
6 6 False True 2 False 1 True 1.0
7 4 False True 2 False 1 True 1.0
8 7 False True 2 False 1 True 1.0
9 0 True True 3 True 2 False NaN
10 2 False True 3 False 2 True 2.0
11 9 False True 3 False 2 True 2.0
12 1 False True 3 False 2 True 2.0
13 0 True False 4 False 2 False NaN
One simple approach using find_peaks
to find the plateaus (positive consecutive integers) of at least size 3:
import numpy as np
import pandas as pd
from scipy.signal import find_peaks
df = pd.DataFrame.from_dict({'values': {0: 0, 1: 8, 2: 1, 3: 0, 4: 5, 5: 6, 6: 4, 7: 7, 8: 0, 9: 2, 10: 9, 11: 1, 12: 0}})
_, plateaus = find_peaks((df["values"] > 0).to_numpy(), plateau_size=3)
indices = np.arange(len(df["values"]))[:, None]
indices = (indices >= plateaus["left_edges"]) & (indices <= plateaus["right_edges"])
res = (indices * (np.arange(indices.shape[1]) + 1)).sum(axis=1)
df["periods"] = res
print(df)
Output
values periods
0 0 0
1 8 0
2 1 0
3 0 0
4 5 1
5 6 1
6 4 1
7 7 1
8 0 0
9 2 2
10 9 2
11 1 2
12 0 0
A simple solution:
count = 0
n_groups = 0
seq_idx = [None]*len(df)
for i in range(len(df)):
if df.iloc[i]['values'] > 0:
count += 1
else:
if count >= 3:
n_groups += 1
seq_idx[i-count: i] = [n_groups]*count
count = 0
df['period'] = seq_idx
Output:
values period
0 0 NaN
1 8 NaN
2 1 NaN
3 0 NaN
4 5 1.0
5 6 1.0
6 4 1.0
7 7 1.0
8 0 NaN
9 2 2.0
10 9 2.0
11 1 2.0
12 0 NaN
def function1(dd:pd.DataFrame):
dd.loc[:,'period']=None
if len(dd)>=4:
dd.iloc[1:,2]=dd.iloc[1:,1]
return dd
df1.assign(col1=df1.le(0).cumsum().sub(1)).groupby('col1').apply(function1)
out:
values col1 period
0 0 0 None
1 8 0 None
2 1 0 None
3 0 1 None
4 5 1 1
5 6 1 1
6 4 1 1
7 7 1 1
8 0 2 None
9 2 2 2
10 9 2 2
11 1 2 2
12 0 3 None