Isolate sequence of positive numbers in a pandas dataframe

Question:

I would like to identify what I call "periods" in data stored in a pandas dataframe.

Let's say I have these values:

   values
1    0
2    8
3    1
4    0
5    5
6    6
7    4
8    7
9    0
10   2
11   9
12   1
13   0

I would like to identify sequences of strictly positive numbers with a length of at least 3. Any number that is not strictly positive ends the ongoing sequence.

This would give:

   values  period
1    0      None
2    8      None
3    1      None
4    0      None
5    5       1
6    6       1
7    4       1
8    7       1
9    0      None
10   2       2
11   9       2
12   1       2
13   0      None
Asked By: Mrofsnart

||

Answers:

You can try

import numpy as np

sign = np.sign(df['values'])
m = sign.ne(sign.shift()).cumsum()  # label each run of consecutive values that share the same sign

df['period'] = (df[sign.eq(1)]      # Exclude non-positive numbers
                .groupby(m)
                ['values'].filter(lambda col: len(col) >= 3)
                .groupby(m)
                .ngroup() + 1
                )
print(df)

    values  period
1        0     NaN
2        8     NaN
3        1     NaN
4        0     NaN
5        5     1.0
6        6     1.0
7        4     1.0
8        7     1.0
9        0     NaN
10       2     2.0
11       9     2.0
12       1     2.0
13       0     NaN
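
To see what the grouping key looks like, here is a minimal sketch that prints the intermediate columns. It assumes the question's data with an index running from 1 to 13; the `run_length` column is only an illustration and is not part of the answer's code.

```python
import numpy as np
import pandas as pd

# Assumption: the question's data, indexed 1 to 13.
df = pd.DataFrame({"values": [0, 8, 1, 0, 5, 6, 4, 7, 0, 2, 9, 1, 0]},
                  index=range(1, 14))

sign = np.sign(df["values"])          # 1 for positive, 0 for zero, -1 for negative
m = sign.ne(sign.shift()).cumsum()    # increments every time the sign changes
# Illustrative helper: length of the run each row belongs to.
run_length = sign.groupby(m).transform("count")

print(pd.DataFrame({"values": df["values"], "sign": sign,
                    "m": m, "run_length": run_length}))
```

Rows with `sign` equal to 1 and `run_length` of at least 3 are the ones that survive the filter; `ngroup() + 1` then numbers the surviving runs in order.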
Answered By: Ynjxsjmh

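Using boolean arithmetic:

N = 3
m1 = df['values'].le(0)  # True on non-positive rows, which end a sequence
# each group starts with its non-positive row, so count > N means at least N positive values
m2 = df.groupby(m1.cumsum())['values'].transform('count').gt(N)
# number the groups that pass the filter and keep the label only on their positive rows
df['period'] = (m1&m2).cumsum().where((~m1)&m2)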

output:

    values  period
1        0     NaN
2        8     NaN
3        1     NaN
4        0     NaN
5        5     1.0
6        6     1.0
7        4     1.0
8        7     1.0
9        0     NaN
10       2     2.0
11       9     2.0
12       1     2.0
13       0     NaN

intermediates:

    values     m1     m2  CS(m1)  m1&m2  CS(m1&m2)  (~m1)&m2  period
1        0   True  False       1  False          0     False     NaN
2        8  False  False       1  False          0     False     NaN
3        1  False  False       1  False          0     False     NaN
4        0   True   True       2   True          1     False     NaN
5        5  False   True       2  False          1      True     1.0
6        6  False   True       2  False          1      True     1.0
7        4  False   True       2  False          1      True     1.0
8        7  False   True       2  False          1      True     1.0
9        0   True   True       3   True          2     False     NaN
10       2  False   True       3  False          2      True     2.0
11       9  False   True       3  False          2      True     2.0
12       1  False   True       3  False          2      True     2.0
13       0   True  False       4  False          2     False     NaN
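
The intermediates table can be rebuilt with a short sketch like the following. It assumes the question's data indexed 1 to 13; the `steps` name is only for illustration.

```python
import pandas as pd

# Assumption: the question's data, indexed 1 to 13.
df = pd.DataFrame({"values": [0, 8, 1, 0, 5, 6, 4, 7, 0, 2, 9, 1, 0]},
                  index=range(1, 14))

N = 3
m1 = df["values"].le(0)
m2 = df.groupby(m1.cumsum())["values"].transform("count").gt(N)

# Collect every intermediate step side by side (illustrative helper frame).
steps = pd.DataFrame({
    "values": df["values"],
    "m1": m1,
    "m2": m2,
    "CS(m1)": m1.cumsum(),
    "m1&m2": m1 & m2,
    "CS(m1&m2)": (m1 & m2).cumsum(),
    "(~m1)&m2": ~m1 & m2,
    "period": (m1 & m2).cumsum().where(~m1 & m2),
})
print(steps)
```

Note that the counting relies on every run being preceded by a non-positive row, which holds for the question's data; a series that starts with a positive run would probably need separate handling.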
Answered By: mozway

One simple approach uses scipy's find_peaks to find the plateaus (runs of consecutive positive values) of size at least 3:

import numpy as np
import pandas as pd

from scipy.signal import find_peaks

df = pd.DataFrame.from_dict({'values': {0: 0, 1: 8, 2: 1, 3: 0, 4: 5, 5: 6, 6: 4, 7: 7, 8: 0, 9: 2, 10: 9, 11: 1, 12: 0}})

_, plateaus = find_peaks((df["values"] > 0).to_numpy(), plateau_size=3)
indices = np.arange(len(df["values"]))[:, None]
indices = (indices >= plateaus["left_edges"]) & (indices <= plateaus["right_edges"])
res = (indices * (np.arange(indices.shape[1]) + 1)).sum(axis=1)
df["periods"] = res

print(df)

Output

    values  periods
0        0        0
1        8        0
2        1        0
3        0        0
4        5        1
5        6        1
6        4        1
7        7        1
8        0        0
9        2        2
10       9        2
11       1        2
12       0        0
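
One caveat, as far as I can tell: find_peaks only reports a plateau when it has a lower neighbour on both sides, so a run touching the first or last row may be missed. A possible workaround is to pad the boolean mask with one zero on each side. This is a sketch on hypothetical data, not part of the answer above; the loop and the `periods` array are my own illustration.

```python
import numpy as np
import pandas as pd
from scipy.signal import find_peaks

# Hypothetical data where the runs touch both ends of the series.
df = pd.DataFrame({"values": [5, 6, 4, 0, 2, 9, 1]})

positive = (df["values"] > 0).to_numpy().astype(float)
padded = np.pad(positive, 1)  # one 0 on each side, so edge runs get lower neighbours
_, plateaus = find_peaks(padded, plateau_size=3)

periods = np.zeros(len(df), dtype=int)
edges = zip(plateaus["left_edges"], plateaus["right_edges"])
for number, (left, right) in enumerate(edges, start=1):
    # Shift by 1 to undo the padding; the right edge is inclusive.
    periods[left - 1:right] = number

df["periods"] = periods
print(df)
```

As in the answer, rows outside any period get 0 rather than NaN.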
Answered By: Dani Mesejo

A simple solution:

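count = 0
n_groups = 0
seq_idx = [None]*len(df)

for i in range(len(df)):

    if df.iloc[i]['values'] > 0:
        count += 1
    else:
        if count >= 3:
            n_groups += 1
            seq_idx[i-count: i] = [n_groups]*count

        count = 0

# a run that reaches the end of the data is never closed by a non-positive value
if count >= 3:
    n_groups += 1
    seq_idx[len(df)-count:] = [n_groups]*count

df['period'] = seq_idx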

Output:

    values  period
0        0     NaN
1        8     NaN
2        1     NaN
3        0     NaN
4        5     1.0
5        6     1.0
6        4     1.0
7        7     1.0
8        0     NaN
9        2     2.0
10       9     2.0
11       1     2.0
12       0     NaN
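Answered By: bpfrd

def function1(dd: pd.DataFrame):
    # dd is one group; in the question's data each group starts with a non-positive row
    dd.loc[:, 'period'] = None
    # 1 leading non-positive row + at least 3 positive rows
    if len(dd) >= 4:
        # label the rows after the first one with the group number stored in col1
        dd.iloc[1:, 2] = dd.iloc[1:, 1]
    return dd

# df1 is the dataframe from the question
df1.assign(col1=df1.le(0).cumsum().sub(1)).groupby('col1').apply(function1)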

out:

     values  col1 period
0        0     0   None
1        8     0   None
2        1     0   None
3        0     1   None
4        5     1      1
5        6     1      1
6        4     1      1
7        7     1      1
8        0     2   None
9        2     2      2
10       9     2      2
11       1     2      2
12       0     3   None
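
The labels above reuse the group number in col1, so they only come out as 1, 2, ... when no short run sits between two long ones. If consecutive numbering matters, one option is to renumber the kept groups with pd.factorize. This is a sketch on hypothetical data and a different technique from the code above; the names `block`, `run_len` and `keep` are illustrative.

```python
import pandas as pd

# Hypothetical data: a long run, a short run (ignored), then another long run.
df1 = pd.DataFrame({"values": [0, 5, 6, 4, 0, 2, 0, 2, 9, 1, 0]})

block = df1["values"].le(0).cumsum()   # block id, bumped at every non-positive row
positive = df1["values"].gt(0)
# Number of positive rows in each block, broadcast back to every row.
run_len = positive.astype(int).groupby(block).transform("sum")
keep = positive & run_len.ge(3)

# factorize gives codes 0, 1, 2, ... in order of first appearance.
codes, _ = pd.factorize(block[keep])
df1["period"] = pd.Series(codes + 1, index=block[keep].index).reindex(df1.index)
print(df1)
```

Rows outside any kept run end up as NaN after the reindex.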
Answered By: G.G