Python pandas: How to pick out certain values by internal numbering?

Question:

I have a dataframe that looks like this:

    Answers  all_answers  Score
0       0.0            0     72
1       0.0            0     73
2       0.0            0     74
3       1.0            1      1
4      -1.0            1      2
5       1.0            1      3
6      -1.0            1      4
7       1.0            1      5
8       0.0            0      1
9       0.0            0      2
10     -1.0            1      1
11      0.0            0      1
12      0.0            0      2
13      1.0            1      1
14      0.0            0      1
15      0.0            0      2
16      1.0            1      1

The first column, Answers, is a signal that the sign has changed in the calculation flow.

The second column, all_answers, is the first column with the minus sign removed.

The third column, Score, is a running count over the second column: how many ones or zeros have occurred in a row so far.

I want to add a fourth column that shows only those ones that occur in a row a given number of times (for example, 5), while keeping the sign from the first column. All other rows should be 0.

The result should look something like this:

    Answers  all_answers  Score  New
0       0.0            0     72    0
1       0.0            0     73    0
2       0.0            0     74    0
3       1.0            1      1    1
4      -1.0            1      2   -1
5       1.0            1      3    1
6      -1.0            1      4   -1
7       1.0            1      5    1
8       0.0            0      1    0
9       0.0            0      2    0
10     -1.0            1      1    0
11      0.0            0      1    0
12      0.0            0      2    0
13      1.0            1      1    0
14      0.0            0      1    0
15      0.0            0      2    0
16      1.0            1      1    0
17      0.0            0      1    0

Is it possible to do this with pandas?

Asked By: Serega


Answers:

You can use:

# group by consecutive 0/1
g = df['all_answers'].ne(df['all_answers'].shift()).cumsum()

# get size of each group and compare to threshold
m = df.groupby(g)['all_answers'].transform('size').ge(5)

# mask small groups
df['New'] = df['Answers'].where(m, 0)

Output:

    Answers  all_answers  Score  New
0       0.0            0     72  0.0
1       0.0            0     73  0.0
2       0.0            0     74  0.0
3       1.0            1      1  1.0
4      -1.0            1      2 -1.0
5       1.0            1      3  1.0
6      -1.0            1      4 -1.0
7       1.0            1      5  1.0
8       0.0            0      1  0.0
9       0.0            0      2  0.0
10     -1.0            1      1  0.0
11      0.0            0      1  0.0
12      0.0            0      2  0.0
13      1.0            1      1  0.0
14      0.0            0      1  0.0
15      0.0            0      2  0.0
16      1.0            1      1  0.0
Answered By: mozway
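
To make the three steps of the groupby approach above easier to try out, here is a minimal self-contained sketch that rebuilds the sample data from the question. The helper name `keep_long_runs` and its `min_len` parameter are illustrative assumptions, not part of the original answer. The extra `flags.eq(1)` check is also an addition: it makes the mask ignore long runs of zeros explicitly instead of relying on Answers being 0 there.

```python
import pandas as pd

# Sample data from the question (Score is omitted because it is not needed).
df = pd.DataFrame(
    {"Answers": [0, 0, 0, 1, -1, 1, -1, 1, 0, 0, -1, 0, 0, 1, 0, 0, 1]},
    dtype=float,
)
df["all_answers"] = df["Answers"].abs().astype(int)


def keep_long_runs(df, min_len=5):
    """Hypothetical helper: keep Answers only inside runs of >= min_len ones."""
    flags = df["all_answers"]
    # A new run starts wherever the flag differs from the previous row.
    run_id = flags.ne(flags.shift()).cumsum()
    # Broadcast each run's length back onto its rows.
    run_len = flags.groupby(run_id).transform("size")
    # Assumption: only runs of ones matter, so zeros are excluded explicitly.
    keep = run_len.ge(min_len) & flags.eq(1)
    return df["Answers"].where(keep, 0)


df["New"] = keep_long_runs(df, min_len=5)
print(df)
```

Changing `min_len` adjusts the threshold. With `min_len=6` no run in the sample data qualifies, so the whole column becomes 0.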

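An alternative that may be faster (with regex). It relies on every value in all_answers being a single digit, so that positions in the string line up with rows:

import re

import pandas as pd


def repl5(m):
    # replace every character of a matched run with the marker '5'
    return '5' * len(m.group())


# concatenate the 0/1 flags into one string
s = df['all_answers'].astype(str).str.cat()

# mark runs of five or more consecutive 1s
d = re.sub(r'1{5,}', repl5, s)

# boolean mask: True where the row belongs to a marked run
d = [x == '5' for x in d]

# keep Answers inside long runs, 0 elsewhere
df['New'] = df['Answers'].where(d, 0.0)
df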

Output:

    Answers  all_answers  Score  New
0       0.0            0     72  0.0
1       0.0            0     73  0.0
2       0.0            0     74  0.0
3       1.0            1      1  1.0
4      -1.0            1      2 -1.0
5       1.0            1      3  1.0
6      -1.0            1      4 -1.0
7       1.0            1      5  1.0
8       0.0            0      1  0.0
9       0.0            0      2  0.0
10     -1.0            1      1  0.0
11      0.0            0      1  0.0
12      0.0            0      2  0.0
13      1.0            1      1  0.0
14      0.0            0      1  0.0
15      0.0            0      2  0.0
16      1.0            1      1  0.0
Answered By: Shahab Rahnama
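
The regex idea above can also be sketched without the marker-character substitution, by reading match positions directly. This is a variant under stated assumptions rather than the answer's own code: the function name `long_run_mask` and the `min_len` parameter are hypothetical, and the approach still requires each flag to be a single character (0 or 1) so that string offsets equal row positions.

```python
import re

import numpy as np
import pandas as pd


def long_run_mask(flags, min_len=5):
    """Hypothetical helper: True for rows inside >= min_len consecutive 1s."""
    # One character per row, so string offsets equal row positions.
    text = "".join(flags.astype(int).astype(str))
    mask = np.zeros(len(flags), dtype=bool)
    # Each match spans one maximal run of at least min_len ones.
    for match in re.finditer(r"1{%d,}" % min_len, text):
        mask[match.start():match.end()] = True
    return pd.Series(mask, index=flags.index)


# Sample Answers column from the question.
answers = pd.Series(
    [0, 0, 0, 1, -1, 1, -1, 1, 0, 0, -1, 0, 0, 1, 0, 0, 1], dtype=float
)
flags = answers.abs().astype(int)

# Keep the signed value inside long runs, 0 elsewhere.
new = answers.where(long_run_mask(flags), 0.0)
print(new.tolist())
```

Whether this is faster than the groupby version depends on the data size, so it is worth timing both on real data before choosing.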