Count consecutive boolean values in Python/pandas array for whole subset
Question:
I am looking for a way to group a pandas DataFrame by runs of consecutive equal values and apply an aggregation such as count or max to each run.
For example, if I have one column in df:
my_column
0 0
1 0
2 1
3 1
4 1
5 0
6 0
7 0
8 0
9 1
10 1
11 0
the result needs to be:
result
0 2
1 2
2 3
3 3
4 3
5 4
6 4
7 4
8 4
9 2
10 2
11 1
Why: we have two 0s at the beginning, then three 1s, …
What I need is similar to this answer, but every element in a group should get the same value.
The preferred answer would show how to group the consecutive equal elements and apply an aggregation function to each group, so that I could also compute, for example, the max value:
my_column other_value
0 0 7
1 0 4
2 1 1
3 1 0
4 1 5
5 0 1
6 0 1
7 0 2
8 0 8
9 1 1
10 1 0
11 0 2
and the result would be
result
0 7
1 7
2 5
3 5
4 5
5 8
6 8
7 8
8 8
9 1
10 1
11 2
Answers:
You can use:
g = df["my_column"].ne(df["my_column"].shift()).cumsum()
out = df.groupby(g)["my_column"].transform("count")
Output:
print(out)
my_column
0 2
1 2
2 3
3 3
4 3
5 4
6 4
7 4
8 4
9 2
10 2
11 1
NB: to get the max, use df.groupby(g)["other_value"].transform("max").
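For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the question's sample data and applies both aggregations. The variable names run_length and run_max are only illustrative and do not come from the answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "my_column":   [0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0],
    "other_value": [7, 4, 1, 0, 5, 1, 1, 2, 8, 1, 0, 2],
})

# A new run starts wherever a value differs from the one before it;
# the cumulative sum of those "start" flags labels each run.
g = df["my_column"].ne(df["my_column"].shift()).cumsum()

# transform() returns one value per original row, so every row of a run
# receives that run's aggregate.
run_length = df.groupby(g)["my_column"].transform("count")
run_max = df.groupby(g)["other_value"].transform("max")

print(run_length.tolist())  # [2, 2, 3, 3, 3, 4, 4, 4, 4, 2, 2, 1]
print(run_max.tolist())     # [7, 7, 5, 5, 5, 8, 8, 8, 8, 1, 1, 2]
```

Because transform keeps the original index, either result can be assigned straight back as a new column, e.g. df["result"] = run_max.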
If you check the linked answer, it shows exactly how to build groups from consecutive values:
(y != y.shift()).cumsum()
So if you create consecutive groups from the column my_column, the output is:
g = df["my_column"].ne(df["my_column"].shift()).cumsum()
print(g)
0 1
1 1
2 2
3 2
4 2
5 3
6 3
7 3
8 3
9 4
10 4
11 5
Name: my_column, dtype: int32
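It may help to see how that one-liner builds the labels. The sketch below splits it into its three steps on the question's column; the names prev, starts and steps are illustrative only.

```python
import pandas as pd

s = pd.Series([0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0], name="my_column")

prev = s.shift()     # previous row's value; NaN for the first row
starts = s.ne(prev)  # True where a new run begins (first row: 0 != NaN)
g = starts.cumsum()  # running count of run starts, used as the run label

# Side-by-side view of the intermediate steps.
steps = pd.DataFrame({"value": s, "prev": prev, "starts": starts, "g": g})
print(steps)
```

The first row always counts as a run start because it is compared with NaN, which is why the labels begin at 1 rather than 0.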
Then you can use GroupBy.transform with Series.to_frame to get a one-column DataFrame:
df1 = df.groupby(g)['my_column'].transform('size').to_frame()
print(df1)
my_column
0 2
1 2
2 3
3 3
4 3
5 4
6 4
7 4
8 4
9 2
10 2
11 1
Or use Series.map with Series.value_counts:
df1 = g.map(g.value_counts()).to_frame()
print(df1)
my_column
0 2
1 2
2 3
3 3
4 3
5 4
6 4
7 4
8 4
9 2
10 2
11 1
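The first answer uses "count" while this one uses "size" and value_counts; on the question's data all three agree. A small sketch of where they can differ, assuming my_column may contain missing values (the NaN rows below are made up for illustration and are not part of the question):

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values; not from the question.
s = pd.Series([0, 0, np.nan, np.nan, 1], name="my_column")

# NaN != NaN is True, so every NaN starts a run of its own.
g = s.ne(s.shift()).cumsum()

by_count = s.groupby(g).transform("count")  # non-missing values per run
by_size = s.groupby(g).transform("size")    # rows per run
by_map = g.map(g.value_counts())            # rows per run, from the key alone

print(pd.DataFrame({"g": g, "count": by_count, "size": by_size, "map": by_map}))
```

So "count" reports 0 for the NaN rows, while "size" and the value_counts mapping report 1. If the column has no missing values, the choice does not matter.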
The second case (max) works the same way:
g = df["my_column"].ne(df["my_column"].shift()).cumsum()
df1 = df.groupby(g)['other_value'].transform('max').to_frame(name='result')
print(df1)
result
0 7
1 7
2 5
3 5
4 5
5 8
6 8
7 8
8 8
9 1
10 1
11 2
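Since the question asks for a way to apply an arbitrary aggregation, the pattern can be wrapped in a small helper. This is only a sketch: agg_over_runs and its parameter names are hypothetical, not a pandas API, and the key is renamed to run_id simply to keep it distinct from the column names.

```python
import pandas as pd

def agg_over_runs(df, key_col, value_col, func):
    """Aggregate value_col over each run of consecutive equal values in
    key_col and broadcast the result back to every row of the run."""
    run_id = df[key_col].ne(df[key_col].shift()).cumsum().rename("run_id")
    return df.groupby(run_id)[value_col].transform(func)

# Sample data copied from the question.
df = pd.DataFrame({
    "my_column":   [0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0],
    "other_value": [7, 4, 1, 0, 5, 1, 1, 2, 8, 1, 0, 2],
})

print(agg_over_runs(df, "my_column", "my_column", "count").tolist())
# [2, 2, 3, 3, 3, 4, 4, 4, 4, 2, 2, 1]
print(agg_over_runs(df, "my_column", "other_value", "max").tolist())
# [7, 7, 5, 5, 5, 8, 8, 8, 8, 1, 1, 2]
print(agg_over_runs(df, "my_column", "other_value", "min").tolist())
# [4, 4, 0, 0, 0, 1, 1, 1, 1, 0, 0, 2]
```

Any aggregation name that GroupBy.transform accepts, such as "sum" or "mean", can be passed as func.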