Calculating length of sequence of zeros in Pandas
Question:
I have a table like this
Unit | status | date |
---|---|---|
One | 1 | 1 |
One | 1 | 2 |
One | 1 | 3 |
One | 0 | 4 |
One | 0 | 5 |
One | 1 | 6 |
One | 1 | 7 |
and I want to create a new column holding the length of each run of consecutive zeros in the status column. For this example, the output would be:
Unit | status | date | gap |
---|---|---|---|
One | 1 | 1 | 0 |
One | 1 | 2 | 0 |
One | 1 | 3 | 0 |
One | 0 | 4 | 2 |
One | 0 | 5 | 2 |
One | 1 | 6 | 0 |
One | 1 | 7 | 0 |
This has to be done for every unit in the DataFrame. I based my attempt on this question, but I’m stuck on the part where I set the total size on all the rows that are part of the gap.
Answers:
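The usual way to group consecutive blocks of one value is to take the cumsum of the other values: the running sum of status only changes on rows where status is 1, so it stays constant across each run of zeros. Given that your data is sorted by Unit: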
df['gap'] = (df.groupby(['Unit', 'status', df['status'].cumsum()])
               ['status'].transform('size')
               .where(df['status'].eq(0), other=0)
            )
Output:
  Unit  status  date  gap
0  One       1     1    0
1  One       1     2    0
2  One       1     3    0
3  One       0     4    2
4  One       0     5    2
5  One       1     6    0
6  One       1     7    0
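
To see why Unit belongs in the grouping keys, here is a minimal sketch on made-up data. The units Two and Three and the Series name block are assumptions added for illustration; they are not part of the question. Unit Two ends with a run of zeros and Unit Three starts with one, so both runs share the same cumsum value and only the Unit key keeps them apart.

```python
import pandas as pd

# Hypothetical sample data: three units, each already sorted by date.
df = pd.DataFrame({
    "Unit":   ["One"] * 7 + ["Two"] * 4 + ["Three"] * 3,
    "status": [1, 1, 1, 0, 0, 1, 1,   1, 0, 0, 0,   0, 0, 1],
    "date":   [1, 2, 3, 4, 5, 6, 7,   1, 2, 3, 4,   1, 2, 3],
})

# The running sum of status only increases on rows where status is 1,
# so it stays constant across each consecutive run of zeros.
block = df["status"].cumsum().rename("block")

# Size of every (Unit, status, block) group, broadcast back onto its rows.
sizes = df.groupby(["Unit", "status", block])["status"].transform("size")

# Keep the size only on the zero rows; every other row gets 0.
df["gap"] = sizes.where(df["status"].eq(0), other=0)

print(df)
# gap column: 0 0 0 2 2 0 0 | 0 3 3 3 | 2 2 0
```

Without Unit in the keys, the trailing zeros of Two and the leading zeros of Three would fall into one group and each of those rows would report 5.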
Another approach could be to use run-length encoding via the python-rle package:
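import rle

# encode returns two parallel lists: the run values and the run lengths
values, counts = rle.encode(df.status)
# swap each run's value for its length if it is a run of zeros, else 0, then expand
df['gap'] = rle.decode([c if v == 0 else 0 for v, c in zip(values, counts)], counts)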
Output:
  Unit  status  date  gap
0  One       1     1    0
1  One       1     2    0
2  One       1     3    0
3  One       0     4    2
4  One       0     5    2
5  One       1     6    0
6  One       1     7    0
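
Note that the snippet above encodes the whole status column as one sequence, so a run of zeros that spans the boundary between two units would be counted as a single gap. If you would rather not install an extra package, the same run-length idea can be sketched with itertools.groupby from the standard library. The function name zero_run_lengths is a hypothetical helper, not something from python-rle or pandas.

```python
from itertools import groupby


def zero_run_lengths(values):
    """For each element, return the length of the run of zeros it belongs to,
    or 0 if the element is not zero."""
    out = []
    # groupby yields one (value, iterator) pair per run of equal neighbours.
    for value, run in groupby(values):
        n = sum(1 for _ in run)  # length of this run
        out.extend([n if value == 0 else 0] * n)
    return out


status = [1, 1, 1, 0, 0, 1, 1]
print(zero_run_lengths(status))  # [0, 0, 0, 2, 2, 0, 0]
```

To respect unit boundaries, one option is to apply it per group, for example df.groupby('Unit')['status'].transform(lambda s: pd.Series(zero_run_lengths(s), index=s.index)), assuming pandas is imported as pd.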