How To Remove Specific Rows With Consecutive Values
Question:
I have a Pandas dataframe, df_next, that is a monthly aggregation of crime type incidents for specific jurisdictions. For example, something like:
ID | Year_Month | Total |
---|---|---|
AL0010000 | 1991-01 | 2024 |
AL0010000 | 1991-02 | 3017 |
… | … | … |
WV0550300 | 2018-11 | 30147 |
WV0550300 | 2018-12 | 32148 |
I want to reduce the size of my dataframe by removing rows that are part of 4 months of consecutive 0 values in the ‘Total’ column. In other words, if an ID has reported 0 total crime for four consecutive months, I want to remove that chunk of 4 months. I want to do this for all IDs.
I’ve tried:
# Define a window size of 4
window_size = 4
# Apply a rolling window to the Total column for each ID
df_next['Total_rolling'] = df_next.groupby('ID')['Total'].rolling(window=window_size).reset_index(0, drop=True)
df_next['Remove'] = ((df_next['Total_rolling'].shift(window_size - 1) == 0) & (df_next['Total_rolling'] == 0))
# Filter out the rows where there are four consecutive 0's in the Total value for each ID
df_filtered = df_next[~df_next['Remove']]
However, when I check df_filtered, I still have multiple examples of IDs with four consecutive months of 0 crime totals. Any help would be greatly appreciated.
Answers:
Annotated Code
# `df` is the dataframe from the question (df_next)
# Is Total zero?
m = df['Total'] == 0
# Create a counter that identifies the different
# blocks of consecutive zeros
b = (~m).cumsum()
# Group the rows where Total is zero by `ID` and the blocks above,
# and transform with size to get the length of each run of consecutive zeros
s = df[m].groupby(['ID', b]).transform('size')
# Drop the rows from the original dataframe that belong to
# a run of 4 or more consecutive zeros
df = df.drop(s.index[s >= 4])
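As a minimal, self-contained sketch of the block-counter idea above: the toy data and the names `streak_len` and `result` are assumptions for illustration, not part of the original answer. It is meant to show that a run of zeros spanning two IDs is not counted as one streak, because `ID` is part of the group key.

```python
import pandas as pd

# Hypothetical toy data: ID "A" ends with two zeros and ID "B" starts with two zeros.
df = pd.DataFrame({
    "ID": ["A"] * 7 + ["B"] * 4,
    "Year_Month": [f"2020-{i:02d}" for i in range(1, 8)]
                  + [f"2020-{i:02d}" for i in range(1, 5)],
    "Total": [0, 0, 0, 0, 5, 0, 0, 0, 0, 3, 0],
})

# True where Total is zero
m = df["Total"].eq(0)
# Counter that only increases on non-zero rows, so every run of zeros shares one value
b = (~m).cumsum().rename("block")

# Length of the zero run each zero row belongs to, computed per ID and block
streak_len = (
    df.loc[m, "Total"]
    .groupby([df.loc[m, "ID"], b[m]])
    .transform("size")
)

# Drop every row that sits inside a run of 4 or more zeros
result = df.drop(streak_len.index[streak_len >= 4])
print(result)
```

Only the first four rows of "A" are dropped here. The two trailing zeros of "A" and the two leading zeros of "B" are adjacent in the frame but belong to different groups, so each counts as a run of length 2.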
UPDATED:
Here is an updated answer based on OP’s comment indicating that all contiguous zeros in a streak of 4 or more should be dropped (not just the rows beginning when the streak length reaches 4):
d = df.sort_values(['ID','Year_Month']).pipe(
    lambda d: d.assign(isPos=d.Total.gt(0)))[['ID','isPos']]
d['cumPos'] = d.groupby('ID').isPos.cumsum()
longZeroStreak = ( d[~d.isPos].groupby(['ID','cumPos']).isPos
    .transform(lambda s: len(s) >= 4)
    .reindex(index=df.index, fill_value=False) )
res = df[~longZeroStreak]
Explanation:
- Sort the original dataframe by ID, Year_Month, create an isPos column with a boolean indicating whether Total is positive, and create a cumPos column containing the cumsum of isPos for each ID group.
- Group contiguous zeros by ID and number of preceding non-zeros, and create a boolean Series longZeroStreak indicating whether the length of each such zero streak is >= 4.
- Set the result to be the rows that aren’t part of a streak of 4 or more zeros.
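The steps above can be sketched as a small reusable function. This is an illustrative variant rather than the answer's exact code: the function name `drop_long_zero_streaks`, the `min_len` parameter and the snake_case column names are assumptions, and it assumes the dataframe has a unique index. The data mirrors the sample input listed further down this page.

```python
import pandas as pd

def drop_long_zero_streaks(df, min_len=4):
    """Drop every row that belongs to a run of at least `min_len` zeros per ID."""
    # Work on a sorted copy so that "consecutive" means consecutive months
    d = df.sort_values(["ID", "Year_Month"]).assign(
        is_pos=lambda x: x["Total"].gt(0)
    )
    # Number of positive rows seen so far within each ID;
    # all zeros between two positive rows share the same value
    d["cum_pos"] = d.groupby("ID")["is_pos"].cumsum()
    zeros = d[~d["is_pos"]]
    # Length of the zero run each zero row belongs to
    streak_len = zeros.groupby(["ID", "cum_pos"])["Total"].transform("size")
    # Back to the original index; non-zero rows are never part of a streak
    long_streak = streak_len.ge(min_len).reindex(df.index, fill_value=False)
    return df[~long_streak]

totals = [0, 0, 0, 0, 123, 0, 0, 0, 0, 0, 0, 456]
df = pd.DataFrame({
    "ID": ["AL0010000"] * 12 + ["WV0550300"] * 12,
    "Year_Month": [f"2020-{m:02d}" for m in range(1, 13)]
                  + [f"2021-{m:02d}" for m in range(1, 13)],
    "Total": totals * 2,
})

res = drop_long_zero_streaks(df)
print(res)
```

With this data every zero belongs to a streak of 4 or 6, so only the non-zero rows (index 4, 11, 16 and 23) remain.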
ORIGINAL ANSWER:
This will do what I think your question asks:
d = df.sort_values(['ID','Year_Month']).pipe(
    lambda d: d.assign(isZero=d.Total.eq(0)))[['ID','isZero']]
d['cumZeros'] = d.groupby('ID').isZero.cumsum()
# where() keeps the group's length and leaves NaN on rows that do not mark a break
d['cumZerosAtLastBreak'] = ( d.groupby('ID').cumZeros
    .transform(lambda s: s.where(s.eq(s.shift(1)))) )
# astype(int) replaces fillna's downcast argument, which newer pandas versions deprecate
d['cumZerosAtLastBreak'] = ( d.groupby('ID').cumZerosAtLastBreak
    .transform(lambda s: s.ffill().fillna(0).astype(int)) )
res = df.loc[d.cumZeros - d.cumZerosAtLastBreak < 4,:]
Explanation:
- Sort the original dataframe by ID, Year_Month, add an isZero column with a boolean indicating whether Total is 0, and drop all columns but ID, isZero.
- Add a cumZeros column with the cumsum of isZero for each ID group.
- Add a cumZerosAtLastBreak column which copies values from cumZeros for rows with cumZeros == cumZeros.shift(1), and otherwise is NaN, for each ID group (this gives us the cumulative number of zeros, but only for rows marking the break of a zero streak).
- Update the cumZerosAtLastBreak column by using ffill and fillna(0) (and downcast to int, just to be logically consistent) for each ID group.
- Filter the original dataframe to keep only rows with cumZeros - cumZerosAtLastBreak < 4 (namely, rows with a zero streak length below 4).
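To make the intermediate columns concrete, here is a small worked sketch on a single hypothetical ID. The toy data and the extra `streakPos` column are assumptions added for illustration, and the code uses grouped `shift`/`ffill` calls instead of the `transform` lambdas, which should be equivalent for this purpose.

```python
import pandas as pd

# Hypothetical toy data: five zeros, one non-zero month, then one more zero
df = pd.DataFrame({
    "ID": ["X"] * 7,
    "Year_Month": [f"2020-{m:02d}" for m in range(1, 8)],
    "Total": [0, 0, 0, 0, 0, 9, 0],
})

d = df.sort_values(["ID", "Year_Month"]).assign(isZero=lambda x: x["Total"].eq(0))
# Running count of zeros within each ID
d["cumZeros"] = d.groupby("ID")["isZero"].cumsum()
# A row breaks a streak when the running count did not change from the previous row
prev = d.groupby("ID")["cumZeros"].shift(1)
d["cumZerosAtLastBreak"] = d["cumZeros"].where(d["cumZeros"].eq(prev))
# Carry the count at the last break forward; rows before any break get 0
d["cumZerosAtLastBreak"] = (
    d.groupby("ID")["cumZerosAtLastBreak"].ffill().fillna(0).astype(int)
)
# Position of each row inside its current zero streak (0 for non-zero rows)
d["streakPos"] = d["cumZeros"] - d["cumZerosAtLastBreak"]

res = df.loc[d["streakPos"] < 4]
print(d[["Total", "cumZeros", "cumZerosAtLastBreak", "streakPos"]])
print(res)
```

The first three zeros of the five-zero streak survive because their position in the streak is below 4; only the fourth and fifth are removed. This is the behaviour that the updated answer changes.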
Sample input:
ID Year_Month Total
0 AL0010000 2020-01 0
1 AL0010000 2020-02 0
2 AL0010000 2020-03 0
3 AL0010000 2020-04 0
4 AL0010000 2020-05 123
5 AL0010000 2020-06 0
6 AL0010000 2020-07 0
7 AL0010000 2020-08 0
8 AL0010000 2020-09 0
9 AL0010000 2020-10 0
10 AL0010000 2020-11 0
11 AL0010000 2020-12 456
12 WV0550300 2021-01 0
13 WV0550300 2021-02 0
14 WV0550300 2021-03 0
15 WV0550300 2021-04 0
16 WV0550300 2021-05 123
17 WV0550300 2021-06 0
18 WV0550300 2021-07 0
19 WV0550300 2021-08 0
20 WV0550300 2021-09 0
21 WV0550300 2021-10 0
22 WV0550300 2021-11 0
23 WV0550300 2021-12 456
Output:
ID Year_Month Total
0 AL0010000 2020-01 0
1 AL0010000 2020-02 0
2 AL0010000 2020-03 0
4 AL0010000 2020-05 123
5 AL0010000 2020-06 0
6 AL0010000 2020-07 0
7 AL0010000 2020-08 0
11 AL0010000 2020-12 456
12 WV0550300 2021-01 0
13 WV0550300 2021-02 0
14 WV0550300 2021-03 0
16 WV0550300 2021-05 123
17 WV0550300 2021-06 0
18 WV0550300 2021-07 0
19 WV0550300 2021-08 0
23 WV0550300 2021-12 456