How To Remove Specific Rows With Consecutive Values

Question

I have a Pandas dataframe, df_next, that is a monthly aggregation of crime type incidents for specific jurisdictions. For example, something like:

ID	Year_Month	Total
AL0010000	1991-01	2024
AL0010000	1991-02	3017
…	…	…
WV0550300	2018-11	30147
WV0550300	2018-12	32148

I want to reduce the size of my dataframe by removing rows that are part of a run of 4 consecutive months with a 0 value in the ‘Total’ column. In other words, if an ID has reported 0 total crime for four consecutive months, I want to remove that chunk of 4 months. I want to do this for all IDs.

I’ve tried:

# Define a window size of 4
window_size = 4

# Apply a rolling window to the Total column for each ID
df_next['Total_rolling'] = df_next.groupby('ID')['Total'].rolling(window=window_size).sum().reset_index(0, drop=True)

df_next['Remove'] = ((df_next['Total_rolling'].shift(window_size - 1) == 0) & (df_next['Total_rolling'] == 0))

# Filter out the rows where there are four consecutive 0's in the Total value for each ID
df_filtered = df_next[~df_next['Remove']]

However, when I check df_filtered, I still have multiple examples of IDs with four consecutive months of 0 crime totals. Any help would be greatly appreciated.

Asked By: TheMaffGuy

Answer 1

Annotated Code

# is total zero?
m = df['Total'] == 0

# create a counter to identify different
# blocks of consecutive zeros
b = (~m).cumsum()

# Group the rows where total is zero by `ID` and above blocks
# and transform with size to calculate the number of consecutive zeros
s = df[m].groupby(['ID', b]).transform('size')

# Drop the rows from the original dataframe where
# there are 4 or more consecutive zeros
df = df.drop(s.index[s >= 4])
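
For anyone who wants to run this end to end, here is a minimal sketch of the same idea on made-up data. The IDs "A" and "B" and their totals are hypothetical, not taken from the question. It sorts first, since the block counter only makes sense when each ID's rows are in month order, and it selects the Total column before calling transform so that the result is a Series.

```python
import pandas as pd

# Hypothetical sample: "A" has a run of 4 zeros and ends with 2 zeros,
# "B" starts with 2 zeros. The trailing/leading zeros must not be merged.
df = pd.DataFrame({
    "ID": ["A"] * 7 + ["B"] * 3,
    "Year_Month": [f"2020-{i:02d}" for i in range(1, 8)]
                  + [f"2020-{i:02d}" for i in range(1, 4)],
    "Total": [0, 0, 0, 0, 5, 0, 0, 0, 0, 3],
})

# Assumption: rows must be ordered by ID and month for the counter to work
df = df.sort_values(["ID", "Year_Month"])

m = df["Total"] == 0      # True where Total is zero
b = (~m).cumsum()         # stays constant across each run of zeros

# Length of the zero run that each zero row belongs to, per ID
run_len = df.loc[m, "Total"].groupby([df.loc[m, "ID"], b[m]]).transform("size")

# Drop every row that sits in a run of 4 or more zeros
result = df.drop(run_len.index[run_len >= 4])
print(result)
```

Grouping by ID as well as the counter is what keeps the two zeros at the end of "A" from being joined with the two zeros at the start of "B".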

Answered By: Shubham Sharma

Answer 2

UPDATED:

Here is an updated answer based on OP’s comment indicating that all contiguous zeros in a streak of 4 or more should be dropped (not just the rows beginning when the streak length reaches 4):

d = df.sort_values(['ID','Year_Month']).pipe(
    lambda d:d.assign(isPos=d.Total.gt(0)))[['ID','isPos']]
d['cumPos'] = d.groupby('ID').isPos.cumsum()
longZeroStreak = ( d[~d.isPos].groupby(['ID','cumPos']).isPos
    .transform(lambda s: len(s) >= 4)
    .reindex(index=df.index, fill_value=False) )
res = df[~longZeroStreak]

Explanation:

Sort the original dataframe by ID and Year_Month, add an isPos column indicating whether Total is positive, and add a cumPos column containing the cumulative sum of isPos within each ID group.
Group the non-positive rows by ID and cumPos (the number of preceding positive rows, which is constant within a streak), and build a boolean Series longZeroStreak marking rows whose streak length is >= 4; reindexing against df.index fills the positive rows with False.
Set the result to the rows that are not part of a streak of 4 or more zeros.
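
As a rough, self-contained sketch of the same behaviour, the function below labels runs with a shift-and-cumsum comparison instead of the cumPos grouping used above, so it is a swapped-in technique rather than the code from this answer. The function name drop_long_zero_streaks and the min_len parameter are made up for illustration. It assumes the dataframe has a unique index, and it tests for Total == 0 rather than Total <= 0.

```python
import pandas as pd

def drop_long_zero_streaks(df, min_len=4):
    """Drop every row that belongs to a run of `min_len` or more zero Totals per ID."""
    d = df.sort_values(["ID", "Year_Month"])
    is_zero = d["Total"].eq(0)
    # A new run starts whenever the zero/non-zero state flips or the ID changes
    new_run = is_zero.ne(is_zero.shift()) | d["ID"].ne(d["ID"].shift())
    run_id = new_run.cumsum()
    # Length of the run each row belongs to
    run_len = is_zero.groupby(run_id).transform("size")
    # Rows that are zero and sit in a long enough run, in the original row order
    long_zero = (is_zero & run_len.ge(min_len)).reindex(df.index)
    return df[~long_zero]

# Same shape as the sample input in this answer: two IDs, 12 months each
pattern = [0, 0, 0, 0, 123, 0, 0, 0, 0, 0, 0, 456]
df = pd.DataFrame({
    "ID": ["AL0010000"] * 12 + ["WV0550300"] * 12,
    "Year_Month": [f"2020-{m:02d}" for m in range(1, 13)]
                  + [f"2021-{m:02d}" for m in range(1, 13)],
    "Total": pattern * 2,
})

res = drop_long_zero_streaks(df)
print(res)
```

On this data only the four non-zero rows survive, because every zero belongs to a streak of length 4 or 6.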

ORIGINAL ANSWER:

This will do what I think your question asks:

d = df.sort_values(['ID','Year_Month']).pipe(
    lambda d:d.assign(isZero=d.Total.eq(0)))[['ID','isZero']]
d['cumZeros'] = d.groupby('ID').isZero.cumsum()
d['cumZerosAtLastBreak'] = ( d.groupby('ID').cumZeros
    .transform(lambda s: s[s.eq(s.shift(1))]) )
d['cumZerosAtLastBreak'] = ( d.groupby('ID').cumZerosAtLastBreak
    .transform(lambda s:s.ffill().fillna(0, downcast='infer')) )
res = df.loc[d.cumZeros - d.cumZerosAtLastBreak < 4,:]

Explanation:

Sort the original dataframe by ID and Year_Month, add an isZero column indicating whether Total is 0, and drop all columns but ID and isZero.
Add a cumZeros column with the cumulative sum of isZero within each ID group.
Add a cumZerosAtLastBreak column which, within each ID group, copies the value from cumZeros for rows where cumZeros == cumZeros.shift(1) and is NaN otherwise (this gives the cumulative number of zeros, but only on rows that break a zero streak).
Update cumZerosAtLastBreak within each ID group using ffill and fillna(0) (downcasting to int, just to be logically consistent).
Filter the original dataframe to keep only rows with cumZeros - cumZerosAtLastBreak < 4 (namely, rows whose position within a zero streak is below 4).
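
The steps above can also be sketched with a per-streak running count, which is a different technique from the cumZerosAtLastBreak bookkeeping and avoids the downcast argument of fillna, which recent pandas releases mark as deprecated. The function name drop_from_fourth_zero and the limit parameter are hypothetical, and the sketch assumes a unique index. The data reproduces the answer's sample input.

```python
import pandas as pd

def drop_from_fourth_zero(df, limit=4):
    """Keep the first `limit - 1` zeros of each streak and drop the rest of that streak."""
    d = df.sort_values(["ID", "Year_Month"])
    is_zero = d["Total"].eq(0)
    # Per-ID counter that increases at every non-zero row,
    # so all rows of one zero streak share the same value
    block = (~is_zero).astype(int).groupby(d["ID"]).cumsum()
    # 1-based position of each zero inside its streak (0 for non-zero rows)
    streak_pos = is_zero.astype(int).groupby([d["ID"], block]).cumsum()
    keep = streak_pos.lt(limit).reindex(df.index)
    return df[keep]

pattern = [0, 0, 0, 0, 123, 0, 0, 0, 0, 0, 0, 456]
df = pd.DataFrame({
    "ID": ["AL0010000"] * 12 + ["WV0550300"] * 12,
    "Year_Month": [f"2020-{m:02d}" for m in range(1, 13)]
                  + [f"2021-{m:02d}" for m in range(1, 13)],
    "Total": pattern * 2,
})

res = drop_from_fourth_zero(df)
print(res)
```

Unlike the updated answer, this keeps the first three zeros of every streak, which is why rows such as 0, 1, 2 and 5, 6, 7 remain.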

Sample input:

           ID Year_Month  Total
0   AL0010000    2020-01      0
1   AL0010000    2020-02      0
2   AL0010000    2020-03      0
3   AL0010000    2020-04      0
4   AL0010000    2020-05    123
5   AL0010000    2020-06      0
6   AL0010000    2020-07      0
7   AL0010000    2020-08      0
8   AL0010000    2020-09      0
9   AL0010000    2020-10      0
10  AL0010000    2020-11      0
11  AL0010000    2020-12    456
12  WV0550300    2021-01      0
13  WV0550300    2021-02      0
14  WV0550300    2021-03      0
15  WV0550300    2021-04      0
16  WV0550300    2021-05    123
17  WV0550300    2021-06      0
18  WV0550300    2021-07      0
19  WV0550300    2021-08      0
20  WV0550300    2021-09      0
21  WV0550300    2021-10      0
22  WV0550300    2021-11      0
23  WV0550300    2021-12    456

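Output (rows from the fourth consecutive zero onward are dropped; the first three zeros of each streak remain):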

           ID Year_Month  Total
0   AL0010000    2020-01      0
1   AL0010000    2020-02      0
2   AL0010000    2020-03      0
4   AL0010000    2020-05    123
5   AL0010000    2020-06      0
6   AL0010000    2020-07      0
7   AL0010000    2020-08      0
11  AL0010000    2020-12    456
12  WV0550300    2021-01      0
13  WV0550300    2021-02      0
14  WV0550300    2021-03      0
16  WV0550300    2021-05    123
17  WV0550300    2021-06      0
18  WV0550300    2021-07      0
19  WV0550300    2021-08      0
23  WV0550300    2021-12    456

Answered By: constantstranger
