How To Remove Specific Rows With Consecutive Values

Question:

I have a Pandas dataframe, df_next, that is a monthly aggregation of crime type incidents for specific jurisdictions. For example, something like:

ID Year_Month Total
AL0010000 1991-01 2024
AL0010000 1991-02 3017
WV0550300 2018-11 30147
WV0550300 2018-12 32148

I want to reduce the size of my dataframe by removing rows that are part of 4 months of consecutive 0 values in the ‘Total’ column. In other words, if an ID has reported 0 total crime for four consecutive months, I want to remove that chunk of 4 months. I want to do this for all IDs.

I’ve tried:

# Define a window size of 4
window_size = 4

# Apply a rolling window to the Total column for each ID
df_next['Total_rolling'] = df_next.groupby('ID')['Total'].rolling(window=window_size).sum().reset_index(0, drop=True)

df_next['Remove'] = ((df_next['Total_rolling'].shift(window_size - 1) == 0) & (df_next['Total_rolling'] == 0))

# Filter out the rows where there are four consecutive 0's in the Total value for each ID
df_filtered = df_next[~df_next['Remove']]

However, when I check df_filtered, I still have multiple examples of IDs with four consecutive months of 0 crime totals. Any help would be greatly appreciated.

Asked By: TheMaffGuy


Answers:

Annotated Code

# is total zero?
m = df['Total'] == 0

# create a counter to identify different
# blocks of consecutive zeros
b = (~m).cumsum()

# Group the rows where total is zero by `ID` and above blocks
# and transform with size to calculate the number of consecutive zeros
s = df[m].groupby(['ID', b]).transform('size')

# Drop the rows from the original dataframe where
# there are 4 or more consecutive zeros
df = df.drop(s.index[s >= 4])
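
The steps above can be tried end to end on a small frame. This is only a sketch: the sample data is made up for illustration, the explicit sort is an assumption (the block counter relies on each ID's rows being contiguous and in date order), and the `block` name and the `['Total']` column selection are additions for clarity rather than part of the answer.

```python
import pandas as pd

# Hypothetical sample data: ID "A" has a run of four zeros,
# ID "B" has a run of only three.
df = pd.DataFrame({
    "ID": ["A"] * 6 + ["B"] * 5,
    "Year_Month": ["2020-01", "2020-02", "2020-03", "2020-04", "2020-05", "2020-06",
                   "2020-01", "2020-02", "2020-03", "2020-04", "2020-05"],
    "Total": [5, 0, 0, 0, 0, 7,
              0, 0, 0, 3, 0],
})

# Assumption: rows must be ordered by ID and month for the counter to be valid.
df = df.sort_values(["ID", "Year_Month"]).reset_index(drop=True)

m = df["Total"] == 0                    # True where Total is zero
b = (~m).cumsum().rename("block")       # constant within each run of zeros

# For every zero row, the length of the zero run it belongs to (per ID).
s = df[m].groupby(["ID", b[m]])["Total"].transform("size")

# Drop all rows that sit in a run of four or more zeros.
filtered = df.drop(s.index[s >= 4])
print(filtered)
```

Grouping by `ID` as well as the block counter keeps a run of zeros at the end of one ID from merging with zeros at the start of the next.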
Answered By: Shubham Sharma

UPDATED:

Here is an updated answer based on OP’s comment indicating that all contiguous zeros in a streak of 4 or more should be dropped (not just the rows beginning when the streak length reaches 4):

d = df.sort_values(['ID','Year_Month']).pipe(
    lambda d:d.assign(isPos=d.Total.gt(0)))[['ID','isPos']]
d['cumPos'] = d.groupby('ID').isPos.cumsum()
longZeroStreak = ( d[~d.isPos].groupby(['ID','cumPos']).isPos
    .transform(lambda s: len(s) >= 4)
    .reindex(index=df.index, fill_value=False) )
res = df[~longZeroStreak]

Explanation:

  • Sort original dataframe by ID, Year_Month, create an isPos column with boolean indicating whether Total is positive, and create a cumPos column containing the cumsum of isPos for each ID group
  • Group contiguous zeros by ID and number of preceding non-zeros and create a boolean Series longZeroStreak indicating whether the length of each such zero streak is >= 4
  • Set the result to be the rows that aren’t part of a streak of 4 or more zeros.

ORIGINAL ANSWER:

This will do what I think your question asks:

d = df.sort_values(['ID','Year_Month']).pipe(
    lambda d:d.assign(isZero=d.Total.eq(0)))[['ID','isZero']]
d['cumZeros'] = d.groupby('ID').isZero.cumsum()
d['cumZerosAtLastBreak'] = ( d.groupby('ID').cumZeros
    .transform(lambda s: s[s.eq(s.shift(1))]) )
d['cumZerosAtLastBreak'] = ( d.groupby('ID').cumZerosAtLastBreak
    .transform(lambda s:s.ffill().fillna(0).astype(int)) )
res = df.loc[d.cumZeros - d.cumZerosAtLastBreak < 4,:]

Explanation:

  • Sort original dataframe by ID, Year_Month, add isZero column with boolean indicating whether Total is 0, and drop all columns but ID, isZero
  • Add cumZeros column with cumsum of isZero for each ID group
  • Add cumZerosAtLastBreak column which copies values from cumZeros for rows with cumZeros == cumZeros.shift(1), and otherwise is NaN, for each ID group (this gives us the cumulative number of zeros, but only for the non-zero rows that mark the break of a zero streak)
  • Update cumZerosAtLastBreak column by using ffill and fillna(0) (and downcast to int, just to be logically consistent) for each ID group
  • Filter original dataframe to keep only rows with cumZeros - cumZerosAtLastBreak < 4 (namely, rows with a zero streak length below 4).
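
To make the counter concrete, here is a small sketch for a single ID with made-up totals. It assumes the rows are already in date order, and it uses `Series.where` as a shortcut for the shift comparison described above; for one ID the two should produce the same values.

```python
import pandas as pd

# Hypothetical totals for one ID, already in date order.
total = pd.Series([0, 0, 0, 0, 0, 7, 0, 0], name="Total")

is_zero = total.eq(0)
cum_zeros = is_zero.astype(int).cumsum()

# Running zero count recorded at non-zero rows and carried forward;
# 0 before the first non-zero row.
at_last_break = cum_zeros.where(~is_zero).ffill().fillna(0).astype(int)

# Position of each row inside its current zero streak (0 for non-zero rows).
streak_pos = cum_zeros - at_last_break

# Keep rows whose position in the streak is below 4.
kept = total[streak_pos < 4]
print(pd.DataFrame({"Total": total, "streak_pos": streak_pos}))
```

Because the filter looks at the position within the streak rather than the streak's full length, the first three zeros of a long streak are kept and only the fourth and later ones are dropped.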

Sample input:

           ID Year_Month  Total
0   AL0010000    2020-01      0
1   AL0010000    2020-02      0
2   AL0010000    2020-03      0
3   AL0010000    2020-04      0
4   AL0010000    2020-05    123
5   AL0010000    2020-06      0
6   AL0010000    2020-07      0
7   AL0010000    2020-08      0
8   AL0010000    2020-09      0
9   AL0010000    2020-10      0
10  AL0010000    2020-11      0
11  AL0010000    2020-12    456
12  WV0550300    2021-01      0
13  WV0550300    2021-02      0
14  WV0550300    2021-03      0
15  WV0550300    2021-04      0
16  WV0550300    2021-05    123
17  WV0550300    2021-06      0
18  WV0550300    2021-07      0
19  WV0550300    2021-08      0
20  WV0550300    2021-09      0
21  WV0550300    2021-10      0
22  WV0550300    2021-11      0
23  WV0550300    2021-12    456

Output (the first three zeros of each long streak are kept; the fourth and later zeros are dropped):

           ID Year_Month  Total
0   AL0010000    2020-01      0
1   AL0010000    2020-02      0
2   AL0010000    2020-03      0
4   AL0010000    2020-05    123
5   AL0010000    2020-06      0
6   AL0010000    2020-07      0
7   AL0010000    2020-08      0
11  AL0010000    2020-12    456
12  WV0550300    2021-01      0
13  WV0550300    2021-02      0
14  WV0550300    2021-03      0
16  WV0550300    2021-05    123
17  WV0550300    2021-06      0
18  WV0550300    2021-07      0
19  WV0550300    2021-08      0
23  WV0550300    2021-12    456
Answered By: constantstranger