Total count of strings within range in dataframe
Question:
I have a dataframe where I want to count the total number of occurrences of the word Yes as it appears between a range of rows (delimited by Dir), and then add that count as a new column.
Type,Problem
Parent,
Dir,Yes
File,
Opp,Yes
Dir,
Metadata,
Subfolder,Yes
Dir,
Opp,Yes
So whenever the word Yes appears in the Problem column between two Dir rows, I need a count to then appear next to the Dir at the beginning of the range.
Expected output would be:
Type       Problem  yes_count
Parent
Dir        Yes      2
File
Opp        Yes
Dir                 1
Metadata
Subfolder  Yes
Dir                 1
Opp        Yes
I could do something like yes_count = df['Problem'].str.count('Yes').sum() to get part of the way there. But how do I also account for the range?
Answers:
Use:
# is the row a "Yes"?
m1 = df['Problem'].eq('Yes')
# is the row a "Dir"?
m2 = df['Type'].eq('Dir')
# form groups starting on each "Dir"
g = m1.groupby(m2.cumsum())
# count the number of "Yes" per group
# assign only on "Dir"
df['yes_count'] = g.transform('sum').where(m2)
Output:
        Type Problem  yes_count
0     Parent     NaN        NaN
1        Dir     Yes        2.0
2       File     NaN        NaN
3        Opp     Yes        NaN
4        Dir     NaN        1.0
5   Metadata     NaN        NaN
6  Subfolder     Yes        NaN
7        Dir     NaN        1.0
8        Opp     Yes        NaN
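The counts above come out as floats (2.0, 1.0) because the rows that are not Dir hold NaN. If whole numbers are wanted, as in the expected output, one possible variation is sketched below. It rebuilds the sample data from the question so it can be run on its own, prints the intermediate group ids to show how the cumulative sum splits the rows, and casts the result to pandas' nullable Int64 dtype. The names csv_text, is_yes, is_dir and group_id are illustrative and not part of the original answer.

```python
import io

import pandas as pd

# Sample data copied from the question (empty Problem cells become NaN).
csv_text = """Type,Problem
Parent,
Dir,Yes
File,
Opp,Yes
Dir,
Metadata,
Subfolder,Yes
Dir,
Opp,Yes
"""
df = pd.read_csv(io.StringIO(csv_text))

# Same masks as in the answer, under longer (hypothetical) names.
is_yes = df["Problem"].eq("Yes")
is_dir = df["Type"].eq("Dir")

# Each "Dir" row bumps the running total, so every row from one "Dir"
# up to (but not including) the next shares the same group id.
group_id = is_dir.cumsum()
print(group_id.tolist())  # [0, 1, 1, 1, 2, 2, 2, 3, 3]

# Sum the "Yes" flags per group and keep the value only on "Dir" rows.
counts = is_yes.astype("int64").groupby(group_id).transform("sum")

# Assumption: integer display is preferred, so cast to nullable Int64,
# which shows <NA> instead of forcing the column to float.
df["yes_count"] = counts.where(is_dir).astype("Int64")
print(df)
```

Rows before the first Dir (Parent here) fall into group 0 and are blanked by the final where, so they never receive a count.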