Delete all rows before the first appearance of a condition in a pandas data frame
Question:
I have the following data frame:
import pandas as pd

df = pd.DataFrame({"Person": [1, 1, 2, 2, 3, 3, 3, 3],
                   "Bank": ["B1", "B2", "B9", "B2", "B6", "B1", "B1", "B5"]})
Person Bank
0 1 B1
1 1 B2
2 2 B9
3 2 B2
4 3 B6
5 3 B1
6 3 B1
7 3 B5
For each person, I want to drop all the rows that come before the first appearance of B1. That is, I want to keep the first row where Bank == B1 and every row after it.
This is what I want to get:
Person Bank
0 1 B1
1 1 B2
5 3 B1
6 3 B1
7 3 B5
If B1 never appears for a person, all of that person's rows should be dropped. If there are rows before the first appearance of B1, I want to drop them.
Answers:
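You can do this with transform. idxmax gives the index label of the first B1 row in each group, and any filters out the groups that never contain B1:
s = (df['Bank'] == 'B1').groupby(df['Person'])
df[(df.index >= s.transform('idxmax')) & s.transform('any')]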
Out[305]:
Person Bank
0 1 B1
1 1 B2
5 3 B1
6 3 B1
7 3 B5
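To make the two pieces of that condition easier to inspect, here is a minimal sketch that names each intermediate Series. The names is_b1, first_b1 and has_b1 are only illustrative. Note that comparing df.index with the idxmax label assumes the index is sorted in increasing order, as it is with the default RangeIndex used here.

```python
import pandas as pd

df = pd.DataFrame({
    "Person": [1, 1, 2, 2, 3, 3, 3, 3],
    "Bank": ["B1", "B2", "B9", "B2", "B6", "B1", "B1", "B5"],
})

# Boolean Series marking the B1 rows, grouped by person.
is_b1 = df["Bank"].eq("B1")
grouped = is_b1.groupby(df["Person"])

# Index label of the first B1 row per person, broadcast to every row.
# For a person without any B1, idxmax falls back to that person's first row,
# which is why the separate "any" check below is needed.
first_b1 = grouped.transform("idxmax")

# True on every row of a person who has at least one B1.
has_b1 = grouped.transform("any")

# Assumes a sorted index (default RangeIndex), so ">=" means "at or after".
result = df[(df.index >= first_b1) & has_b1]
print(result)
```

If the index is not sorted, calling df.reset_index(drop=True) first is one way to satisfy that assumption.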
Using where + ffill
m = df['Bank'].where(df['Bank'] == 'B1').groupby(df['Person']).ffill()
df[m.notnull()]
Person Bank
0 1 B1
1 1 B2
5 3 B1
6 3 B1
7 3 B5
This works by making everything after the first occurrence in a group a non-null value. This is done in two steps:
1) Mask everything that isn’t B1, i.e. replace every other value with NaN.
df['Bank'].where(df['Bank'] == 'B1')
0 B1
1 NaN
2 NaN
3 NaN
4 NaN
5 B1
6 B1
7 NaN
Name: Bank, dtype: object
2) Fill forward per group. This is the real key to the answer: within each group, every row after the first occurrence of B1 is filled with a valid string, so it won’t be removed by notnull.
>>> m
0 B1
1 B1
2 NaN
3 NaN
4 NaN
5 B1
6 B1
7 B1
Name: Bank, dtype: object
Once we have the valid mask, it’s trivial to filter the DataFrame where the mask is not null.
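The two steps above can be wrapped in a small reusable function. This is only a sketch: keep_from_first and its parameter names are hypothetical, not part of pandas, and it assumes the rows are already in the order you care about within each group.

```python
import pandas as pd


def keep_from_first(df, group_col, value_col, target):
    """Keep, per group, the first row equal to `target` and all rows after it.

    Hypothetical helper for illustration; groups without `target` are dropped.
    """
    # Step 1: keep `target` values, turn everything else into NaN.
    hits = df[value_col].where(df[value_col] == target)
    # Step 2: forward-fill within each group, so rows after the first hit
    # become non-null while rows before it stay NaN.
    filled = hits.groupby(df[group_col]).ffill()
    return df[filled.notna()]


df = pd.DataFrame({
    "Person": [1, 1, 2, 2, 3, 3, 3, 3],
    "Bank": ["B1", "B2", "B9", "B2", "B6", "B1", "B1", "B5"],
})

result = keep_from_first(df, "Person", "Bank", "B1")
print(result)
```

Passing a different target, for example "B2", reuses the same logic for another bank.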
Using cumsum and converting the result to bool (astype(bool))
df[df.groupby('Person').Bank.transform(lambda s: s.eq('B1').cumsum().astype(bool))]
Person Bank
0 1 B1
1 1 B2
5 3 B1
6 3 B1
7 3 B5
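The reason this works is that the running count of B1 rows within a person is 0 before the first B1 and at least 1 from then on. A lambda-free variant of the same idea, shown here as a sketch with illustrative names (hits, seen), makes that count visible:

```python
import pandas as pd

df = pd.DataFrame({
    "Person": [1, 1, 2, 2, 3, 3, 3, 3],
    "Bank": ["B1", "B2", "B9", "B2", "B6", "B1", "B1", "B5"],
})

# 1 on B1 rows, 0 elsewhere.
hits = df["Bank"].eq("B1").astype(int)

# Running count of B1 rows per person: 0 until the first B1 is reached.
seen = hits.groupby(df["Person"]).cumsum()

# Keep the rows where at least one B1 has been seen so far for that person.
result = df[seen.gt(0)]
print(result)
```

Person 2 never reaches a count above 0, so all of that person's rows are dropped without a separate check.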