Subset pandas dataframe up to when condition is met the first time
Question:
I have not had any luck accomplishing a task where I want to subset a pandas dataframe up to the point where a condition is first met, grouping by id. In the actual dataset I have several columns between 'id' and 'status'.
For example:
import pandas as pd

d = {'id': [1,1,1,1,1,1,1,2,2,2,2,2,2,2], 'status': [0,0,0,0,1,1,1,0,0,0,0,1,0,1]}
df = pd.DataFrame(data=d)
id status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 1 1
6 1 1
7 2 0
8 2 0
9 2 0
10 2 0
11 2 1
12 2 0
13 2 1
The desired subset would be:
id status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
6 2 0
7 2 0
8 2 0
9 2 1
Answers:
Let's try groupby + cumsum:
df = (df.groupby('id', group_keys=False)
        .apply(lambda x: x[x.status.cumsum().cumsum().le(1)])
        .reset_index(drop=True))
df
id status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
6 2 0
7 2 0
8 2 0
9 2 1
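To see why the cumsum is applied twice: a single cumsum only counts the 1s seen so far, so a 0 that comes after the first 1 (as in id 2) would still pass le(1). The second cumsum pushes every position after the first 1 above 1, so the filter stops exactly there. A minimal sketch:

```python
import pandas as pd

# Status values for one id, with a trailing 0 after the first 1
s = pd.Series([0, 0, 1, 0, 1])

once = s.cumsum()            # [0, 0, 1, 1, 2] -> le(1) would keep the trailing 0
twice = s.cumsum().cumsum()  # [0, 0, 1, 2, 4] -> le(1) stops at the first 1

print(once.le(1).tolist())   # [True, True, True, True, False]
print(twice.le(1).tolist())  # [True, True, True, False, False]
```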
Here's an alternative that performs a groupby to create a mask to be used as an indexer:
df = (df[df.status.eq(1).groupby(df.id)
           .apply(lambda x: x.cumsum().cumsum().le(1))]
        .reset_index(drop=True))
df
id status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
6 2 0
7 2 0
8 2 0
9 2 1
Is there a faster way to do this? x.cumsum().cumsum().le(1) inside apply is taking too much time when I have 70k rows.
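One way to avoid the per-group lambda entirely is to use the vectorized groupby.cumsum on the whole column: the running count of 1s within each id tells you whether the first 1 has already passed. A sketch of this approach (same sample data as above):

```python
import pandas as pd

d = {'id': [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
     'status': [0,0,0,0,1,1,1,0,0,0,0,1,0,1]}
df = pd.DataFrame(data=d)

# Running count of 1s within each id (fully vectorized, no apply)
s = df.groupby('id')['status'].cumsum()

# Keep rows before the first 1 (count still 0) and the first 1 itself
# (count just became 1 on a status==1 row)
mask = s.lt(1) | (s.eq(1) & df['status'].eq(1))
out = df[mask].reset_index(drop=True)
```

Because everything runs as whole-column operations, this should scale much better than calling a Python lambda once per group.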