Function that returns a dataframe without leading 0s of a specific column
Question:
I have the following dataframe:
import pandas as pd

df = pd.DataFrame({
    'n': [0, 1, 2, 3, 0, 1, 2, 0, 1, 2],
    'col1': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
    'col2': [0, 0, 0, 0, 3.3, 0, 4, 1.94, 0, 6.17]
})
It has the form:
n col1 col2
0 0 A 0.00
1 1 A 0.00
2 2 A 0.00
3 3 B 0.00
4 0 B 3.30
5 1 B 0.00
6 2 B 4.00
7 0 C 1.94
8 1 C 0.00
9 2 C 6.17
I want a function that takes that dataframe as an argument and returns a new dataframe without the leading rows where the values in column 'col2' are 0.
My code
def remove_lead_zeros(df):
    new_df = df[df['col2'] != 0]
    return new_df
My function removes all rows with 0.0 values, whereas I want to remove only the leading ones.
Goal
I want to get the following dataframe as the result:
n col1 col2
0 0 B 3.30
1 1 B 0.00
2 2 B 4.00
3 0 C 1.94
4 1 C 0.00
5 2 C 6.17
Any help from your side will be highly appreciated (Upvoting all answers), thank you !
Answers:
Use groupby.cummax on the boolean series of non-zero col2 values, then filter with boolean indexing:
out = df[df['col2'].ne(0).groupby(df['col1']).cummax()]
Output:
n col1 col2
4 0 B 3.30
5 1 B 0.00
6 2 B 4.00
7 0 C 1.94
8 1 C 0.00
9 2 C 6.17
Intermediates to understand the logic:
n col1 col2 ne(0) groupby.cummax
0 0 A 0.00 False False
1 1 A 0.00 False False
2 2 A 0.00 False False
3 3 B 0.00 False False
4 0 B 3.30 True True
5 1 B 0.00 False True
6 2 B 4.00 True True
7 0 C 1.94 True True
8 1 C 0.00 False True
9 2 C 6.17 True True
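Since the question asks for a function and for a fresh 0..5 index, here is a minimal sketch that wraps the groupby.cummax mask above. The parameter names value_col and group_col are my own additions for illustration, and the trailing astype(bool) is only a defensive cast in case the grouped cummax does not come back as booleans on some pandas version.

```python
import pandas as pd

def remove_lead_zeros(df, value_col="col2", group_col="col1"):
    # True wherever the value is non-zero
    nonzero = df[value_col].ne(0)
    # Within each group, the mask flips to True at the first non-zero
    # value and stays True for the rest of that group
    keep = nonzero.groupby(df[group_col]).cummax().astype(bool)
    # Drop the leading-zero rows and renumber the index from 0
    return df[keep].reset_index(drop=True)

df = pd.DataFrame({
    "n": [0, 1, 2, 3, 0, 1, 2, 0, 1, 2],
    "col1": ["A", "A", "A", "B", "B", "B", "B", "C", "C", "C"],
    "col2": [0, 0, 0, 0, 3.3, 0, 4, 1.94, 0, 6.17],
})
result = remove_lead_zeros(df)
```

Drop the reset_index call if you would rather keep the original row labels (4 to 9), as in the output shown above.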
You can use cumsum:
>>> df[df.groupby('col1')['col2'].cumsum().ne(0)]
n col1 col2
4 0 B 3.30
5 1 B 0.00
6 2 B 4.00
7 0 C 1.94
8 1 C 0.00
9 2 C 6.17
As long as the cumulative sum within a group is still 0, only leading zeros have been seen so far (this assumes the values are non-negative):
>>> pd.concat([df, df.groupby('col1')['col2'].cumsum()], axis=1)
n col1 col2 col2
0 0 A 0.00 0.00 # remove
1 1 A 0.00 0.00 # remove
2 2 A 0.00 0.00 # remove
3 3 B 0.00 0.00 # remove
4 0 B 3.30 3.30 # keep
5 1 B 0.00 3.30 # keep
6 2 B 4.00 7.30 # keep
7 0 C 1.94 1.94 # keep
8 1 C 0.00 1.94 # keep
9 2 C 6.17 8.11 # keep
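One caveat worth checking on your own data: the cumsum test relies on the running total never returning to 0 once a non-zero value has appeared. That holds for the non-negative values in the question, but not if col2 can contain negatives. The small frame below is made up purely to illustrate this and compares the cumsum mask with the cummax mask from the earlier answer.

```python
import pandas as pd

# Hypothetical data: one group whose values cancel out part-way through
demo = pd.DataFrame({
    "col1": ["A", "A", "A", "A"],
    "col2": [0, 2, -2, 5],
})

# Running total is 0, 2, 0, 5, so the third row looks like a leading zero
cumsum_mask = demo.groupby("col1")["col2"].cumsum().ne(0)

# Non-zero flags are F, T, T, T, and the cumulative max keeps them that way
cummax_mask = demo["col2"].ne(0).groupby(demo["col1"]).cummax().astype(bool)
```

Here cumsum_mask would drop the row holding -2 even though it is not a leading zero, while cummax_mask keeps it.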
First, get a boolean array marking where col2 is not 0, then take its cumulative max to get a mask that you can apply to your dataframe. Finally, reset the index to get the result you want:
result = df[(df["col2"] != 0).cummax()].reset_index(drop=True)
where result looks like:
n col1 col2
0 0 B 3.30
1 1 B 0.00
2 2 B 4.00
3 0 C 1.94
4 1 C 0.00
5 2 C 6.17
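Note that this version only trims zeros at the very top of the dataframe, without looking at col1. On the question's data that gives the same rows as the groupby variants, because every zero before the first non-zero value sits at the top. The question does not say whether zeros should be trimmed per col1 group, so here is a small made-up frame showing where the two approaches would differ.

```python
import pandas as pd

# Hypothetical data: group B starts with a zero after group A already
# contains a non-zero value
demo = pd.DataFrame({
    "col1": ["A", "A", "B", "B"],
    "col2": [0, 1, 0, 2],
})

nonzero = demo["col2"] != 0

# Global mask: True from the first non-zero value in the whole column
global_mask = nonzero.cummax()

# Per-group mask: restarts at False for each new value of col1
grouped_mask = nonzero.groupby(demo["col1"]).cummax().astype(bool)
```

With the global mask the leading zero of group B is kept, while the per-group mask removes it. Pick whichever matches what "leading" means for your data.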