Pandas drop rows with value less than a given value
Question:
I would like to delete rows that contain only values that are less than 10 and greater than 25. My sample dataframe will look like this:
a b c
1 2 3
4 5 16
11 24 22
26 50 65
Expected Output:
a b c
1 2 3
4 5 16
26 50 65
So if the row contains any value less than 10 or greater than 25, then the row will stay in dataframe, otherwise, it needs to be dropped.
Is there any way I can achieve this with Pandas instead of iterating through all the rows?
Answers:
You can call apply and return the results to a new column called ‘Keep’. You can then use this column to drop rows that you don’t need.
import pandas as pd
l = [[1,2,3],[4,5,6],[11,24,22],[26,50,65]]
df = pd.DataFrame(l, columns = ['a','b','c']) #Set up sample dataFrame
df['keep'] = df.apply(lambda row: sum(any([(x < 10) or (x > 25) for x in row])), axis = 1)
The any()
function returns a generator. Calling sum(generator)
simply returns the sum of all the results stored in the generator.
Check this on how any()
works.
Apply function still iterates over all the rows like a for loop, but the code looks cleaner this way. I cannot think of a way to do this without iterating over all the rows.
Output:
a b c keep
0 1 2 3 1
1 4 5 6 1
2 11 24 22 0
3 26 50 65 1
df = df[df['keep'] == 1] #Drop unwanted rows
You can use pandas boolean indexing
dropped_df = df.loc[((df<10) | (df>25)).any(1)]
df<10
will return a boolean df
|
is the OR operator
.any(1)
returns any true element over the axis 1 (rows) see documentation
df.loc[]
then filters the dataframe based on the boolean df
I really like using masking for stuff like this; it’s clean so you can go back and read your code. It’s faster than using .apply
too which is effectively for looping. Also, it avoids setting by copy warnings.
This uses boolean indexing like Prageeth’s answer. But the difference is I like how you can save the boolean index as a separate variable for re-use later. I often do that so I don’t have to modify the original dataframe or create a new one and just use df[mask]
wherever I want that cropped view of the dataframe.
df = pd.DataFrame(
[[1,2,3],
[4,5,16],
[11,24,22],
[26,50,65]],
columns=['a','b','c']
)
#use a mask to create a fully indexed boolean dataframe,
#which avoids the SettingWithCopyWarning:
#https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
mask = (df > 10) & (df < 25)
print(mask)
"""
a b c
0 False False False
1 False False True
2 True True True
3 False False False
"""
print(df[mask])
"""
a b c
0 NaN NaN NaN
1 NaN NaN 16.0
2 11.0 24.0 22.0
3 NaN NaN NaN
"""
print(df[mask].dropna())
"""
a b c
2 11.0 24.0 22.0
"""
#one neat things about using masking is you can invert them too with a '~'
print(~mask)
"""
a b c
0 True True True
1 True True False
2 False False False
3 True True True
"""
print( df[~mask].dropna())
"""
a b c
0 1.0 2.0 3.0
3 26.0 50.0 65.0
"""
#you can also combine masks
mask2 = mask & (df < 24)
print(mask2)
"""
a b c
0 False False False
1 False False True
2 True False False
3 False False False
"""
#and the resulting dataframe (without dropping the rows that are nan or contain any false mask)
print(df[mask2])
"""
a b c
0 NaN NaN NaN
1 NaN NaN 16.0
2 11.0 NaN 22.0
3 NaN NaN NaN
"""
I would like to delete rows that contain only values that are less than 10 and greater than 25. My sample dataframe will look like this:
a b c
1 2 3
4 5 16
11 24 22
26 50 65
Expected Output:
a b c
1 2 3
4 5 16
26 50 65
So if the row contains any value less than 10 or greater than 25, then the row will stay in dataframe, otherwise, it needs to be dropped.
Is there any way I can achieve this with Pandas instead of iterating through all the rows?
You can call apply and return the results to a new column called ‘Keep’. You can then use this column to drop rows that you don’t need.
import pandas as pd
l = [[1,2,3],[4,5,6],[11,24,22],[26,50,65]]
df = pd.DataFrame(l, columns = ['a','b','c']) #Set up sample dataFrame
df['keep'] = df.apply(lambda row: sum(any([(x < 10) or (x > 25) for x in row])), axis = 1)
The any()
function returns a generator. Calling sum(generator)
simply returns the sum of all the results stored in the generator.
Check this on how any()
works.
Apply function still iterates over all the rows like a for loop, but the code looks cleaner this way. I cannot think of a way to do this without iterating over all the rows.
Output:
a b c keep
0 1 2 3 1
1 4 5 6 1
2 11 24 22 0
3 26 50 65 1
df = df[df['keep'] == 1] #Drop unwanted rows
You can use pandas boolean indexing
dropped_df = df.loc[((df<10) | (df>25)).any(1)]
df<10
will return a boolean df|
is the OR operator.any(1)
returns any true element over the axis 1 (rows) see documentationdf.loc[]
then filters the dataframe based on the boolean df
I really like using masking for stuff like this; it’s clean so you can go back and read your code. It’s faster than using .apply
too which is effectively for looping. Also, it avoids setting by copy warnings.
This uses boolean indexing like Prageeth’s answer. But the difference is I like how you can save the boolean index as a separate variable for re-use later. I often do that so I don’t have to modify the original dataframe or create a new one and just use df[mask]
wherever I want that cropped view of the dataframe.
df = pd.DataFrame(
[[1,2,3],
[4,5,16],
[11,24,22],
[26,50,65]],
columns=['a','b','c']
)
#use a mask to create a fully indexed boolean dataframe,
#which avoids the SettingWithCopyWarning:
#https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
mask = (df > 10) & (df < 25)
print(mask)
"""
a b c
0 False False False
1 False False True
2 True True True
3 False False False
"""
print(df[mask])
"""
a b c
0 NaN NaN NaN
1 NaN NaN 16.0
2 11.0 24.0 22.0
3 NaN NaN NaN
"""
print(df[mask].dropna())
"""
a b c
2 11.0 24.0 22.0
"""
#one neat things about using masking is you can invert them too with a '~'
print(~mask)
"""
a b c
0 True True True
1 True True False
2 False False False
3 True True True
"""
print( df[~mask].dropna())
"""
a b c
0 1.0 2.0 3.0
3 26.0 50.0 65.0
"""
#you can also combine masks
mask2 = mask & (df < 24)
print(mask2)
"""
a b c
0 False False False
1 False False True
2 True False False
3 False False False
"""
#and the resulting dataframe (without dropping the rows that are nan or contain any false mask)
print(df[mask2])
"""
a b c
0 NaN NaN NaN
1 NaN NaN 16.0
2 11.0 NaN 22.0
3 NaN NaN NaN
"""