Pandas drop rows with value less than a given value

Question:

I would like to delete rows that contain only values that are less than 10 and greater than 25. My sample dataframe will look like this:

a   b   c  
1   2   3  
4   5   16  
11  24  22  
26  50  65  

Expected Output:

a   b   c  
1   2   3  
4   5   16   
26  50  65  

So if the row contains any value less than 10 or greater than 25, then the row will stay in dataframe, otherwise, it needs to be dropped.

Is there any way I can achieve this with Pandas instead of iterating through all the rows?

Asked By: Jaswanth Kumar

||

Answers:

You can call apply and return the results to a new column called ‘Keep’. You can then use this column to drop rows that you don’t need.

import pandas as pd
l = [[1,2,3],[4,5,6],[11,24,22],[26,50,65]]
df = pd.DataFrame(l, columns = ['a','b','c']) #Set up sample dataFrame

df['keep'] = df.apply(lambda row: sum(any([(x < 10) or (x > 25) for x in row])), axis = 1)

The any() function returns a generator. Calling sum(generator) simply returns the sum of all the results stored in the generator.

Check this on how any() works.
Apply function still iterates over all the rows like a for loop, but the code looks cleaner this way. I cannot think of a way to do this without iterating over all the rows.

Output:

    a   b   c  keep
0   1   2   3     1
1   4   5   6     1
2  11  24  22     0
3  26  50  65     1


df = df[df['keep'] == 1] #Drop unwanted rows
Answered By: Rakesh Adhikesavan

You can use pandas boolean indexing

dropped_df = df.loc[((df<10) | (df>25)).any(1)]
  • df<10 will return a boolean df
  • | is the OR operator
  • .any(1) returns any true element over the axis 1 (rows) see documentation
  • df.loc[] then filters the dataframe based on the boolean df
Answered By: Prageeth Jayathissa

I really like using masking for stuff like this; it’s clean so you can go back and read your code. It’s faster than using .apply too which is effectively for looping. Also, it avoids setting by copy warnings.

This uses boolean indexing like Prageeth’s answer. But the difference is I like how you can save the boolean index as a separate variable for re-use later. I often do that so I don’t have to modify the original dataframe or create a new one and just use df[mask] wherever I want that cropped view of the dataframe.

df = pd.DataFrame(
    [[1,2,3],
    [4,5,16],
    [11,24,22],
    [26,50,65]],
    columns=['a','b','c']
)

#use a mask to create a fully indexed boolean dataframe, 
#which avoids the SettingWithCopyWarning:
#https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
mask = (df > 10) & (df < 25)
print(mask)
"""
       a      b      c
0  False  False  False
1  False  False   True
2   True   True   True
3  False  False  False
"""

print(df[mask])
"""
      a     b     c
0   NaN   NaN   NaN
1   NaN   NaN  16.0
2  11.0  24.0  22.0
3   NaN   NaN   NaN
"""

print(df[mask].dropna())
"""
      a     b     c
2  11.0  24.0  22.0
"""

#one neat things about using masking is you can invert them too with a '~'
print(~mask)
"""
       a      b      c
0   True   True   True
1   True   True  False
2  False  False  False
3   True   True   True
"""

print( df[~mask].dropna())
"""
      a     b     c
0   1.0   2.0   3.0
3  26.0  50.0  65.0
"""

#you can also combine masks
mask2 = mask & (df < 24)
print(mask2)
"""
       a      b      c
0  False  False  False
1  False  False   True
2   True  False  False
3  False  False  False
"""

#and the resulting dataframe (without dropping the rows that are nan or contain any false mask)
print(df[mask2])
"""
      a   b     c
0   NaN NaN   NaN
1   NaN NaN  16.0
2  11.0 NaN  22.0
3   NaN NaN   NaN
"""
Answered By: kilozulu
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.