# How to delete rows from a pandas DataFrame based on a conditional expression

## Question:

I have a pandas DataFrame and I want to delete rows from it where the length of the string in a particular column is greater than 2.

I expect to be able to do this (per this answer):

```
df[(len(df['column name']) < 2)]
```

but I just get the error:

```
KeyError: u'no item named False'
```

What am I doing wrong?

(Note: I know I can use `df.dropna()` to get rid of rows that contain any `NaN`, but I didn’t see how to remove rows based on a conditional expression.)

## Answers:

When you do `len(df['column name'])` you are just getting one number, namely the number of rows in the DataFrame (i.e., the length of the column itself). If you want to apply `len` to each element in the column, use `df['column name'].map(len)`. So try

```
df[df['column name'].map(len) < 2]
```
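For example, a minimal sketch with made-up values in the `'column name'` column:

```python
import pandas as pd

# Small illustrative frame; the values are made up
df = pd.DataFrame({'column name': ['ab', 'a', 'abc', 'x']})

# len(df['column name']) would be 4 (the row count), whereas
# .map(len) applies len to each string individually
filtered = df[df['column name'].map(len) < 2]
```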

To directly answer this question’s original title, "How to delete rows from a pandas DataFrame based on a conditional expression" (which, I understand, is not necessarily the OP’s problem but could help other users who come across this question), one way to do this is to use the `drop` method:

```
df = df.drop(some labels)                         # drop by explicit index labels
df = df.drop(df[<some boolean condition>].index)  # drop by a boolean condition
```

**Example**

To remove all rows where column ‘score’ is < 50:

```
df = df.drop(df[df.score < 50].index)
```

In-place version (as pointed out in the comments):

```
df.drop(df[df.score < 50].index, inplace=True)
```
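A quick self-contained check of both forms (the scores are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'score': [30, 60, 45, 80]})

# Copy-returning form: the original df is untouched
kept = df.drop(df[df.score < 50].index)

# In-place form, applied to a separate copy here
df2 = df.copy()
df2.drop(df2[df2.score < 50].index, inplace=True)
```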

**Multiple conditions**

(see Boolean Indexing)

The operators are `|` for `or`, `&` for `and`, and `~` for `not`. Each condition must be grouped in parentheses, because `&`, `|`, and `~` bind more tightly than comparison operators such as `<`.

To remove all rows where column ‘score’ is < 50 and > 20 (i.e., strictly between 20 and 50):

```
df = df.drop(df[(df.score < 50) & (df.score > 20)].index)
```
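As a runnable sketch (with made-up scores), only rows strictly between 20 and 50 are dropped:

```python
import pandas as pd

df = pd.DataFrame({'score': [10, 25, 40, 55]})

# Each condition is parenthesized before combining with &
df = df.drop(df[(df.score < 50) & (df.score > 20)].index)
```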

You can assign the `DataFrame` to a filtered version of itself:

```
df = df[df.score > 50]
```

This is faster than `drop`:

```
%%timeit
test = pd.DataFrame({'x': np.random.randn(int(1e6))})
test = test[test.x < 0]
# 54.5 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
test = pd.DataFrame({'x': np.random.randn(int(1e6))})
test.drop(test[test.x > 0].index, inplace=True)
# 201 ms ± 17.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
test = pd.DataFrame({'x': np.random.randn(int(1e6))})
test = test.drop(test[test.x > 0].index)
# 194 ms ± 7.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

In pandas you can use `str.len` with your boundary and use the Boolean result to filter:

```
df[df['column name'].str.len().lt(2)]
```
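A small sketch with made-up data, including a missing value to show how `str.len` handles it:

```python
import pandas as pd

# Made-up data; note the missing value
df = pd.DataFrame({'column name': ['ab', 'a', None, 'x']})

# .str.len() yields NaN for missing values, and NaN compared
# with .lt(2) is False, so the None row is filtered out too
short = df[df['column name'].str.len().lt(2)]
```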

If you want to drop rows of a DataFrame based on some complicated condition on a column’s values, writing it in the way shown above can get unwieldy. The following simpler approach always works. Assume the column you want to filter on has the header ‘name’; get that column as a list first:

```
text_data = df['name'].tolist()
```

Now apply some function to every element of the list and put the result in a pandas Series:

```
text_length = pd.Series([func(t) for t in text_data])
```

In my case I was just trying to get the number of tokens:

```
text_length = pd.Series([len(t.split()) for t in text_data])
```

Now add the series as an extra column of the DataFrame:

```
df = df.assign(text_length=text_length.values)
```

Now we can apply a condition on the new column, such as:

```
df = df[df.text_length > 10]
```

Wrapped up as a reusable function:

```
def pass_filter(df, label, length, pass_type):
    text_data = df[label].tolist()
    text_length = pd.Series([len(t.split()) for t in text_data])
    df = df.assign(text_length=text_length.values)
    if pass_type == 'high':
        df = df[df.text_length > length]
    if pass_type == 'low':
        df = df[df.text_length < length]
    df = df.drop(columns=['text_length'])
    return df
```
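A hypothetical usage with made-up documents (the function is repeated here so the snippet is self-contained):

```python
import pandas as pd

def pass_filter(df, label, length, pass_type):
    # Filter rows by token count of column `label` (same logic as above)
    text_length = pd.Series([len(t.split()) for t in df[label].tolist()])
    df = df.assign(text_length=text_length.values)
    if pass_type == 'high':
        df = df[df.text_length > length]
    if pass_type == 'low':
        df = df[df.text_length < length]
    return df.drop(columns=['text_length'])

docs = pd.DataFrame({'name': ['one two three', 'one', 'a b c d e']})
# Keep only rows whose 'name' has fewer than 3 tokens
short_docs = pass_filter(docs, label='name', length=3, pass_type='low')
```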

I will expand on @User’s generic solution to provide a `drop`-free alternative. This is for folks directed here by the question’s title (not the OP’s problem).

Say you want to delete all rows with negative values. A one-liner solution is:

```
df = df[(df > 0).all(axis=1)]
```

**Step-by-step explanation:**

**Let’s generate a 5×5 random normal DataFrame:**

```
np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,5), columns=list('ABCDE'))
A B C D E
0 1.764052 0.400157 0.978738 2.240893 1.867558
1 -0.977278 0.950088 -0.151357 -0.103219 0.410599
2 0.144044 1.454274 0.761038 0.121675 0.443863
3 0.333674 1.494079 -0.205158 0.313068 -0.854096
4 -2.552990 0.653619 0.864436 -0.742165 2.269755
```

**Let the condition be deleting negatives. A Boolean DataFrame satisfying the condition:**

```
df > 0
A B C D E
0 True True True True True
1 False True False False True
2 True True True True True
3 True True False True False
4 False True True False True
```

**A Boolean Series for all rows satisfying the condition.** Note that if any element in a row fails the condition, the row is marked False:

```
(df > 0).all(axis=1)
0 True
1 False
2 True
3 False
4 False
dtype: bool
```

**Finally, filter out rows from the DataFrame based on the condition:**

```
df[(df > 0).all(axis=1)]
A B C D E
0 1.764052 0.400157 0.978738 2.240893 1.867558
2 0.144044 1.454274 0.761038 0.121675 0.443863
```

You can assign it back to `df` to actually *delete* rather than just *filter* as done above:

`df = df[(df > 0).all(axis=1)]`

This can easily be extended to filter out rows containing NaNs (non-numeric entries):

`df = df[(~df.isnull()).all(axis=1)]`
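A quick check with made-up data that this NaN filter matches `dropna`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [4.0, 5.0, np.nan]})

# Keep only rows with no NaN anywhere; equivalent to df.dropna()
clean = df[(~df.isnull()).all(axis=1)]
```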

This can also be simplified for cases like: delete all rows where column E is negative:

```
df = df[(df.E>0)]
```

I would like to end with some profiling stats on why @User’s `drop` solution is slower than raw column-based filtration (`dft` below is a copy of `df`, so the in-place drop does not mutate the original):

```
%timeit df_new = df[(df.E>0)]
345 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit dft.drop(dft[dft.E < 0].index, inplace=True)
890 µs ± 94.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

A column is basically a `Series`, i.e. a NumPy array, so it can be indexed at essentially no cost. For folks interested in how the underlying memory organization plays into execution speed, see the many good resources on speeding up pandas.