Search for "does-not-contain" on a DataFrame in pandas

Question:

I’ve done some searching and can’t figure out how to filter a dataframe by

df["col"].str.contains(word)

However, I’m wondering if there is a way to do the reverse: filter a dataframe by that set’s complement, e.g. to the effect of

!(df["col"].str.contains(word))

Can this be done through a DataFrame method?

Asked By: stites


Answers:

You can use the invert (~) operator (which acts like a not for boolean data):

new_df = df[~df["col"].str.contains(word)]

where new_df is the copy returned by RHS.

contains also accepts a regular expression…


If the above throws a ValueError or TypeError, the likely reason is that the column has mixed datatypes or missing values, so use na=False:

new_df = df[~df["col"].str.contains(word, na=False)]

Or,

new_df = df[df["col"].str.contains(word) == False]
Answered By: Andy Hayden
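
As a minimal sketch of the approach above, here is a self-contained example. The sample data and the value of word are made up for illustration; only str.contains, na=False and ~ come from the answer.

```python
import pandas as pd

# Hypothetical sample data; the None entry becomes a missing value.
df = pd.DataFrame({"col": ["apple pie", "banana", None, "pineapple"]})
word = "apple"

# na=False treats missing values as "does not contain the word",
# so the mask is purely boolean and can be inverted safely with ~.
mask = df["col"].str.contains(word, na=False)
new_df = df[~mask]

print(new_df)
```

With na=False the row holding the missing value is kept in new_df, because it is counted as not containing the word.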

I had to get rid of the NULL values before using the command recommended by Andy above. An example:

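import pandas as pd

word = 'myword'
df = pd.DataFrame(index=[0, 1, 2], columns=['first', 'second', 'third'])
# .ix was removed in pandas 1.0; .loc is the label-based replacement
df.loc[:, 'first'] = 'myword'
df.loc[0, 'second'] = 'myword'
df.loc[2, 'second'] = 'myword'
df.loc[1, 'third'] = 'myword'
df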

    first   second  third
0   myword  myword   NaN
1   myword  NaN      myword 
2   myword  myword   NaN

Now running the command:

~df["second"].str.contains(word)

I get the following error:

TypeError: bad operand type for unary ~: 'float'

I got rid of the NULL values using dropna() or fillna() first and retried the command with no problem.
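
A rough sketch of the two workarounds mentioned here, assuming a column named second like the one in the example. The exact calls (fillna with an empty string, dropna with subset) are one possible way to do it, not necessarily what was originally run.

```python
import pandas as pd

# Hypothetical data mirroring the 'second' column from the example.
df = pd.DataFrame({"second": ["myword", None, "myword"]})
word = "myword"

# Option 1: replace missing values with an empty string, then negate.
filled = df[~df["second"].fillna("").str.contains(word)]

# Option 2: drop rows with a missing value in the column, then negate.
dropped = df.dropna(subset=["second"])
dropped = dropped[~dropped["second"].str.contains(word)]
```

The two options differ in what happens to the missing rows: fillna keeps them as non-matches, while dropna removes them before filtering.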

Answered By: Shoresh

I was having trouble with the not (~) symbol as well, so here’s another way from another Stack Overflow thread:

df[df["col"].str.contains('this|that')==False]
Answered By: nanselm2

In addition to nanselm2’s answer, you can use 0 instead of False:

df["col"].str.contains(word)==0
Answered By: U13-Forward

You can use apply with a lambda:

df[df["col"].apply(lambda x: word not in x)]

Or, if you want to define a more complex rule, you can use and:

df[df["col"].apply(lambda x: word_1 not in x and word_2 not in x)]
Answered By: Arash
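
One caveat with the lambda approach is that `word not in x` raises a TypeError when x is a missing value rather than a string. Below is a sketch of a variant that guards against that. The helper name lacks_words and the sample data are hypothetical, and keeping non-string cells is just one possible choice.

```python
import pandas as pd

# Hypothetical sample data with one missing value.
df = pd.DataFrame({"col": ["red fox", "blue jay", None, "red panda"]})
word_1, word_2 = "fox", "jay"

def lacks_words(value):
    # Non-string cells (such as missing values) cannot contain either word,
    # so this sketch keeps them; return False here to drop them instead.
    if not isinstance(value, str):
        return True
    return word_1 not in value and word_2 not in value

result = df[df["col"].apply(lacks_words)]
print(result)
```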

The answers above already cover the basic case.

I am adding a pattern for searching for multiple words and excluding the matching rows from a DataFrame.

Here 'word1', 'word2', 'word3', 'word4' = list of patterns to search

df = DataFrame

column_a = a column name from DataFrame df

values_to_remove = ['word1','word2','word3','word4'] 

pattern = '|'.join(values_to_remove)

result = df.loc[~df['column_a'].str.contains(pattern, case=False)]
Answered By: Nursnaaz
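
Since the joined string is treated as a regular expression, words containing characters such as ( or + may not match literally. A possible refinement, sketched below, is to pass each word through re.escape before joining. The sample data and word list are invented for illustration, and na=False is added on the assumption that the column may contain missing values.

```python
import re
import pandas as pd

# Hypothetical data; some values contain regex metacharacters.
df = pd.DataFrame({"column_a": ["Price (USD)", "c++ notes", "plain text", None]})
values_to_remove = ["(usd)", "c++"]

# re.escape makes every word match literally instead of as regex syntax.
pattern = "|".join(re.escape(v) for v in values_to_remove)

# case=False ignores case; na=False counts missing values as non-matches.
result = df.loc[~df["column_a"].str.contains(pattern, case=False, na=False)]
print(result)
```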

To complement the above, if someone wants to remove all the rows that hold strings, one could do:

df_new=df[~df['col_name'].apply(lambda x: isinstance(x, str))]
Answered By: vasanth
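
For illustration, here is how that type-based filter might behave on a small mixed-type column (the data is made up):

```python
import pandas as pd

# Hypothetical column mixing numbers and strings.
df = pd.DataFrame({"col_name": [1, "text", 2.5, "more text"]})

# Keep only the rows whose value is not a Python str.
is_string = df["col_name"].apply(lambda x: isinstance(x, str))
df_new = df[~is_string]
print(df_new)
```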

To negate your query use ~. Using query has the advantage of returning the valid observations of df directly:

df.query('~col.str.contains("word").values')
Answered By: rachwa

Somehow ‘.contains’ didn’t work for me, but when I tried ‘.isin’, as mentioned by @kenan in the answer (How to drop rows from pandas data frame that contains a particular string in a particular column?), it worked. Note that ‘.isin’ matches whole cell values rather than substrings. Additionally, if you want to look at the entire dataframe and remove the rows that contain a specific word (or set of words), just use the loop below:

for col in df.columns:
    df = df[~df[col].isin(['string or string list separated by comma'])]

Just remove ~ to get the dataframe that contains the word.

Answered By: Bhanu Chander
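
To make the difference between the two methods concrete, here is a small sketch on made-up data. It runs the same loop once with isin (exact cell match) and once with str.contains (substring match). The astype(str) call is an assumption added so the loop also works on non-text columns.

```python
import pandas as pd

# Hypothetical data: "word" appears as a whole cell and inside a longer cell.
df = pd.DataFrame({"a": ["word", "other", "ok"],
                   "b": ["fine", "a word here", "ok"]})

# isin compares whole cell values, so only the exact cell "word" is removed.
exact = df.copy()
for col in exact.columns:
    exact = exact[~exact[col].isin(["word"])]

# str.contains looks for substrings, so "a word here" is removed as well.
substring = df.copy()
for col in substring.columns:
    substring = substring[~substring[col].astype(str).str.contains("word")]

print(exact)
print(substring)
```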

To add clarity to the top answer, the general pattern for filtering out all columns whose names contain a specific word is:

# Remove any column with "word" in the name
new_df = df.loc[:, ~df.columns.str.contains("word")]

# Filter multiple words
new_df = df.loc[:, ~df.columns.str.contains("word1|word2")]
Answered By: Kyle Bennison
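
A quick runnable check of the column-filtering pattern, using invented column names:

```python
import pandas as pd

# Hypothetical frame; two of the column names contain "word".
df = pd.DataFrame({"word_count": [1], "id": [2], "keyword": [3], "name": [4]})

# df.columns.str.contains returns a boolean array over the column labels;
# ~ inverts it so only the non-matching columns are selected.
new_df = df.loc[:, ~df.columns.str.contains("word")]
print(list(new_df.columns))
```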