Search for String in all Pandas DataFrame columns and filter
Question:
Thought this would be straight forward but had some trouble tracking down an elegant way to search all columns in a dataframe at same time for a partial string match. Basically how would I apply df['col1'].str.contains('^')
to an entire dataframe at once and filter down to any rows that have records containing the match?
Answers:
The Series.str.contains
method expects a regex pattern (by default), not a literal string. Therefore str.contains("^")
matches the beginning of any string. Since every string has a beginning, everything matches. Instead use str.contains("^")
to match the literal ^
character.
To check every column, you could use for col in df
to iterate through the column names, and then call str.contains
on each column:
mask = np.column_stack([df[col].str.contains(r"^", na=False) for col in df])
df.loc[mask.any(axis=1)]
Alternatively, you could pass regex=False
to str.contains
to make the test use the Python in
operator; but (in general) using regex is faster.
Try with :
df.apply(lambda row: row.astype(str).str.contains('TEST').any(), axis=1)
posting my findings in case someone would need.
i had a Dataframe (360 000 rows), needed to search across the whole dataframe to find the rows (just a few) that contained word ‘TOTAL’ (any variation eg ‘TOTAL PRICE’, ‘TOTAL STEMS’ etc) and delete those rows.
i finally processed the dataframe in two-steps:
FIND COLUMNS THAT CONTAIN THE WORD:
for i in df.columns:
df[i].astype('str').apply(lambda x: print(df[i].name) if x.startswith('TOTAL') else 'pass')
DELETE THE ROWS:
df[df['LENGTH/ CMS'].str.contains('TOTAL') != True]
Here’s a function to solve the problem of doing text search in all column of a dataframe df
:
def search(regex: str, df, case=False):
"""Search all the text columns of `df`, return rows with any matches."""
textlikes = df.select_dtypes(include=[object, "string"])
return df[
textlikes.apply(
lambda column: column.str.contains(regex, regex=True, case=case, na=False)
).any(axis=1)
]
It differs from the existing answers by both staying in the pandas API and embracing that pandas is more efficient in column processing than row processing. Also, this is packed as a pure function 🙂
Relevant docs:
Here is an example using applymap. I found other answers didn’t work for me since they assumed that all data in a column would be strings causing Attribute Errors. Also it is surprisingly fast.
def search(dataFrame, item):
mask = (dataFrame.applymap(lambda x: isinstance(x, str) and item in x)).any(1)
return dataFrame[mask]
You can easily change the lambda to use regex if needed.
Yet another solution. This selects for columns of type object
, which is Panda’s type for strings. Other solutions that coerce to str with .astype(str)
could give false positives if you’re searching for a number (and want to exclude numeric columns and only search in strings — but if you want to include searching numeric columns it may be the better approach).
As an added benefit, filtering the columns in this way seems to have a performance benefit; on my dataframe of shape (15807, 35)
, with only 17 of those 35 being strings, I see 4.74 s ± 108 ms per loop
as compared to 5.72 s ± 155 ms
.
df[
df.select_dtypes(object)
.apply(lambda row: row.str.contains("with"), axis=1)
.any(axis=1)
]
Building on top of @unutbu’s answer https://stackoverflow.com/a/26641085/2839786
I use something like this:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> def search(df: pd.DataFrame, substring: str, case: bool = False) -> pd.DataFrame:
... mask = np.column_stack([df[col].astype(str).str.contains(substring.lower(), case=case, na=False) for col in df])
... return df.loc[mask.any(axis=1)]
>>>
>>> # test
>>> df = pd.DataFrame({'col1':['hello', 'world', 'Sun'], 'col2': ['today', 'sunny', 'foo'], 'col3': ['WORLD', 'NEWS', 'bar']})
>>> df
col1 col2 col3
0 hello today WORLD
1 world sunny NEWS
2 Sun foo bar
>>>
>>> search(df, 'sun')
col1 col2 col3
1 world sunny NEWS
2 Sun foo bar
Thought this would be straight forward but had some trouble tracking down an elegant way to search all columns in a dataframe at same time for a partial string match. Basically how would I apply df['col1'].str.contains('^')
to an entire dataframe at once and filter down to any rows that have records containing the match?
The Series.str.contains
method expects a regex pattern (by default), not a literal string. Therefore str.contains("^")
matches the beginning of any string. Since every string has a beginning, everything matches. Instead use str.contains("^")
to match the literal ^
character.
To check every column, you could use for col in df
to iterate through the column names, and then call str.contains
on each column:
mask = np.column_stack([df[col].str.contains(r"^", na=False) for col in df])
df.loc[mask.any(axis=1)]
Alternatively, you could pass regex=False
to str.contains
to make the test use the Python in
operator; but (in general) using regex is faster.
Try with :
df.apply(lambda row: row.astype(str).str.contains('TEST').any(), axis=1)
posting my findings in case someone would need.
i had a Dataframe (360 000 rows), needed to search across the whole dataframe to find the rows (just a few) that contained word ‘TOTAL’ (any variation eg ‘TOTAL PRICE’, ‘TOTAL STEMS’ etc) and delete those rows.
i finally processed the dataframe in two-steps:
FIND COLUMNS THAT CONTAIN THE WORD:
for i in df.columns:
df[i].astype('str').apply(lambda x: print(df[i].name) if x.startswith('TOTAL') else 'pass')
DELETE THE ROWS:
df[df['LENGTH/ CMS'].str.contains('TOTAL') != True]
Here’s a function to solve the problem of doing text search in all column of a dataframe df
:
def search(regex: str, df, case=False):
"""Search all the text columns of `df`, return rows with any matches."""
textlikes = df.select_dtypes(include=[object, "string"])
return df[
textlikes.apply(
lambda column: column.str.contains(regex, regex=True, case=case, na=False)
).any(axis=1)
]
It differs from the existing answers by both staying in the pandas API and embracing that pandas is more efficient in column processing than row processing. Also, this is packed as a pure function 🙂
Relevant docs:
Here is an example using applymap. I found other answers didn’t work for me since they assumed that all data in a column would be strings causing Attribute Errors. Also it is surprisingly fast.
def search(dataFrame, item):
mask = (dataFrame.applymap(lambda x: isinstance(x, str) and item in x)).any(1)
return dataFrame[mask]
You can easily change the lambda to use regex if needed.
Yet another solution. This selects for columns of type object
, which is Panda’s type for strings. Other solutions that coerce to str with .astype(str)
could give false positives if you’re searching for a number (and want to exclude numeric columns and only search in strings — but if you want to include searching numeric columns it may be the better approach).
As an added benefit, filtering the columns in this way seems to have a performance benefit; on my dataframe of shape (15807, 35)
, with only 17 of those 35 being strings, I see 4.74 s ± 108 ms per loop
as compared to 5.72 s ± 155 ms
.
df[
df.select_dtypes(object)
.apply(lambda row: row.str.contains("with"), axis=1)
.any(axis=1)
]
Building on top of @unutbu’s answer https://stackoverflow.com/a/26641085/2839786
I use something like this:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> def search(df: pd.DataFrame, substring: str, case: bool = False) -> pd.DataFrame:
... mask = np.column_stack([df[col].astype(str).str.contains(substring.lower(), case=case, na=False) for col in df])
... return df.loc[mask.any(axis=1)]
>>>
>>> # test
>>> df = pd.DataFrame({'col1':['hello', 'world', 'Sun'], 'col2': ['today', 'sunny', 'foo'], 'col3': ['WORLD', 'NEWS', 'bar']})
>>> df
col1 col2 col3
0 hello today WORLD
1 world sunny NEWS
2 Sun foo bar
>>>
>>> search(df, 'sun')
col1 col2 col3
1 world sunny NEWS
2 Sun foo bar