How to use str.contains() with multiple expressions in pandas dataframes
Question:
I’m wondering if there is a more efficient way to use the str.contains()
function in Pandas, to search for two partial strings at once. I want to search a given column in a dataframe for data that contains either "nt" or "nv". Right now, my code looks like this:
df[df['Behavior'].str.contains("nt", na=False)]
df[df['Behavior'].str.contains("nv", na=False)]
And then I append one result to another. What I’d like to do is use a single line of code to search for any data that includes "nt" OR "nv" OR "nf." I’ve played around with some ways that I thought should work, including just sticking a pipe between terms, but all of these result in errors. I’ve checked the documentation, but I don’t see this as an option. I get errors like this:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-113-1d11e906812c> in <module>()
3
4
----> 5 soctol = f_recs[f_recs['Behavior'].str.contains("nt"|"nv", na=False)]
6 soctol
TypeError: unsupported operand type(s) for |: 'str' and 'str'
Is there a fast way to do this?
Answers:
The alternatives should be a single regular expression, written inside one string:
"nt|nv" # rather than "nt" | "nv"
f_recs[f_recs['Behavior'].str.contains("nt|nv", na=False)]
Python doesn’t let you use the or (|) operator on strings:
In [1]: "nt" | "nv"
TypeError: unsupported operand type(s) for |: 'str' and 'str'
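To see the single-string pattern end to end, here is a minimal sketch. The DataFrame contents are made up for illustration (only the column name Behavior comes from the question), and None stands in for a missing value.

```python
import pandas as pd

# Hypothetical sample data; only the column name comes from the question.
df = pd.DataFrame({"Behavior": ["nt-1", "xnv", "abc", None, "nf"]})

# The "|" lives inside the pattern string, where the regex engine reads it as "or".
mask = df["Behavior"].str.contains("nt|nv|nf", na=False)

# Rows whose Behavior contains any of the three substrings.
matches = df[mask]
print(matches["Behavior"].tolist())  # ['nt-1', 'xnv', 'nf']
```

Because na=False turns the missing value into False, the mask is purely boolean and can be used directly for indexing.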
I tried this and it works:
df[df['Behavior'].str.contains('nt|nv', na=False)]
If you have the patterns in a list, it can be convenient to join them with a pipe (|) and pass the result to str.contains. Use na=False to return False for NaNs and, if needed, case=False to turn off case sensitivity.
lst = ['nt', 'nv', 'nf']
df['Behavior'].str.contains('|'.join(lst), na=False)
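One caveat with joining a list: if an entry happens to contain regex metacharacters (such as "." or "+"), it will be interpreted as a pattern rather than literal text. A possible variant, sketched here with made-up data and a hypothetical patterns list, escapes each entry with re.escape before joining and also shows case=False:

```python
import re
import pandas as pd

# Hypothetical data and search terms, for illustration only.
df = pd.DataFrame({"Behavior": ["NT", "a.b", "axb", None, "nv"]})
patterns = ["nt", "nv", "a.b"]  # "a.b" contains a regex metacharacter

# Escape each term so it is matched literally, then join with "|" (regex "or").
pattern = "|".join(re.escape(p) for p in patterns)

# case=False ignores case; na=False maps missing values to False.
mask = df["Behavior"].str.contains(pattern, case=False, na=False)
print(mask.tolist())  # [True, True, False, False, True]
```

Without the escaping step, "a.b" would also match "axb", because an unescaped dot matches any character.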
Otherwise, it might be cleaner to use a character class instead of spelling out each alternative. For the example in the OP, that is:
df['Behavior'].str.contains(r'n[tvf]', na=False)
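One thing to be aware of with bracket expressions: inside [...] the pipe is an ordinary character rather than "or", so n[tvf] and n[t|v|f] are not equivalent. A quick check on made-up values:

```python
import pandas as pd

# Hypothetical values, including one that contains a literal pipe.
s = pd.Series(["nt", "nv", "nf", "n|", "nx", None])

# Character class: "n" followed by exactly one of t, v, or f.
mask_class = s.str.contains(r"n[tvf]", na=False)

# With pipes inside the brackets, "|" itself becomes a member of the class.
mask_pipe = s.str.contains(r"n[t|v|f]", na=False)

print(mask_class.tolist())  # [True, True, True, False, False, False]
print(mask_pipe.tolist())   # [True, True, True, True, False, False]
```

The two only differ on strings that contain "n|", so on typical data either form returns the same rows, but the version without pipes says what is meant.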