python – efficient way of checking if part of string is in the list
Question:
I have a huge string like:
The Dormouse’s story. Once upon a time there were three little
sisters; and their names were Elsie, Lacie and Tillie; and they lived
at the bottom of a well….badword…
and I have a list of around 400 bad words:
bad_words = ["badword", "badword1", ....]
What is the most efficient way to check whether the text contains a bad word from the bad_words list?
I could loop over both text and list like:
for word in huge_string:
    for bw in bad_words_list:
        if bw in word:
            # print "bad word is inside text"...
But this seems to me to be from the 90s.
Update: bad words are single words.
Answers:
There is no need to get all the words of the text; you can directly check whether one string is contained in another, e.g.:
In [1]: 'bad word' in 'do not say bad words!'
Out[1]: True
So you can just do:
for bad_word in bad_words_list:
    if bad_word in huge_string:
        print("BAD!!")
Turning your text into a set of words and computing its intersection with the set of bad words avoids the nested loop, because set membership checks take constant time on average:
text = "The Dormouse's story. Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well....badword..."
badwords = set(["badword", "badword1", ....])
textwords = set(text.split())
for badword in badwords.intersection(textwords):
    print("The bad word '{}' was found in the text".format(badword))
Something like:
st = set(s.split())
bad_words = ["badword", "badword1"]
any(bad in st for bad in bad_words)
Or if you want the words:
st = set(s.split())
bad_words = {"badword", "badword1"}
print(st.intersection(bad_words))
If the text contains words with punctuation attached, such as a sentence that ends in "badword." or "badword!", then the set method will fail. You will actually have to go over each word in the string and check whether any bad word is the same as the word or a substring of it.
st = s.split()
any(bad in word for word in st for bad in bad_words)
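If only whole words should count, one possible way around the punctuation problem is to tokenize with a regular expression instead of str.split(), so that "badword." and "badword!" both produce the token "badword". This is a minimal sketch, not part of the original answer; the function name contains_bad_word and the lower-casing step are assumptions made for illustration.

```python
import re

def contains_bad_word(text, bad_words):
    """Return True if any whole word of text is in bad_words (case-insensitive)."""
    bad = {w.lower() for w in bad_words}
    # \w+ matches runs of letters, digits and underscores, so leading or
    # trailing punctuation never ends up inside a token.
    tokens = re.findall(r"\w+", text.lower())
    return any(token in bad for token in tokens)

print(contains_bad_word("at the bottom of a well....badword...", ["badword", "badword1"]))
```

Unlike the substring check above, this does not flag "badwords" when only "badword" is listed, so it fits the "bad words are single words" case.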
You can use any().
To test whether a bad word occurs anywhere in the text, including as a prefix or suffix of a longer word:
>>> bad_words = ["badword", "badword1"]
>>> text ="some text with badwords or not"
>>> any(i in text for i in bad_words)
True
>>> text ="some text with words or not"
>>> any(i in text for i in bad_words)
False
This checks whether any item of bad_words occurs in text as a substring.
To test exact matches:
>>> text ="some text with badwords or not"
>>> any(i in text.split() for i in bad_words)
False
>>> text ="some text with badword or not"
>>> any(i in text.split() for i in bad_words)
True
This checks whether any item of bad_words is in text.split(), that is, whether it matches a whole word exactly.
s is the long string. Use the & operator or the set.intersection method.
In [123]: set(s.split()) & set(bad_words)
Out[123]: {'badword'}
In [124]: bool(set(s.split()) & set(bad_words))
Out[124]: True
Or, even better, use set.isdisjoint. This will short-circuit as soon as a match is found.
In [127]: bad_words = set(bad_words)
In [128]: not bad_words.isdisjoint(s.split())
Out[128]: True
In [129]: not bad_words.isdisjoint('for bar spam'.split())
Out[129]: False
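Because set.isdisjoint accepts any iterable, the words can also be cleaned up on the fly before they are compared. The following is a rough sketch only; the helper name has_bad_word and the choice to strip punctuation and lower-case each word are assumptions, not something the answer above prescribes.

```python
import string

def has_bad_word(text, bad_words):
    """Return True if any whitespace-separated word of text is a bad word."""
    bad = set(bad_words)
    # Generator expression: each word is cleaned only when isdisjoint
    # asks for it, so nothing after the first match gets processed.
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return not bad.isdisjoint(words)

print(has_bad_word("Do not say badword!", ["badword", "badword1"]))  # True
print(has_bad_word("for bar spam", ["badword", "badword1"]))         # False
```

Note that text.split() still builds the full word list up front; only the cleaning and the comparison stop early.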
s = " a string with bad word"
text = s.split()
if any(bad_word in text for bad_word in ('bad', 'bad2')):
    print("bad word found")
On top of all the excellent answers, the "for now, whole words" clause in your comment points in the direction of regular expressions.
You may want to build a composed expression like bad|otherbad|yetanother:
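import re
r = re.compile("|".join(map(re.escape, badwords)))  # escape any regex metacharacters
r.search(text)  # returns a match object, or None if no bad word occurs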
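To restrict the composed expression to whole words, the alternatives can be wrapped in \b word-boundary anchors. This is a sketch under assumptions: the helper name compile_bad_word_pattern and the case-insensitive flag are illustrative choices, not part of the answer above.

```python
import re

def compile_bad_word_pattern(bad_words):
    """Build one regex that matches any of bad_words as a whole word."""
    # re.escape keeps characters such as '.' or '+' in a bad word literal.
    alternatives = "|".join(re.escape(w) for w in bad_words)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

pattern = compile_bad_word_pattern(["badword", "badword1"])
match = pattern.search("at the bottom of a well....badword...")
print(match.group(0) if match else None)  # badword
```

Compiling the pattern once and reusing it for every text keeps the per-text cost to a single scan of the string.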
I would use the filter function:
filter(lambda s: s in bad_words_list, huge_string.split())
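In Python 3, filter returns a lazy iterator rather than a list, so the result has to be consumed before anything is checked. A small sketch, assuming bad_words is stored as a set and using sample values chosen only for illustration:

```python
bad_words = {"badword", "badword1"}          # sample data, for illustration
huge_string = "some text with badword or not"

# All bad words that appear as whole words, in order of appearance.
found = list(filter(lambda w: w in bad_words, huge_string.split()))

# Only the first hit: next() pulls one item and stops, None if there is none.
first = next(filter(lambda w: w in bad_words, huge_string.split()), None)

print(found)
print(first)
```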
There is already a library for that, better_profanity:
from better_profanity import profanity
print(profanity.censor("YOUR_TEXT", "#"))