python – efficient way of checking if part of string is in the list

Question:

I have a huge string like:

The Dormouse’s story. Once upon a time there were three little
sisters; and their names were Elsie, Lacie and Tillie; and they lived
at the bottom of a well….badword…

and I have a list of around 400 bad words:

bad_words = ["badword", "badword1", ....]

What is the most efficient way to check whether the text contains a bad word from the bad_words list?

I could loop over both the text and the list like this:

for word in huge_string.split():  # iterate over words, not characters
    for bw in bad_words_list:
        if bw in word:
            print("bad word is inside text")

but this feels like something from the ’90s.

Update: bad words are single words.

Asked By: doniyor


Answers:

There is no need to split the text into words. You can check directly whether one string is contained in another, e.g.:

In [1]: 'bad word' in 'do not say bad words!'
Out[1]: True

So you can just do:

for bad_word in bad_words_list:
    if bad_word in huge_string:
        print("BAD!!")
Answered By: LeartS

Turning your text into a set of words and computing its intersection with the set of bad words gives you average constant-time lookups per word, so the whole check takes time roughly proportional to the size of the text:

text  = "The Dormouse's story. Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well....badword..."

badwords = set(["badword", "badword1", ....])

textwords = set(text.split())
for badword in badwords.intersection(textwords):
    print("The bad word '{}' was found in the text".format(badword))
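
One caveat with whitespace splitting: in a text like the one above, "well....badword..." stays a single token, so the intersection comes back empty. A minimal sketch of one possible workaround, assuming that extracting runs of word characters with a regular expression and lowercasing them suits your data (the shortened text and the \w+ pattern are illustrative choices, not part of the original answer):

```python
import re

text = "they lived at the bottom of a well....badword..."
badwords = {"badword", "badword1"}

# text.split() leaves "well....badword..." as one token, so the plain
# set intersection finds nothing here.
print(badwords & set(text.split()))

# Extracting runs of word characters separates "well" from "badword".
textwords = set(re.findall(r"\w+", text.lower()))
print(badwords & textwords)
```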
Answered By: inspectorG4dget

Something like:

st = set(s.split())

bad_words = ["badword", "badword1"]
any(bad in st for bad in bad_words)

Or if you want the words:

st = set(s.split())

bad_words = {"badword", "badword1"}
print(st.intersection(bad_words))

If a bad word can be followed by punctuation, for example a sentence that ends in "badword." or "badword!", the set method will fail, because the token no longer equals the bad word. In that case you have to go over each word in the string and check whether any bad word equals the word or is a substring of it.

st = s.split()
any(bad in word for word in st for bad in bad_words)
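
If whole-word matching is what you want, one way to keep the fast set lookup despite trailing punctuation is to normalize each token first. A minimal sketch, assuming that stripping ASCII punctuation from both ends of a token and lowercasing it is acceptable (find_bad_words is a hypothetical helper name):

```python
import string

def find_bad_words(text, bad_words):
    """Return the set of bad words that appear as whole words in text."""
    # Strip punctuation from both ends of each whitespace-separated token
    # and lowercase it, so "badword." and "Badword!" both become "badword".
    tokens = {tok.strip(string.punctuation).lower() for tok in text.split()}
    return tokens & set(bad_words)

s = "They lived at the bottom of a well. What a badword! Not badword1?"
print(find_bad_words(s, {"badword", "badword1", "other"}))
```

Tokens that are glued together without whitespace, such as well....badword, are still missed by this approach.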
Answered By: Padraic Cunningham

You can use any.

To test whether any bad word occurs as a substring (so it also matches as a prefix or suffix of a longer word):

>>> bad_words = ["badword", "badword1"]
>>> text ="some text with badwords or not"
>>> any(i in text for i in bad_words)
True
>>> text ="some text with words or not"
>>> any(i in text for i in bad_words)
False

This checks whether any item of bad_words occurs in text as a substring.

To test exact matches:

>>> text ="some text with badwords or not"
>>> any(i in text.split() for i in bad_words)
False
>>> text ="some text with badword or not"
>>> any(i in text.split() for i in bad_words)
True

This checks whether any item of bad_words is in text.split(), that is, whether it exactly matches a whitespace-separated word.

Answered By: fredtantini

s is the long string. Use the & operator or the set.intersection method.

In [123]: set(s.split()) & set(bad_words)
Out[123]: {'badword'}

In [124]: bool(set(s.split()) & set(bad_words))
Out[124]: True

Or, even better, use set.isdisjoint.
This will short-circuit as soon as a match is found.

In [127]: bad_words = set(bad_words)

In [128]: not bad_words.isdisjoint(s.split())
Out[128]: True

In [129]: not bad_words.isdisjoint('for bar spam'.split())
Out[129]: False
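
Note that s.split() still builds the full list of words before isdisjoint sees any of them, so the short circuit only saves lookups. A sketch, assuming a regex tokenizer is acceptable for your text, that passes a generator instead so that tokenizing can also stop early (iter_words is a hypothetical helper, and stopping at the first hit is how CPython handles a non-set argument as far as I know):

```python
import re

def iter_words(text):
    """Yield the lowercase words of text one at a time."""
    for match in re.finditer(r"\w+", text):
        yield match.group().lower()

bad_words = {"badword", "badword1"}
s = "some text with a badword followed by much more text"

# isdisjoint accepts any iterable, not only sets.
print(not bad_words.isdisjoint(iter_words(s)))
print(not bad_words.isdisjoint(iter_words("for bar spam")))
```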
Answered By: Vishnu Upadhyay

s = " a string with bad word"
text = s.split()

if any(bad_word in text for bad_word in ('bad', 'bad2')):
    print("bad word found")
Answered By: A.Kareem

On top of all the excellent answers, the "for now, whole words" clause in your comment points in the direction of regular expressions.

You may want to build a composed expression like bad|otherbad|yetanother:

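import re

r = re.compile("|".join(map(re.escape, badwords)))  # escape regex metacharacters
r.search(text)  # a match object, or None if no bad word occurs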
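
The alternation on its own still matches inside longer words. A minimal sketch of a whole-word variant, assuming the bad words consist only of word characters so that \b boundaries behave as expected (the sample words and text are made up for illustration):

```python
import re

badwords = ["badword", "badword1"]

# re.escape protects against regex metacharacters inside the words;
# \b on both sides restricts matches to whole words.
pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, badwords)) + r")\b",
    re.IGNORECASE,
)

text = "at the bottom of a well....Badword... but not badwords"
print(pattern.findall(text))
```

Unlike splitting on whitespace, this finds the word even when it is surrounded by punctuation, and it skips "badwords" because no word boundary follows "badword" there.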
Answered By: xtofl

I would use the filter function:

list(filter(lambda s: s in bad_words_list, huge_string.split()))
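
In Python 3, filter returns a lazy iterator rather than a list, and each membership test against a list scans it linearly. A sketch of the same idea as a list comprehension over a set, assuming exact matches on whitespace-separated words are what you want (the sample data is made up):

```python
bad_words = {"badword", "badword1"}  # a set gives average O(1) lookups
huge_string = "some text with badword and badword1 and badword again"

# Keep every word of the text that is a bad word, in order of appearance.
found = [word for word in huge_string.split() if word in bad_words]
print(found)
```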
Answered By: Riccardo

There is already a library for that:

from better_profanity import profanity
print(profanity.censor("YOUR_TEXT", "#"))
Answered By: Andrew Hernandez