Removing list of words from a string

Question:

I have a list of stopwords. And I have a search string. I want to remove the words from the string.

As an example:

stopwords=['what','who','is','a','at','is','he']
query='What is hello'

Now the code should strip ‘What’ and ‘is’. However in my case it strips ‘a’, as well as ‘at’. I have given my code below. What could I be doing wrong?

for word in stopwords:
    if word in query:
        print(word)
        query = query.replace(word, "")

If the input query is “What is Hello”, I get the output as:
wht s llo

Why does this happen?

Asked By: Rohit Shinde


Answers:

Building on what karthikr said, try:

' '.join(filter(lambda x: x.lower() not in stopwords,  query.split()))

Explanation:

query.split() # splits query on whitespace, e.g. "What is hello" -> ["What", "is", "hello"]

filter(func, iterable) # takes a function and an iterable (list/string/etc.) and
                       # keeps only the items for which the function, called on
                       # one item at a time, returns True

lambda x: x.lower() not in stopwords   # anonymous function that takes a word,
                                       # converts it to lower case, and returns True if
                                       # the word is not in the iterable stopwords


' '.join(iterable) # joins all items of the iterable (items must be strings)
                   # using the string in front of the dot, here ' ', as the separator,
                   # e.g. ["What", "is", "hello"] -> "What is hello"
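
To make the three steps concrete, here is a minimal Python 3 sketch that runs each stage separately on the question's data. The intermediate names (words, kept, result) are only for illustration. In Python 3, filter() returns a lazy iterator, so it is wrapped in list() here to make the intermediate result visible.

```python
stopwords = ['what', 'who', 'is', 'a', 'at', 'is', 'he']
query = 'What is hello'

# Step 1: split the query into words on whitespace.
words = query.split()

# Step 2: keep only the words whose lower-case form is not a stopword.
kept = list(filter(lambda x: x.lower() not in stopwords, words))

# Step 3: glue the surviving words back together with single spaces.
result = ' '.join(kept)

print(words)   # ['What', 'is', 'hello']
print(kept)    # ['hello']
print(result)  # hello
```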
Answered By: pseudonym

This is one way to do it:

query = 'What is hello'
stopwords = ['what', 'who', 'is', 'a', 'at', 'is', 'he']
querywords = query.split()

resultwords  = [word for word in querywords if word.lower() not in stopwords]
result = ' '.join(resultwords)

print(result)

I noticed that you want to also remove a word if its lower-case variant is in the list, so I’ve added a call to lower() in the condition check.

Answered By: Robby Cornelissen

Looking at the other answers to your question I noticed that they told you how to do what you are trying to do, but they did not answer the question you posed at the end.

If the input query is “What is Hello”, I get the output as:

wht s llo

Why does this happen?

This happens because .replace() removes every occurrence of the substring you give it, wherever it appears, including inside other words. It has no notion of word boundaries.

For example:

"My, my! Hello my friendly mystery".replace("my", "")

gives:

>>> "My, ! Hello  friendly stery"

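The same effect can be traced on the question's own data. The sketch below reruns the loop from the question in Python 3 and records the query after each replacement; the steps list is added only for illustration. With the stopword list exactly as posted, the output differs slightly from the one quoted in the question, but the pattern is the same: 'a' and 'he' are removed from inside 'What' and 'hello'.

```python
stopwords = ['what', 'who', 'is', 'a', 'at', 'is', 'he']
query = 'What is hello'

steps = []  # (stopword, query after removing it), for inspection only
for word in stopwords:
    # `in` on strings is a substring test, not a whole-word test.
    if word in query:
        query = query.replace(word, "")
        steps.append((word, query))
        print(repr(word), '->', repr(query))

# 'what' never matches because the comparison is case-sensitive,
# while 'a' and 'he' match inside 'What' and 'hello'.
print(repr(query))  # 'Wht  llo'
```
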
.replace() is essentially splitting the string by the substring given as the first parameter and joining it back together with the second parameter.

"hello".replace("he", "je")

is logically similar to:

"je".join("hello".split("he"))

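A quick way to convince yourself of this analogy is to compare both forms on a few inputs. This is only a sketch: the samples list is made up for illustration, and the analogy holds for a non-empty substring only, since str.split raises ValueError when given an empty separator.

```python
# (text, old, new) triples; `old` must be non-empty for split() to work.
samples = [
    ("hello", "he", "je"),
    ("My, my! Hello my friendly mystery", "my", ""),
    ("What is hello", "a", ""),
]

comparisons = []
for text, old, new in samples:
    via_replace = text.replace(old, new)
    via_split_join = new.join(text.split(old))
    comparisons.append(via_replace == via_split_join)
    print(repr(via_replace), repr(via_split_join))

print(all(comparisons))  # True
```
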
If you still want to use .replace() to remove whole words, you might think that adding a space before and after the word would be enough, but this misses words at the beginning and end of the string, as well as occurrences followed or preceded by punctuation.

"My, my! hello my friendly mystery".replace(" my ", " ")
>>> "My, my! hello friendly mystery"

"My, my! hello my friendly mystery".replace(" my", "")
>>> "My,! hello friendlystery"

"My, my! hello my friendly mystery".replace("my ", "")
>>> "My, my! hello friendly mystery"

Additionally, adding spaces before and after will not catch consecutive duplicates. Matches cannot overlap, and the space after the first match is consumed by that match, so the second occurrence no longer has a leading space to match on:

"hello my my friend".replace(" my ", " ")
>>> "hello my friend"

For these reasons, the accepted answer by Robby Cornelissen is the recommended way to do what you want.
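
If a replace-style approach is still preferred, one alternative (not taken from any of the answers here) is a regular expression with \b word-boundary anchors, which avoids the edge cases listed above. The following is a rough sketch; remove_words is a hypothetical helper name, and it assumes the stopwords consist of ordinary word characters.

```python
import re

def remove_words(text, words):
    """Remove whole-word, case-insensitive occurrences of `words` from `text`.

    Hypothetical helper, written only for this sketch.
    """
    # re.escape guards against words that contain regex metacharacters.
    alternatives = '|'.join(re.escape(w) for w in words)
    # \b matches only at a word boundary, so 'he' does not match inside 'hello'.
    pattern = re.compile(r'\b(?:' + alternatives + r')\b', re.IGNORECASE)
    stripped = pattern.sub('', text)
    # Collapse the runs of spaces left behind by the removed words.
    return ' '.join(stripped.split())

stopwords = ['what', 'who', 'is', 'a', 'at', 'is', 'he']
print(remove_words('What is hello', stopwords))    # hello
print(remove_words('hello my my friend', ['my']))  # hello friend
```

Punctuation that was attached to a removed word is left in place, so further cleanup may be needed depending on the input.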

Answered By: B.Adler

The accepted answer works when the words are separated only by spaces, but real text often has punctuation between words. In that case re.split is needed.

Also, testing against stopwords as a set makes lookup faster (although with only a few words, the cost of hashing each string can offset the gain).
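
As a rough illustration of the set-versus-list point, the sketch below checks that both containers give the same filtering result and then times a membership test with timeit. The probe word 'zebra' and the repetition count are arbitrary choices, and the timings depend on the machine, so no particular numbers are claimed.

```python
import timeit

stopword_list = ['what', 'who', 'is', 'a', 'at', 'is', 'he']
stopword_set = set(stopword_list)  # the duplicate 'is' collapses to one entry

words = 'What is hello'.split()
kept_with_list = [w for w in words if w.lower() not in stopword_list]
kept_with_set = [w for w in words if w.lower() not in stopword_set]
print(kept_with_list == kept_with_set)  # True

# Worst case for the list: the probe word is absent, so every item is compared.
t_list = timeit.timeit(lambda: 'zebra' in stopword_list, number=100_000)
t_set = timeit.timeit(lambda: 'zebra' in stopword_set, number=100_000)
print('list: %.4fs  set: %.4fs' % (t_list, t_set))
```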

My proposal:

import re

query = 'What is hello? Says Who?'
stopwords = {'what','who','is','a','at','is','he'}

resultwords  = [word for word in re.split(r"\W+",query) if word.lower() not in stopwords]
print(resultwords)

output (as list of words):

['hello','Says','']

There’s an empty string at the end, because re.split returns an empty field when the string ends with a separator, and that needs filtering out. Two solutions here:

resultwords  = [word for word in re.split(r"\W+",query) if word and word.lower() not in stopwords]  # filter out empty words

or add empty string to the list of stopwords 🙂

stopwords = {'what','who','is','a','at','is','he',''}

now the code prints:

['hello','Says']
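
Another way to avoid the empty strings, offered here only as an alternative sketch and not as part of the proposal above, is to extract the words with re.findall instead of splitting on the separators. Since \w+ matches runs of word characters, it can never produce an empty match.

```python
import re

query = 'What is hello? Says Who?'
stopwords = {'what', 'who', 'is', 'a', 'at', 'he'}

# Extract runs of word characters rather than splitting on non-word characters.
words = re.findall(r"\w+", query)
resultwords = [word for word in words if word.lower() not in stopwords]

print(words)        # ['What', 'is', 'hello', 'Says', 'Who']
print(resultwords)  # ['hello', 'Says']
```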

stopwords = ['for', 'or', 'to']
p = 'Asking for help, clarification, or responding to other answers.'
for word in stopwords:
    # str.replace removes every occurrence of the substring, even inside longer words
    p = p.replace(word, '')
print(p)
Answered By: user14155892

" ".join([x for x in query.split() if x.lower() not in stopwords])
Answered By: Vito Gentile