How to remove a list of words from a text ONLY IF it is a whole word, not a part of a word

Question:

I have a list of words that I want to remove from a given text. With my limited Python knowledge, I tried replacing each word in the list with an empty string in a loop. It works, but it replaces every matching substring, even when the match is only part of a longer word. Please see the code and output below:

word_list = {'the', 'mind', 'pen'}
def remove_w(text):
  for word in word_list:
    text = text.replace(word, '')
  return text
remove_w('A pencil is over a thermometer with mind itself.')

The output is:

'A cil is over a rmometer with  itself.'

It removed parts of some words. What I actually wanted is the following output:

A pencil is over a thermometer with itself.

How can I remove such a list of words from a text only when each one appears as a whole word, not as part of another word? (I will use this on large articles, so please suggest a fast approach.) Thank you.

Asked By: Rahat Ahmed


Answers:

You can use a regular expression with word boundaries.

import re

pattern = re.compile('|'.join(rf'\b{re.escape(w)}\b' for w in word_list))
def remove_w(text):
    return pattern.sub('', text)
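As a minimal sketch of how the word-boundary pattern behaves on the text from the question: removing a word with \b leaves the spaces on both sides of it in place, so a double space remains. The remove_w_tidy helper is a hypothetical addition (not part of the answer) that collapses the leftover spaces.

```python
import re

word_list = {"the", "mind", "pen"}

# Each word is wrapped in \b so it only matches as a whole word
pattern = re.compile("|".join(rf"\b{re.escape(w)}\b" for w in word_list))

def remove_w(text):
    return pattern.sub("", text)

def remove_w_tidy(text):
    # Hypothetical helper: collapse runs of spaces left behind by the removal
    return re.sub(r" {2,}", " ", remove_w(text)).strip()

text = "A pencil is over a thermometer with mind itself."
print(repr(remove_w(text)))       # 'A pencil is over a thermometer with  itself.'
print(repr(remove_w_tidy(text)))  # 'A pencil is over a thermometer with itself.'
```

Whether the extra cleanup pass is worth it depends on whether the exact spacing of the output matters for your use case.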

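Alternatively, use str.split to separate the text into whitespace-delimited tokens, drop the tokens that exactly match one of the words in the set, then join the rest back together. Note that a token with punctuation attached, such as "mind.", will not match and is kept.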

def remove_w(text):
    return ' '.join(w for w in text.split() if w not in word_list)
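To illustrate the split-based approach and its handling of punctuation, here is a small sketch. The remove_w_split_punct variant is a hypothetical extension, not part of the answer: it strips leading and trailing punctuation from each token before the lookup, at the cost of dropping that punctuation together with the removed word.

```python
import string

word_list = {"the", "mind", "pen"}

def remove_w_split(text):
    # Keep only the tokens that are not exactly in the set
    return " ".join(w for w in text.split() if w not in word_list)

def remove_w_split_punct(text):
    # Hypothetical variant: compare each token with its surrounding
    # punctuation stripped, so "pen," and "mind." are also removed
    return " ".join(
        w for w in text.split()
        if w.strip(string.punctuation) not in word_list
    )

sample = "A pen, the pencil and mind."
print(remove_w_split(sample))        # A pen, pencil and mind.
print(remove_w_split_punct(sample))  # A pencil and
```

Both versions also normalize all whitespace (including newlines) to single spaces, which may or may not be acceptable for full articles.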
Answered By: Unmitigated

You can use regular expressions to remove whole words from the text while taking care not to remove parts of other words. In your specific case, you can use the re module to achieve that:

import re

word_list = {"the", "mind", "pen"}
word_pattern = r"(\s?)\b(?:" + "|".join(re.escape(word) for word in word_list) + r")\b"
pattern = re.compile(word_pattern)

def remove_w(text):
    return pattern.sub("", text)

text = "A pencil is over a thermometer with mind itself."
result = remove_w(text)
print(result)

The output will be:

A pencil is over a thermometer with itself.

Explanation:

  1. re.escape(word): Escapes any characters that might have a special meaning in regular expressions, like ., ?, *, etc.
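  2. (\s?): Matches a single whitespace character before the word; the ? makes it optional. This removes the space preceding a deleted word, so no double space is left behind.
  3. '|'.join(...): Joins the words together with the regex OR pattern |.
  4. \b: Matches the empty string, but only at the beginning or end of a word.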
  5. pattern.sub('', text): Replaces the matched words in the text with an empty string.

This approach should scale to large articles: the pattern is compiled once, and each call to pattern.sub scans the text in a single pass rather than once per word, as the str.replace loop does.
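One limitation of the pattern explained above is that it is case-sensitive, so "The" at the start of a sentence would be kept. The following is a sketch of a case-insensitive variant, assuming that is the behavior you want; the re.IGNORECASE flag and the final strip() call are additions that do not appear in the answer's code.

```python
import re

word_list = {"the", "mind", "pen"}

# sorted() only makes the generated pattern deterministic; any order works here
alternation = "|".join(re.escape(w) for w in sorted(word_list))
pattern = re.compile(r"\s?\b(?:" + alternation + r")\b", re.IGNORECASE)

def remove_w(text):
    # strip() drops the leading space left when the first word is removed
    return pattern.sub("", text).strip()

print(remove_w("The pen and the pencil, Mind you."))  # and pencil, you.
```

Compile the pattern once at module level, as shown, and reuse it for every article so that the cost of building the regex is not paid per call.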

Answered By: Marwan Elzainy