How to remove certain words from text while keeping punctuation marks

Question:

I have the following code that removes listed Bangla words from a given text. It removes the listed words successfully, but it fails when a word is directly followed by punctuation. For example, from the input text "বিশ্বের তথা দূষিত না বায়ুর না, শহরের না।", it removes "তথা" and "না" (both are in word_list), but it cannot remove না when punctuation is attached ("না," and "না।"). I want to remove such words as well, while keeping the punctuation. Please see the current and expected output below. Thanks a lot. Punctuation list = [,।?]

word_list = {'নিজের', 'তথা', 'না'}
def remove_w(text):
    return ' '.join(w for w in text.split() if w not in word_list)
remove_w('বিশ্বের তথা দূষিত না বায়ুর না, শহরের না।')

Current Output::: ‘বিশ্বের দূষিত বায়ুর না, শহরের না।’

Expected Output::: ‘বিশ্বের দূষিত বায়ুর, শহরের।’

Asked By: Ishrat Hossain


Answers:

It seems that you need a more complex split operation, such as:

  from re import compile

  reSeparator = compile("[ ,]+")

  if __name__ == "__main__":
      print(reSeparator.split("a,  b,, c "))

(I can't imagine what other punctuation marks apply to that character set.) Otherwise "a," is of course different from "a", so the test w in word_list will not return true. Note that a trailing separator produces an empty element in the result.
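
As a minimal sketch of the note about empty elements, the split result can be filtered before use. The name tokens is only illustrative. Note that a plain split like this discards the separators, so on its own it does not keep the punctuation the question asks for.

```python
import re

# Separator pattern from the answer above: runs of spaces and commas.
separator = re.compile(r"[ ,]+")

parts = separator.split("a,  b,, c ")
print(parts)    # ['a', 'b', 'c', '']

# Drop the empty element produced by the trailing separator.
tokens = [p for p in parts if p]
print(tokens)   # ['a', 'b', 'c']
```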

Answered By: guidot

The following code does what you desire:

import re

word_list = {'নিজের', 'তথা', 'না'}
def remove_w(text):
    result = ''.join(w for w in re.split(r'([ ,।?])', text) if w not in word_list)
    # sanitize the result:
    result = re.sub(r' +([,।?])', r'\1', result)
    result = re.sub(r' +', r' ', result)
    return result

remove_w('বিশ্বের তথা দূষিত না বায়ুর না, শহরের না।')
# 'বিশ্বের দূষিত বায়ুর, শহরের।'

The parentheses within r'([ ,।?])' serve to keep the delimiters in the result:

re.split(pattern, string) [simplified: default arguments omitted]
If capturing parentheses are used in pattern, then the text of all groups in the pattern [is] also returned as part of the resulting list.
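
To illustrate the quoted behavior, here is a small sketch that splits the same fragment with and without capturing parentheses. The variable names are only for illustration, and the danda is written as "\u0964" so that the code point is unambiguous.

```python
import re

DANDA = "\u0964"  # the punctuation mark "।"
text = "না, শহরের না" + DANDA

# Without a capturing group the delimiters are discarded.
without_group = re.split("[ ," + DANDA + "?]", text)

# With a capturing group the delimiters are kept as separate list items.
with_group = re.split("([ ," + DANDA + "?])", text)

print(without_group)
print(with_group)

# Because nothing is lost, joining the pieces restores the original text.
print("".join(with_group) == text)  # True
```

Adjacent delimiters (here the comma followed by a space) produce an empty string between them, which is harmless when the pieces are joined back together.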

Note that we need manual sanitization of the result:

  • spaces before punctuation will be removed
  • multiple successive spaces are merged into a single space
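
The two sanitization steps can be looked at in isolation. The sketch below starts from a hand-written intermediate string of the kind that is left over after words have been dropped (the string raw is an assumption for illustration, not output copied from the answer).

```python
import re

# Illustrative intermediate string: removed words leave doubled spaces
# and spaces in front of punctuation.
raw = "বিশ্বের  দূষিত , শহরের \u0964"

# Step 1: remove spaces that directly precede a punctuation mark.
step1 = re.sub(" +([,\u0964?])", r"\1", raw)

# Step 2: merge runs of spaces into a single space.
step2 = re.sub(" +", " ", step1)

print(repr(step1))
print(repr(step2))
```

The order matters little here, but removing spaces before punctuation first means the second pass only has to deal with spaces between words.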

I would also like to draw other readers' attention to the fact that । is a Bengali punctuation mark (দাঁড়ি, commonly referred to in English as daṇḍa), not an ASCII vertical bar |.
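
The difference between the two characters is easy to check from Python. In the sketch below, pattern_with_bar is a hypothetical mistaken pattern that uses the ASCII bar. Unicode encodes the danda once, at U+0964 under the name DEVANAGARI DANDA, and that code point is shared by Bengali text.

```python
import re
import unicodedata

danda = "\u0964"  # the mark used in the character class above
bar = "|"         # ASCII vertical bar

for ch in (danda, bar):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+0964 DEVANAGARI DANDA
# U+007C VERTICAL LINE

sample = "না" + danda

# Hypothetical mistaken pattern: the bar never matches the danda.
pattern_with_bar = re.compile(r"([ ,|?])")
pattern_with_danda = re.compile("([ ,\u0964?])")

split_bar = pattern_with_bar.split(sample)
split_danda = pattern_with_danda.split(sample)
print(len(split_bar), len(split_danda))  # 1 3
```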

Answered By: Lover of Structure