How to remove certain words from text while keeping punctuation marks
Question:
I have the following code that removes Bangla words from a given text. It removes the listed words successfully, but it fails to remove a word that has punctuation attached. For example, from the input text "বিশ্বের তথা দূষিত না বায়ুর না, শহরের না।", it removes "তথা" and "না" (since they are listed in word_list), but it cannot remove না when it is followed by punctuation ("না," and "না।"). I want to remove such words as well, while keeping the punctuation. Please see the current and expected output below. Thanks a lot. Punctuation list = [,।?]
word_list = {'নিজের', 'তথা', 'না'}
def remove_w(text):
    return ' '.join(w for w in text.split() if w not in word_list)
remove_w('বিশ্বের তথা দূষিত না বায়ুর না, শহরের না।')
Current output: 'বিশ্বের দূষিত বায়ুর না, শহরের না।'
Expected output: 'বিশ্বের দূষিত বায়ুর, শহরের।'
Answers:
It seems that you need a more complex split operation, such as:
from re import compile
reSeparator = compile("[ ,]+")
if __name__ == "__main__":
    print(reSeparator.split("a, b,, c "))
(I can’t tell which other punctuation marks apply to that character set.) Otherwise "a," is of course different from "a", so "a," in word_list will not return True. Note that a trailing separator produces an empty element at the end of the result.
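To illustrate the note about the empty element, here is a minimal sketch using the same separator pattern. The names `separator`, `parts` and `words` are only illustrative. Be aware that this kind of split throws the punctuation away, so on its own it does not give the output the question asks for.

```python
import re

# Same pattern as above: one or more spaces or commas act as a separator.
separator = re.compile(r"[ ,]+")

parts = separator.split("a, b,, c ")
print(parts)  # ['a', 'b', 'c', '']

# The trailing space yields an empty string; filter it out if unwanted.
words = [p for p in parts if p]
print(words)  # ['a', 'b', 'c']
```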
The following code does what you desire:
import re
word_list = {'নিজের', 'তথা', 'না'}
def remove_w(text):
    result = ''.join(w for w in re.split(r'([ ,।?])', text) if w not in word_list)
    # sanitize the result:
    result = re.sub(r' +([,।?])', r'\1', result)
    result = re.sub(r' +', r' ', result)
    return result
remove_w('বিশ্বের তথা দূষিত না বায়ুর না, শহরের না।')
# 'বিশ্বের দূষিত বায়ুর, শহরের।'
The parentheses within r'([ ,।?])' serve to keep the delimiters in the result. From the Python documentation for re.split(pattern, string) [simplified: default arguments omitted]:
"If capturing parentheses are used in pattern, then the text of all groups in the pattern [is] also returned as part of the resulting list."
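The effect of the capturing group can be seen on a small ASCII input. This is only a sketch; the variable names are illustrative.

```python
import re

text = "a, b"

# Without a capturing group the delimiters are discarded.
without_group = re.split(r"[ ,]", text)
print(without_group)  # ['a', '', 'b']

# With a capturing group each delimiter is kept as its own list element.
with_group = re.split(r"([ ,])", text)
print(with_group)  # ['a', ',', '', ' ', 'b']

# Joining the pieces from the capturing split restores the original text.
print("".join(with_group) == text)  # True
```

Because every delimiter survives as a separate element, filtering out the listed words and joining with an empty string leaves the punctuation in place.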
Note that we need manual sanitization of the result:
- spaces before punctuation will be removed
- multiple successive spaces are merged into a single space
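The two sanitization steps can be followed on a made-up intermediate string, which stands in for what is left after words have been filtered out. In this sketch the daṇḍa is written as the escape "\u0964" (assuming that is the code point used in the text), and `PUNCT`, `raw`, `step1` and `step2` are illustrative names.

```python
import re

# Punctuation list from the question: comma, danda (assumed U+0964), question mark.
PUNCT = ",\u0964?"

# Hypothetical leftover text: doubled spaces and spaces before punctuation.
raw = "x  y , z \u0964"

# Step 1: remove spaces that directly precede a punctuation mark.
step1 = re.sub(" +([" + PUNCT + "])", r"\1", raw)
print(step1)  # 'x  y, z' followed by the danda

# Step 2: merge runs of spaces into a single space.
step2 = re.sub(" +", " ", step1)
print(step2)  # 'x y, z' followed by the danda
```

Note the backslash in the replacement r"\1": it refers to the captured punctuation mark, whereas a bare "1" would insert the digit 1.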
I would also like to draw other readers’ attention to the fact that । is a Bengali punctuation mark (দাঁড়ি, in English commonly referred to as daṇḍa), not an ASCII vertical bar |.
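If in doubt, the two characters can be told apart with the standard unicodedata module. This sketch assumes the mark in the text is U+0964, which Unicode encodes in the Devanagari block and which Bengali shares.

```python
import unicodedata

danda = "\u0964"  # assumed code point of the mark used in the Bangla text
bar = "|"         # ASCII vertical bar, U+007C

print(unicodedata.name(danda))  # DEVANAGARI DANDA
print(unicodedata.name(bar))    # VERTICAL LINE
print(danda == bar)             # False
```

A character class written with the vertical bar instead of the daṇḍa would therefore never match the sentence-ending mark in the input.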