My strip() function is not removing

Question:

My intention is to take a large block of text and first convert it to all lower case (which works), then remove the punctuation marks (which does not work), and finally print the frequency of each word. (It prints test. and test as two different things.)

from collections import Counter



text = """
Test. test test. Test Test test. 
""".lower().strip(".")



words = text.split()
counts = Counter(words)
print(counts)

Any help would be appreciated.

Asked By: user7884512


Answers:

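You need .replace('.', '') in place of .strip('.'). strip only removes characters from the two ends of the string, while replace removes every occurrence.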
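replace handles one character at a time. If the text may contain other punctuation as well, one option (not part of the answer above, just a sketch) is str.translate with a table built by str.maketrans. The assumption here is that every character in string.punctuation should be dropped, which also removes apostrophes and hyphens inside words.

```python
import string
from collections import Counter

text = """
Test. test test. Test Test test.
"""

# Assumption: every character in string.punctuation should be deleted.
remove_punct = str.maketrans("", "", string.punctuation)
cleaned = text.lower().translate(remove_punct)

counts = Counter(cleaned.split())
print(counts)  # Counter({'test': 6})
```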

Answered By: zengr

You can split the text into a list and then strip the punctuation from each word, or use roganjosh's suggestion, which is to use .replace('.', ''):

Way 1:

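from collections import Counter

text = "Test. test test. Test Test test."
words = text.lower().split()
# strip the period from both ends of each word
the_list = [w.strip('.') for w in words]
counts = Counter(the_list)  # Counter({'test': 6})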

Note that .strip('.') only removes periods from the start and end of a string, not from the middle, which is why it has to be applied to each word after splitting.
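To make the difference concrete, here is a small sketch comparing strip on the whole string with strip on each word. The variable names are only for illustration.

```python
from collections import Counter

text = "Test. test test. Test Test test."

# strip on the whole string only touches the two ends,
# so the periods in the middle survive.
whole_stripped = text.lower().strip(".")
print(whole_stripped)  # test. test test. test test test

whole_counts = Counter(whole_stripped.split())
print(whole_counts)  # Counter({'test': 4, 'test.': 2})

# strip on each word removes the period from the end of every word.
per_word = [w.strip(".") for w in text.lower().split()]
per_word_counts = Counter(per_word)
print(per_word_counts)  # Counter({'test': 6})
```

In the question the string also starts and ends with a newline, so .strip(".") finds no period at either end and removes nothing at all.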

Way 2:

new_text = text.lower().replace('.', '')
# split into words first; Counter(new_text) would count single characters
counts = Counter(new_text.split())

Answered By: Ajax1234

If all you want is to extract words (for counting or any other reason), use the regular-expression function re.findall (or re.finditer if the texts are large and you don’t want to collect all the matches in memory):

import re
from collections import Counter

text = """
Test. test test. Test Test test.
"""

# Counter({'test': 6})
counts = Counter(re.findall(r"\w+", text.lower()))

Note that this may be trickier with non-ASCII text, and it doesn’t account for, e.g., words-with-dashes.
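As a rough sketch of one way around the dash limitation, the pattern can be extended to allow hyphen-joined groups. The sample text and names below are hypothetical. In Python 3, \w on a str pattern already matches accented letters, but it also matches digits and underscores, so this is only an approximation of a "word".

```python
import re
from collections import Counter

# Hypothetical sample text with accents and a hyphenated word.
sample = "Well-known café. A well-known, naïve test."

# \w+ followed by any number of "-\w+" groups keeps hyphenated words whole.
word_pattern = re.compile(r"\w+(?:-\w+)*")

hyphen_counts = Counter(word_pattern.findall(sample.lower()))
print(hyphen_counts)
```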

Answered By: drdaeman

To remove the punctuation everywhere, you need to work with the text word by word.

strip is a handy function and can remove several different characters at once, but it only removes them from the start and end of the string: it stops at the first character that is not in the set you passed, so punctuation in the middle of the text is left alone.

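words = text.split()
# strip the period from both ends of each word
text_list = [w.strip('.') for w in words]
count = len(text_list)      # total number of words
text = " ".join(text_list)  # the text with the periods removed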

This way you work with each word.
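Since strip accepts a whole set of characters, the same per-word approach can handle more than periods. The following is a minimal sketch that assumes string.punctuation is the set to remove. The sample text is made up for illustration.

```python
import string
from collections import Counter

# Hypothetical sample with several kinds of punctuation.
text = 'Test. test, "test!" Test? (test) test...'

# strip removes any of the given characters from both ends of each word;
# punctuation inside a word (e.g. the apostrophe in "don't") is kept.
words = [w.strip(string.punctuation) for w in text.lower().split()]

# Drop tokens that were nothing but punctuation.
words = [w for w in words if w]

counts = Counter(words)
print(counts)  # Counter({'test': 6})
```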

Hope this helps

Answered By: yatabani