My strip() function is not removing
Question:
My intention is to take a large block of text and convert it to lower case first (which works), then remove the punctuation marks from the text (which does not work), and finally print the frequency of each word. (It counts test. and test as two different words.)
from collections import Counter
text = """
Test. test test. Test Test test.
""".lower().strip(".")
words = text.split()
counts = Counter(words)
print(counts)
Any help would be appreciated.
Answers:
You need .replace('.', '') in place of strip.
You can split the text into a list and then strip the punctuation from each word, or use roganjosh’s suggestion, which is to use .replace('.', ''):
Way 1:
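from collections import Counter
text = "Test. test test. Test Test test."
word = text.lower().split()  # lower-case first, then split on whitespace
the_list = [i.strip('.') for i in word]  # trim periods from the ends of each word
counts = Counter(the_list)  # Counter({'test': 6})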
Note that .strip() only removes characters from the start and end of a string, not from the middle.
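To see why the call in the question had no effect, here is a small sketch. The triple-quoted string starts and ends with a newline, so no period sits at either end of the string when .strip(".") runs. The variable names are only illustrative.

```python
text = """
Test. test test. Test Test test.
""".lower()

# The string begins and ends with "\n", so no "." is at either end.
print(repr(text[0]), repr(text[-1]))

# strip(".") therefore removes nothing.
unchanged = text.strip(".")
print(unchanged == text)  # True

# On a single word, strip trims the period from the ends only.
print("test.".strip("."))   # test
print("te.st.".strip("."))  # te.st
```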
Way 2:
new_text = text.replace('.', '')
counts = Counter(new_text.lower().split())  # split into words; Counter(new_text) alone would count characters
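If the text contains more than periods, one possible approach is str.translate with a table built from string.punctuation, which deletes every ASCII punctuation character in one pass. This is only a sketch and not part of the answer above; the sample text and variable names are assumptions.

```python
import string
from collections import Counter

sample = "Test. test, test! Test; Test test?"

# Map every ASCII punctuation character to None, i.e. delete it.
table = str.maketrans("", "", string.punctuation)
cleaned = sample.lower().translate(table)

counts = Counter(cleaned.split())
print(counts)  # Counter({'test': 6})
```

Be aware that this also deletes apostrophes and hyphens inside words, so "don't" becomes "dont".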
If all you want is to extract words (for counting or any other reason), use the regular expression function re.findall (or re.finditer if the texts are big and you don’t want to collect all the matches in memory):
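import re
from collections import Counter
text = """
Test. test test. Test Test test.
"""
# \w+ matches runs of word characters; lower() merges "Test" and "test"
counts = Counter(re.findall(r"\w+", text.lower()))
# Counter({'test': 6})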
Note this may be trickier with non-ASCII text (and doesn’t account for, e.g., words-with-dashes).
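As an illustration of the dash caveat, the sketch below shows how \w+ splits a hyphenated word, and one possible pattern that keeps it together. The pattern and the sample text are assumptions, not something from the answer; re.finditer is used so the matches are consumed lazily.

```python
import re
from collections import Counter

sample = "Well-known words, well-known WORDS. Test test."

# \w+ splits hyphenated words into separate pieces.
print(re.findall(r"\w+", sample.lower()))

# Assumed pattern: runs of word characters, optionally joined by single hyphens.
pattern = re.compile(r"\w+(?:-\w+)*")

# finditer yields matches one at a time instead of building a full list.
counts = Counter(m.group(0) for m in pattern.finditer(sample.lower()))
print(counts)
```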
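To remove the character from every word, you need to work on the text word by word. strip is a useful function and can remove several different characters at once, but it only trims from the two ends of the string: it stops at the first character (whitespace included) that is not in the set you passed. In your text that character is the newline at the start and end of the triple-quoted string, so nothing is removed.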
word = text.split()
text_list = [i.strip('.') for i in word]
count = len(text_list)
text = " ".join(text_list)
This way you work with each word.
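The same per-word idea can be extended to other punctuation by passing string.punctuation to strip. This is a sketch under that assumption; the sample text is made up. Unlike deleting punctuation everywhere, it keeps characters inside a word, such as the apostrophe in "don't".

```python
import string
from collections import Counter

sample = 'Test. "test" (test), don\'t Don\'t test!'

# strip() takes a set of characters; any of them is trimmed from both ends.
words = [w.strip(string.punctuation) for w in sample.lower().split()]

# Drop tokens that consisted only of punctuation and are now empty.
counts = Counter(w for w in words if w)
print(counts)
```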
Hope this helps