real word count in NLTK

Question:

The NLTK book has a couple of examples of word counts, but they are really token counts rather than word counts. For instance, Chapter 1 (Counting Vocabulary) says that the following gives a word count:

text = nltk.Text(tokens)
len(text)

However, it doesn’t – it gives a word and punctuation count.
How can you get a real word count (ignoring punctuation)?

Similarly, how can you get the average number of characters in a word?
The obvious answer is:

word_average_length = len(string_of_text) / len(text)

However, this would be off because:

  1. len(string_of_text) is a character count, including spaces
  2. len(text) is a token count, excluding spaces but including punctuation marks, which aren’t words.

Am I missing something here? This must be a very common NLP task…

Asked By: Zach


Answers:

Removing Punctuation

Use a regular expression to filter out the punctuation:

import re
from collections import Counter

>>> text = ['this', 'is', 'a', 'sentence', '.']
>>> nonPunct = re.compile('.*[A-Za-z0-9].*')  # must contain a letter or digit
>>> filtered = [w for w in text if nonPunct.match(w)]
>>> counts = Counter(filtered)
>>> counts
Counter({'this': 1, 'a': 1, 'is': 1, 'sentence': 1})
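If what you need is a single word count rather than per-word frequencies, summing the Counter's values (or simply taking `len(filtered)`) gives it:

```python
import re
from collections import Counter

text = ['this', 'is', 'a', 'sentence', '.']
non_punct = re.compile('.*[A-Za-z0-9].*')   # must contain a letter or digit
filtered = [w for w in text if non_punct.match(w)]
counts = Counter(filtered)

word_count = sum(counts.values())  # same as len(filtered)
```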

Average Number of Characters

Sum the lengths of each word. Divide by the number of words.

>>> sum(map(len, filtered)) / len(filtered)
3.75

Or reuse the counts you already computed to avoid some re-computation. This multiplies each word's length by the number of times it was seen, then sums it all up.

>>> sum(len(w) * c for w, c in counts.items()) / len(filtered)
3.75
Answered By: dhg

Tokenization with nltk

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
text = "This is my text. It includes commas, question marks? and other stuff. Also U.S.."
tokens = tokenizer.tokenize(text)

Returns

['This', 'is', 'my', 'text', 'It', 'includes', 'commas', 'question', 'marks', 'and', 'other', 'stuff', 'Also', 'U', 'S']
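Note that the `r'\w+'` pattern splits the abbreviation "U.S." into separate letters. If that matters, a sketch using the standard library's `re.findall` with an extra alternative for dotted abbreviations (the pattern here is an illustration, not an NLTK default):

```python
import re

text = "This is my text. It includes commas, question marks? and other stuff. Also U.S.."

# The first alternative keeps dotted abbreviations ("U.S") together;
# plain \w+ then picks up ordinary words.
tokens = re.findall(r"\w+(?:\.\w+)+|\w+", text)
```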
Answered By: petra

Removing Punctuation (with no regex)

Use the same approach as dhg's answer, but test whether each token is alphanumeric with str.isalnum instead of matching a regex pattern.

from collections import Counter

>>> text = ['this', 'is', 'a', 'sentence', '.']
>>> filtered = [w for w in text if w.isalnum()]
>>> counts = Counter(filtered)
>>> counts
Counter({'this': 1, 'a': 1, 'is': 1, 'sentence': 1})

Advantages:

  • Works better with non-English languages, as "À".isalnum() is True while bool(nonPunct.match("à")) is False (an "à" is not a punctuation mark, at least in French).
  • Does not need to use the re package.
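A quick sketch illustrating the first point (the word list is made up for the example):

```python
import re

non_punct = re.compile('.*[A-Za-z0-9].*')  # the ASCII-only pattern from the regex answer
words = ['voilà', 'à', 'café', '.', '!']

by_regex = [w for w in words if non_punct.match(w)]    # 'à' is dropped
by_isalnum = [w for w in words if w.isalnum()]         # 'à' is kept
```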
Answered By: Adrien Pacifico

Removing punctuation

from string import punctuation

# assumes text is already a list of tokens; this drops single-character
# punctuation tokens (multi-character ones like '...' slip through)
text = [word for word in text if word not in punctuation]

The average number of characters in a word in a text

from collections import Counter
from nltk import word_tokenize

# here text is the raw string; divide total characters by total tokens
word_count = Counter(word_tokenize(text))
sum(len(x) * y for x, y in word_count.items()) / sum(word_count.values())
Answered By: rad15f