Built-in function to get the frequency of one word with spaCy?

Question:

I’m looking for faster alternatives to NLTK for analyzing big corpora and doing basic things like calculating frequencies, PoS tagging, etc. spaCy seems great and easy to use in many ways, but I can’t find any built-in function to count the frequency of a specific word, for example. I’ve looked at the spaCy documentation, but I can’t find a straightforward way to do it. Am I missing something?

What I would like would be the NLTK equivalent of:

tokens.count("word")  # where tokens is the tokenized text in which the word is to be counted

In NLTK, the above code would tell me that in my text, the word “word” appears X number of times.
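
For reference, a minimal NLTK sketch of what I mean (assuming nltk and its punkt tokenizer data are installed; the sample sentence is made up):

from nltk.tokenize import word_tokenize

# tokenize a sample sentence and count one specific word
tokens = word_tokenize("A word here and a word there.")
print(tokens.count("word"))  # prints 2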

Note that I’ve come across the count_by function, but it doesn’t seem to do what I’m looking for.

Asked By: Michael Gauthier


Answers:

The Python standard library includes collections.Counter for this kind of purpose. Let me know whether this answer suits your case.

from collections import Counter

text = "Lorem Ipsum is simply dummy text of the  ...."

freq = Counter(text.split())
print(freq)

>>> Counter({'the': 6, 'Lorem': 4, 'of': 4, 'Ipsum': 3, 'dummy': 2 ...})

print(freq['Lorem'])

>>> 4
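
If you specifically want to stay within spaCy, a minimal sketch (not benchmarked, and assuming the en_core_web_sm model is installed) would be to feed the token texts of a Doc into Counter:

import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")
doc = nlp("word word other word")

# count each token's exact text
freq = Counter(token.text for token in doc)
print(freq["word"])  # prints 3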

Alright, just to give some time reference, I used this script:

import random, timeit
from collections import Counter

def loadWords():
    with open('corpora.txt', 'w') as corpora:
        randWords = ['foo', 'bar', 'life', 'car', 'wrong',
                     'right', 'left', 'plain', 'random', 'the']
        for i in range(100000000):
            corpora.write(randWords[random.randint(0, 9)] + " ")

def countWords():
    with open('corpora.txt', 'r') as corpora:
        content = corpora.read()
        myDict = Counter(content.split())
        print("foo: ", myDict['foo'])

print(timeit.timeit(loadWords, number=1))
print(timeit.timeit(countWords, number=1))

Results:

149.01646934738716
foo: 9998872
18.093295297389773

Still, I am not sure whether this is fast enough for you.

Answered By: BcK

I use spaCy for frequency counts in corpora quite often. This is what I usually do:

import spacy
nlp = spacy.load("en_core_web_sm")

list_of_words = ['run', 'jump', 'catch']

def word_count(string):
    words_counted = 0
    my_string = nlp(string)

    for token in my_string:
        # actual word
        word = token.text
        # lemma
        lemma_word = token.lemma_
        # part of speech
        word_pos = token.pos_
        if lemma_word in list_of_words:
            words_counted += 1
            print(lemma_word)
    return words_counted


sentence = "I ran, jumped, and caught the ball."
words_counted = word_count(sentence)
print(words_counted)


Answered By: Nester

I’m adding this answer because this is the page I found when searching for a solution to this specific problem. I find it an easier solution than the ones provided before, and it only uses spaCy.

As you mentioned, the spaCy Doc object has the built-in method Doc.count_by. From what I understand of your question, it does what you are asking for, though it is not obvious.

It counts the occurrences of a given attribute and returns a dictionary with the attribute’s hash (an integer) as the key and the count as the value.

Solution

First of all, we need to import ORTH from spacy.attrs. ORTH is the exact verbatim text of a token. We also need to load the model and provide a text.

import spacy
from spacy.attrs import ORTH

nlp = spacy.load("en_core_web_sm")

doc = nlp("apple apple orange banana")

Then we create a dictionary of word counts

count_dict = doc.count_by(ORTH)

You could count by other attributes, such as LEMMA; just import the attribute you wish to use, as shown in the sketch below.
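
For example, a sketch of counting by lemma instead, reusing the nlp and doc objects from above (the printed dictionary is shown as an illustrative assumption):

from spacy.attrs import LEMMA

lemma_counts = doc.count_by(LEMMA)
# map the lemma hashes back to readable strings
print({nlp.vocab.strings[key]: value for key, value in lemma_counts.items()})
# e.g. {'apple': 2, 'orange': 1, 'banana': 1}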

If we look at the dictionary, we will see that it contains the hash of the lexeme as the key and the word count as the value.

count_dict

Results:

{8566208034543834098: 2, 2208928596161743350: 1, 2525716904149915114: 1}

We can get the text for the word if we look up the hash in the vocab.

nlp.vocab.strings[8566208034543834098]

Returns

'apple'

With this, we can create a simple function that takes a search word and a count dict created with the Doc.count_by method.

def get_word_count(word, count_dict):
    return count_dict[nlp.vocab.strings[word]]

If we run the function with our search word ‘apple’ and the count dict we created earlier

get_word_count('apple', count_dict)

We get:

2
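
One caveat that is not part of the original answer: if the search word never occurs in the document, the dictionary lookup raises a KeyError. A variant using dict.get with a default of 0 avoids that:

def get_word_count(word, count_dict):
    # return 0 instead of raising a KeyError for words not in the Doc
    return count_dict.get(nlp.vocab.strings[word], 0)

get_word_count('pear', count_dict)  # 0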

https://spacy.io/api/doc#count_by
