Check how many times each word from a unique list appears in a dataset

Question:

I have a list of unique tokens

unique_words 

and a dataset column that has text

dataset['text']

I want to count how many times each element of unique_words appears in my entire text data and display the k most common of those words.

unique_words = ['ab', 'bc', 'cd', 'de']

id    text
1x    ab cd de th sk gl wlqm dhwka oqbdm
p2    de de de lm eh nfkie qhas hof

3 most common words:

'de', 100
'ab', 11
'cd', 5
Asked By: sasha11


Answers:

Method 1: Using pandas

This method uses vectorized str methods to

  • split each string into tokens
  • expand them and stack them into a single Series
  • use value_counts() to get frequency counts
  • filter the index down to unique_words
  • fetch the top k using .head(), since value_counts() already sorts counts in descending order
import pandas as pd

# sample data from the question
dataset = pd.DataFrame({'id': ['1x', 'p2'],
                        'text': ['ab cd de th sk gl wlqm dhwka oqbdm',
                                 'de de de lm eh nfkie qhas hof']})

unique_words = ['ab', 'cd', 'de', 'bc']
counts = dataset['text'].str.split(expand=True).stack().value_counts()
top3 = counts[counts.index.isin(unique_words)].head(3)
top3
de    4
ab    1
cd    1
dtype: int64
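
As a side note, filtering with .isin() silently drops any word that never appears (here, 'bc'). If you also want zero counts reported, a small variation on the same snippet is to reindex instead of filter:

# keep every word in unique_words, filling absent ones with 0
all_counts = counts.reindex(unique_words, fill_value=0).sort_values(ascending=False)
all_counts.head(3)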

Method 2: Using sklearn’s CountVectorizer

You can use CountVectorizer() from sklearn to get token frequencies for the unique_words by setting them as your vocabulary.

Here is a code example for the sample dataset in your question.

  1. Initialize a CountVectorizer with the vocabulary set to unique_words using CountVectorizer(vocabulary=unique_words)
  2. Fit and transform the sentences in the text column using this vectorizer, and then convert it into an array, using cnt.fit_transform(dataset['text']).toarray()
  3. Take the sum of the occurrences of each word in the vocab across the sentences by using mat.sum(0)
  4. Finally, save it as a series, and use .nlargest(3) to get the top k keywords based on the frequency of occurrence across the dataset.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

unique_words = ['ab', 'cd', 'de', 'bc']

# restrict the vectorizer to the given vocabulary
cnt = CountVectorizer(vocabulary=unique_words)

# rows are sentences, columns are vocab words
mat = cnt.fit_transform(dataset['text']).toarray()

# total occurrences of each vocab word across all sentences
tot = mat.sum(0)

top3 = pd.Series(tot, index=unique_words).nlargest(3)
top3
de    4
ab    1
cd    1
dtype: int64
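
One caveat: CountVectorizer's default token_pattern, r"(?u)\b\w\w+\b", only matches tokens of two or more word characters, so any single-character entries in unique_words would silently come back as 0. If that matters for your data, pass a wider pattern:

# widen the token pattern so single-character tokens are counted too
cnt = CountVectorizer(vocabulary=unique_words, token_pattern=r"(?u)\b\w+\b")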

Read more about CountVectorizer in the scikit-learn documentation.


Method 3: Using collections.Counter

  • First, convert the Series of sentences to a list using .tolist()
  • Next, map str.split over this list to break the sentences into tokens, giving a list of token lists
  • Next, use itertools.chain.from_iterable to flatten these lists into a single chain object
  • Then use Counter to get counts for all tokens in this chain object
  • Then use a dict comprehension to keep only the tokens that are in your unique_words list, and convert the result back to a Counter
  • Finally, use counter.most_common(3) to get the top k keys by frequency.
from collections import Counter
from itertools import chain

# flatten all tokenized sentences into one stream and count every token
counter = Counter(chain.from_iterable(map(str.split, dataset['text'].tolist())))

# keep only the words of interest (missing words get a count of 0)
filtered = Counter({word: counter.get(word, 0) for word in unique_words})

top3 = filtered.most_common(3)
top3
[('de', 4), ('ab', 1), ('cd', 1)]
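
If the text data is large and unique_words is small, a variant of the same idea (a minimal sketch, reusing the imports and dataset above) filters tokens before counting, so the Counter never stores the full vocabulary; note that words with zero occurrences then simply don't appear in the result:

vocab = set(unique_words)  # a set gives O(1) membership tests
tokens = chain.from_iterable(map(str.split, dataset['text']))
top3 = Counter(token for token in tokens if token in vocab).most_common(3)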
Answered By: Akshay Sehgal