Check how many times each word from a unique list appears in a dataset

Question:

I have a list of unique tokens

unique_words 

and a dataset column that has text

dataset['text']

I want to count how many times each element of unique_words appears in my entire text data and display the k most common of those words.

unique_words = ['ab', 'bc', 'cd', 'de']

id    text
1x    ab cd de th sk gl wlqm dhwka oqbdm
p2    de de de lm eh nfkie qhas hof

3 most common words:

'de', 100
'ab', 11
'cd', 5
Asked By: sasha11


Answers:

Method 1: Using pandas

This method uses vectorized str methods to

  • split each string into tokens
  • expand them and stack them into a single Series
  • use value_counts() to get frequency counts
  • filter the index down to unique_words
  • fetch the top k using .head(), since value_counts() already sorts counts in descending order
import pandas as pd

# sample data from the question
dataset = pd.DataFrame({'id': ['1x', 'p2'],
                        'text': ['ab cd de th sk gl wlqm dhwka oqbdm',
                                 'de de de lm eh nfkie qhas hof']})

unique_words = ['ab', 'cd', 'de', 'bc']
counts = dataset['text'].str.split(expand=True).stack().value_counts()
top3 = counts[counts.index.isin(unique_words)].head(3)
top3
de    4
ab    1
cd    1
dtype: int64
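
As a side note, filtering with .isin() silently drops any word that never appears (here, 'bc'). If you also want zero counts reported, a small variation on the same snippet is to reindex instead of filter:

# keep every word in unique_words, filling absent ones with 0
all_counts = counts.reindex(unique_words, fill_value=0).sort_values(ascending=False)
all_counts.head(3)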

Method 2: Using sklearn’s CountVectorizer

You can use CountVectorizer() from sklearn to get token frequencies for the unique_words by setting them as your vocabulary.

Here is a code example for the sample dataset in your question.

  1. Initialize a CountVectorizer with the vocabulary set to unique_words using CountVectorizer(vocabulary=unique_words)
  2. Fit and transform the sentences in the text column using this vectorizer, and then convert it into an array, using cnt.fit_transform(dataset['text']).toarray()
  3. Take the sum of the occurrences of each word in the vocab across the sentences by using mat.sum(0)
  4. Finally, save it as a series, and use .nlargest(3) to get the top k keywords based on the frequency of occurrence across the dataset.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

unique_words = ['ab', 'cd', 'de', 'bc']

# restrict the vectorizer to the given vocabulary
cnt = CountVectorizer(vocabulary=unique_words)

# rows are sentences, columns are vocab words
mat = cnt.fit_transform(dataset['text']).toarray()

# total occurrences of each vocab word across all sentences
tot = mat.sum(0)

top3 = pd.Series(tot, index=unique_words).nlargest(3)
top3
de    4
ab    1
cd    1
dtype: int64
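
One caveat: CountVectorizer's default token_pattern, r"(?u)\b\w\w+\b", only matches tokens of two or more word characters, so any single-character entries in unique_words would silently come back as 0. If that matters for your data, pass a wider pattern:

# widen the token pattern so single-character tokens are counted too
cnt = CountVectorizer(vocabulary=unique_words, token_pattern=r"(?u)\b\w+\b")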

Read more about CountVectorizer in the scikit-learn documentation.


Method 3: Using collections.Counter

  • First, convert the Series of sentences to a list using .tolist()
  • Next, map str.split over this list to break the sentences into tokens, giving a list of token lists
  • Next, use itertools.chain.from_iterable to flatten these lists into a single chain object
  • Then use Counter to get counts for all tokens in this chain object
  • Then use a dict comprehension to keep only the tokens that are in your unique_words list, and convert the result back to a Counter
  • Finally, use counter.most_common(3) to get the top k keys by frequency.
from collections import Counter
from itertools import chain

# flatten all tokenized sentences into one stream and count every token
counter = Counter(chain.from_iterable(map(str.split, dataset['text'].tolist())))

# keep only the words of interest (missing words get a count of 0)
filtered = Counter({word: counter.get(word, 0) for word in unique_words})

top3 = filtered.most_common(3)
top3
[('de', 4), ('ab', 1), ('cd', 1)]
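
If the text data is large and unique_words is small, a variant of the same idea (a minimal sketch, reusing the imports and dataset above) filters tokens before counting, so the Counter never stores the full vocabulary; note that words with zero occurrences then simply don't appear in the result:

vocab = set(unique_words)  # a set gives O(1) membership tests
tokens = chain.from_iterable(map(str.split, dataset['text']))
top3 = Counter(token for token in tokens if token in vocab).most_common(3)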
Answered By: Akshay Sehgal