Calculate TF-IDF using sklearn for variable n-grams in Python


I am trying to use scikit-learn to count the hits of variable n-grams from a particular vocabulary.

I got examples from here.

Imagine I have a corpus and want to count how many hits a vocabulary like the following one has:

myvocabulary = [dict(window=4, words=['tin', 'tan']),
                dict(window=3, words=['electrical', 'car']),
                dict(window=3, words=['elephant', 'banana'])]

What I call window here is the length of the span of words within which the words must appear, as follows:

‘tin tan’ is a hit (within 4 words)

‘tin dog tan’ is a hit (within 4 words)

‘tin dog cat tan’ is a hit (within 4 words)

‘tin car sun eclipse tan’ is NOT a hit: tin and tan are more than 4 words apart.
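
The rule in these examples can be pinned down with a small sketch. It assumes lowercased, whitespace-separated tokens and reads window as the number of words in the span from one word to the other, inclusive. The function name within_window is hypothetical, not part of any library.

```python
def within_window(text, word1, word2, window):
    """Return True if word1 and word2 occur within a span of `window` words."""
    tokens = text.lower().split()
    pos1 = [i for i, t in enumerate(tokens) if t == word1]
    pos2 = [i for i, t in enumerate(tokens) if t == word2]
    # A span of `window` words means the positions differ by at most window - 1.
    return any(abs(i - j) <= window - 1 for i in pos1 for j in pos2)


phrases = ["tin tan", "tin dog tan", "tin dog cat tan", "tin car sun eclipse tan"]
for phrase in phrases:
    print(phrase, "->", within_window(phrase, "tin", "tan", window=4))
```

Under that reading, the first three phrases are hits and the last one is not.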

I just want to count how many times (window=4, words=[‘tin’, ‘tan’]) appears in a text, do the same for all the other entries, and then add the results to a pandas DataFrame in order to calculate TF-IDF.
I could only find something like this:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english')
tfs = tfidf.fit_transform(corpus.values())

where vocabulary is a plain list of strings, each being a single word or several words.

Besides, this from the scikit-learn documentation:

class sklearn.feature_extraction.text.CountVectorizer
ngram_range : tuple (min_n, max_n)

The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

does not help either.
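
A minimal sketch of why ngram_range falls short here, using a two-document toy corpus that is assumed for illustration: n-grams are built from adjacent tokens only, so a word in between breaks the match.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["tin tan", "tin dog tan"]  # toy corpus, for illustration only

# Restrict the vocabulary to the bigram 'tin tan'; ngram_range must include n=2.
vectorizer = CountVectorizer(vocabulary=["tin tan"], ngram_range=(1, 2))
counts = vectorizer.fit_transform(docs).toarray()

# Row 0 counts the bigram once; row 1 counts it zero times,
# because 'dog' sits between 'tin' and 'tan'.
print(counts)
```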

Any ideas?

Asked By: JFerro



I am not sure if this can be done using CountVectorizer or TfidfVectorizer. I have written my own function for doing this as follows:

import pandas as pd
import numpy as np
import string

def contained_within_window(token, word1, word2, threshold):
  """Count (word1, word2) position pairs at most `threshold` words apart."""
  word1 = word1.lower()
  word2 = word2.lower()
  # Strip punctuation and lowercase so that 'document.' matches 'document'
  token = token.translate(str.maketrans('', '', string.punctuation)).lower()
  if (word1 in token) and (word2 in token):
      word_list = token.split(" ")
      word1_index = [i for i, x in enumerate(word_list) if x == word1]
      word2_index = [i for i, x in enumerate(word_list) if x == word2]
      count = 0
      # Compare every position of word1 with every position of word2
      for i in word1_index:
        for j in word2_index:
          if np.abs(i-j) <= threshold:
            count += 1
      return count
  return 0


corpus = [
    'This is the first document. And this is what I want',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'I like coding in sklearn',
    'This is a very good question'
]

df = pd.DataFrame(corpus, columns=["Test"])

Your df will look like this:

0   This is the first document. And this is what I...
1   This document is the second document.
2   And this is the third one.
3   Is this the first document?
4   I like coding in sklearn
5   This is a very good question

Now you can apply contained_within_window as follows:

sum(df.Test.apply(lambda x: contained_within_window(x,word1="this", word2="document",threshold=2)))

And you get:

2

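You can just run a for loop to check the different vocabulary entries. Note that threshold is the maximum difference between word positions, so a window of 4 words corresponds to threshold=3.
You can then use the counts to construct your pandas df and apply TF-IDF on it, which is straightforward.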
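
A minimal sketch of that last step, under some assumptions: the vocabulary and documents below are made up for illustration, count_pairs is a hypothetical compact helper that takes the window as a number of words, and scikit-learn's TfidfTransformer is one way to turn a raw count matrix into TF-IDF weights.

```python
import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer


def count_pairs(text, word1, word2, window):
    """Count (word1, word2) occurrences that fit in a span of `window` words."""
    tokens = re.findall(r"\w+", text.lower())
    pos1 = [i for i, t in enumerate(tokens) if t == word1]
    pos2 = [i for i, t in enumerate(tokens) if t == word2]
    # A span of `window` words allows an index gap of at most window - 1.
    return sum(1 for i in pos1 for j in pos2 if abs(i - j) < window)


# Hypothetical vocabulary and corpus, for illustration only.
vocabulary = [
    (4, ("tin", "tan")),
    (3, ("electrical", "car")),
    (3, ("elephant", "banana")),
]
docs = [
    "tin dog tan and an electrical car",
    "the elephant likes banana",
    "tin car sun eclipse tan",
]

# One row per document, one column per vocabulary entry.
counts = pd.DataFrame(
    {
        f"{w1}_{w2}_w{window}": [count_pairs(d, w1, w2, window) for d in docs]
        for window, (w1, w2) in vocabulary
    }
)

# TfidfTransformer converts the raw counts into TF-IDF weights.
tfidf = pd.DataFrame(
    TfidfTransformer().fit_transform(counts.to_numpy()).toarray(),
    columns=counts.columns,
)
print(counts)
print(tfidf)
```

A document with no hits at all, like the third one here, simply ends up as a row of zeros.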