# Calculate TF-IDF using sklearn for variable-n-grams in python

## Question:

Problem:
Using scikit-learn to count the hits of variable n-grams from a given vocabulary.

Explanation.
I got examples from here.

Imagine I have a corpus and I want to count how many hits a vocabulary like the following one has:

``````
myvocabulary = [
    {"window": 4, "words": ["tin", "tan"]},
    {"window": 3, "words": ["electrical", "car"]},
    {"window": 3, "words": ["elephant", "banana"]},
]
``````

What I call the window here is the length of the span of words within which the words can appear, as follows:

‘tin tan’ is a hit (within 4 words)

‘tin dog tan’ is a hit (within 4 words)

‘tin dog cat tan’ is a hit (within 4 words)

‘tin car sun eclipse tan’ is NOT a hit: the span from tin to tan covers 5 words, which is more than the window of 4.

I just want to count how many times the entry with window=4 and words=['tin', 'tan'] appears in a text, do the same for all the other entries, and then add the results to a pandas DataFrame in order to compute TF-IDF.
I could only find something like this:

``````
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(vocabulary=myvocabulary, stop_words='english')
tfs = tfidf.fit_transform(corpus.values())
``````

where `vocabulary` is a simple list of strings, each being a single word or several words.

Besides, this parameter from the scikit-learn documentation:

``````
class sklearn.feature_extraction.text.CountVectorizer
ngram_range : tuple (min_n, max_n)
``````

The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

does not help either, since it only produces n-grams of adjacent words.
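To illustrate why `ngram_range` is not enough here, the following is a minimal sketch using a made-up one-sentence document (an assumption for illustration only). `CountVectorizer` builds n-grams from adjacent tokens, so a pair of words separated by another word never shows up as a feature:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical one-document corpus, chosen only to illustrate the point.
docs = ["tin dog tan"]

# Extract all unigrams, bigrams and trigrams.
vectorizer = CountVectorizer(ngram_range=(1, 3))
vectorizer.fit(docs)

# vocabulary_ maps each extracted n-gram to its column index.
features = sorted(vectorizer.vocabulary_)
print(features)

# Only contiguous sequences are extracted, so the gapped pair is absent.
print("tin tan" in features)
```

The trigram `tin dog tan` is extracted, but `tin tan` is not, which is why a window-based count needs custom code.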

Any ideas?

## Answer:

I am not sure if this can be done using `CountVectorizer` or `TfidfVectorizer`. I have written my own function for doing this as follows:

``````
import string

import pandas as pd


def contained_within_window(token, word1, word2, threshold):
    """Count (word1, word2) pairs at most `threshold` positions apart in `token`."""
    word1 = word1.lower()
    word2 = word2.lower()
    # Strip punctuation and lowercase so that "document." matches "document".
    token = token.translate(str.maketrans('', '', string.punctuation)).lower()
    # Cheap substring pre-check before splitting the text into words.
    if word1 in token and word2 in token:
        word_list = token.split()
        word1_index = [i for i, x in enumerate(word_list) if x == word1]
        word2_index = [i for i, x in enumerate(word_list) if x == word2]
        count = 0
        # Compare every occurrence of word1 with every occurrence of word2.
        for i in word1_index:
            for j in word2_index:
                if abs(i - j) <= threshold:
                    count += 1
        return count
    return 0
``````

SAMPLE:

``````
corpus = [
    'This is the first document. And this is what I want',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'I like coding in sklearn',
    'This is a very good question'
]

df = pd.DataFrame(corpus, columns=["Test"])
``````

Your `df` will look like this:

``````
    Test
0   This is the first document. And this is what I...
1   This document is the second document.
2   And this is the third one.
3   Is this the first document?
4   I like coding in sklearn
5   This is a very good question
``````

Now you can apply `contained_within_window` as follows:

``````
sum(df.Test.apply(lambda x: contained_within_window(x, word1="this", word2="document", threshold=2)))
``````

And you get:

``````
2
``````
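One thing to watch: `threshold` is compared with the difference between word positions, while the question describes the window as the length of the span of words. Under that reading (an assumption about the intended semantics), a window of 4 corresponds to `threshold=3`. Here is a minimal sketch with a hypothetical helper, `count_window_hits`, which takes the span length directly and is run on the question's four examples:

```python
import string


def count_window_hits(text, word1, word2, window):
    """Count (word1, word2) pairs whose span covers at most `window` words.

    Hypothetical helper: `window` is the span length in words, so two
    adjacent words have a span of 2 and a position difference of 1.
    """
    cleaned = text.translate(str.maketrans("", "", string.punctuation)).lower()
    tokens = cleaned.split()
    first = [i for i, tok in enumerate(tokens) if tok == word1.lower()]
    second = [i for i, tok in enumerate(tokens) if tok == word2.lower()]
    # Span length = position difference + 1.
    return sum(1 for i in first for j in second if abs(i - j) + 1 <= window)


examples = ["tin tan", "tin dog tan", "tin dog cat tan", "tin car sun eclipse tan"]
for sentence in examples:
    print(sentence, "->", count_window_hits(sentence, "tin", "tan", window=4))
```

The first three sentences each give one hit and the last gives none, matching the examples in the question.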

You can just run a `for` loop to check the different vocabulary entries.
You can then use the results to build your pandas `df` and apply TF-IDF to it, which is straightforward.
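As a rough sketch of that last step, the loop below builds one count column per vocabulary entry and passes the resulting matrix to scikit-learn's `TfidfTransformer`. The vocabulary triples, the `word1~word2` column names and the compact `count_pair` helper are assumptions made for illustration; they are not taken from the code above.

```python
import string

import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer


def count_pair(text, word1, word2, threshold):
    """Count (word1, word2) pairs at most `threshold` positions apart."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation)).lower()
    tokens = cleaned.split()
    first = [i for i, tok in enumerate(tokens) if tok == word1]
    second = [i for i, tok in enumerate(tokens) if tok == word2]
    return sum(1 for i in first for j in second if abs(i - j) <= threshold)


corpus = [
    "This is the first document. And this is what I want",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
    "I like coding in sklearn",
    "This is a very good question",
]

# Hypothetical vocabulary: (word1, word2, threshold) triples.
vocabulary = [("this", "document", 2), ("first", "document", 1), ("this", "is", 1)]

# One row per document, one column per vocabulary entry.
counts = pd.DataFrame(
    {
        f"{w1}~{w2}": [count_pair(doc, w1, w2, t) for doc in corpus]
        for w1, w2, t in vocabulary
    }
)

# TfidfTransformer turns a raw count matrix into TF-IDF weights.
tfidf = TfidfTransformer().fit_transform(counts.to_numpy())
tfidf_df = pd.DataFrame(tfidf.toarray(), columns=counts.columns)

print(counts)
print(tfidf_df.round(3))
```

`TfidfTransformer` is used here rather than `TfidfVectorizer` because the counts are already computed, so only the weighting step is needed.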