# Calculate TF-IDF using sklearn for variable n-grams in Python

## Question:

Problem:

Using scikit-learn to count the hits of variable n-grams from a particular vocabulary.

Explanation.

I got examples from here.

Imagine I have a corpus and I want to count how many hits a vocabulary like the following one has:

```
myvocabulary = [
    {'window': 4, 'words': ['tin', 'tan']},
    {'window': 3, 'words': ['electrical', 'car']},
    {'window': 3, 'words': ['elephant', 'banana']},
]
```

What I call the window here is the length of the span of words within which the words can appear, as follows:

'tin tan' is a hit (within 4 words)

'tin dog tan' is a hit (within 4 words)

'tin dog cat tan' is a hit (within 4 words)

'tin car sun eclipse tan' is NOT a hit: tin and tan are more than 4 words apart.

I just want to count how many times the entry with window=4 and words=['tin', 'tan'] appears in a text, do the same for all the other entries, and then add the results to a pandas DataFrame in order to compute TF-IDF.

I could only find something like this:

```
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english')
tfs = tfidf.fit_transform(corpus.values())
```

where `vocabulary` is a simple list of strings, each being a single word or several words.

Besides, from the scikit-learn documentation:

```
class sklearn.feature_extraction.text.CountVectorizer
ngram_range : tuple (min_n, max_n)
```

The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

This does not help either.

Any ideas?

## Answers:

I am not sure whether this can be done using `CountVectorizer` or `TfidfVectorizer`. I have written my own function for doing this as follows:

```
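import string

import numpy as np
import pandas as pd


def contained_within_window(token, word1, word2, threshold):
    """Count (word1, word2) pairs at most `threshold` positions apart in `token`."""
    word1 = word1.lower()
    word2 = word2.lower()
    # Strip punctuation and lowercase the text before splitting it into words.
    token = token.translate(str.maketrans('', '', string.punctuation)).lower()
    # Cheap substring check before doing the positional comparison.
    if word1 in token and word2 in token:
        word_list = token.split(" ")
        word1_index = [i for i, x in enumerate(word_list) if x == word1]
        word2_index = [i for i, x in enumerate(word_list) if x == word2]
        count = 0
        # Compare every occurrence of word1 with every occurrence of word2.
        for i in word1_index:
            for j in word2_index:
                if np.abs(i - j) <= threshold:
                    count = count + 1
        return count
    return 0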
```

SAMPLE:

```
corpus = [
    'This is the first document. And this is what I want',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'I like coding in sklearn',
    'This is a very good question'
]
df = pd.DataFrame(corpus, columns=["Test"])
```

Your `df` will look like this:

```
Test
0 This is the first document. And this is what I...
1 This document is the second document.
2 And this is the third one.
3 Is this the first document?
4 I like coding in sklearn
5 This is a very good question
```

Now you can apply `contained_within_window` as follows:

```
sum(df.Test.apply(lambda x: contained_within_window(x,word1="this", word2="document",threshold=2)))
```

And you get:

```
2
```
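To relate this back to the examples in the question: `threshold` here is a difference between word positions, not a span length. Assuming the question's window of 4 means a span of 4 words, two words fit inside it when their positions differ by at most 3, so `threshold = window - 1`. The following is a minimal sketch of that reading; `count_hits` is a hypothetical helper written for this illustration and uses the same counting rule as `contained_within_window`, without the numpy dependency.

```python
import string


def count_hits(text, word1, word2, window):
    """Count (word1, word2) pairs that fit inside a span of `window` words."""
    # Strip punctuation and lowercase, mirroring contained_within_window.
    cleaned = text.translate(str.maketrans('', '', string.punctuation)).lower()
    words = cleaned.split()
    pos1 = [i for i, w in enumerate(words) if w == word1.lower()]
    pos2 = [i for i, w in enumerate(words) if w == word2.lower()]
    # Assumption: a span of `window` words allows an index gap of window - 1.
    return sum(1 for i in pos1 for j in pos2 if abs(i - j) <= window - 1)


examples = [
    'tin tan',
    'tin dog tan',
    'tin dog cat tan',
    'tin car sun eclipse tan',
]
hits = [count_hits(text, 'tin', 'tan', window=4) for text in examples]
print(hits)  # [1, 1, 1, 0]
```

The first three texts count as hits and the last one does not, which matches the question's description.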

You can just run a `for` loop to check different instances. You can then use the results to construct your pandas `df` and apply TF-IDF to it, which is straightforward.
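Here is one way that loop and the TF-IDF step could look. It is a rough sketch, not the only way to do it: the `vocabulary` triples and the column names are made up for illustration, `count_hits` is a hypothetical compact stand-in for `contained_within_window`, and I am assuming it is acceptable to treat the hit counts as term frequencies and feed them to scikit-learn's `TfidfTransformer`.

```python
import string

import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer


def count_hits(text, word1, word2, threshold):
    """Count (word1, word2) pairs whose positions differ by at most `threshold`."""
    cleaned = text.translate(str.maketrans('', '', string.punctuation)).lower()
    words = cleaned.split()
    pos1 = [i for i, w in enumerate(words) if w == word1.lower()]
    pos2 = [i for i, w in enumerate(words) if w == word2.lower()]
    return sum(1 for i in pos1 for j in pos2 if abs(i - j) <= threshold)


corpus = [
    'This is the first document. And this is what I want',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'I like coding in sklearn',
    'This is a very good question',
]

# Hypothetical vocabulary: (word1, word2, threshold) triples chosen for illustration.
vocabulary = [('this', 'document', 2), ('first', 'document', 3)]

# One row per document, one column per vocabulary entry.
counts = pd.DataFrame({
    f'{w1}|{w2}': [count_hits(doc, w1, w2, thr) for doc in corpus]
    for w1, w2, thr in vocabulary
})

# Treat the hit counts as term frequencies and reweight them with TF-IDF.
tfidf = TfidfTransformer().fit_transform(counts.to_numpy())
tfidf_df = pd.DataFrame(tfidf.toarray(), columns=counts.columns)
print(tfidf_df.round(3))
```

Documents with no hits for any entry end up as all-zero rows, so you may want to drop them before using the result downstream.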