tf-idf

Get data from .pickle

Get data from .pickle Question: I have a MultinomialNB() model: text_clf_NB = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB()), ]) text_clf_NB.fit(Train_X_NB, Train_Y_NB) I save it to .pickle: pickle.dump(text_clf_NB, open("NB_classification.pickle", "wb")) In another session I load this model: clf = pickle.load(open("NB_classification.pickle", "rb")) Can you please help me: how can I get the sparse matrix of Train …

Total answers: 1
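
A minimal, self-contained sketch of one way to get the TF-IDF sparse matrix back out of the pickled pipeline. The training data here (Train_X_NB / Train_Y_NB) is a toy stand-in for the question's data; the step names "vect" and "tfidf" match the pipeline shown in the question.

```python
import pickle
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the question's training data.
Train_X_NB = ["good movie", "bad movie", "great plot", "terrible acting"]
Train_Y_NB = [1, 0, 1, 0]

text_clf_NB = Pipeline([("vect", CountVectorizer()),
                        ("tfidf", TfidfTransformer()),
                        ("clf", MultinomialNB())])
text_clf_NB.fit(Train_X_NB, Train_Y_NB)
pickle.dump(text_clf_NB, open("NB_classification.pickle", "wb"))

# Elsewhere: reload the pipeline and rebuild the TF-IDF sparse matrix from
# the fitted steps that travelled inside the pickle.
clf = pickle.load(open("NB_classification.pickle", "rb"))
counts = clf.named_steps["vect"].transform(Train_X_NB)
X_tfidf = clf.named_steps["tfidf"].transform(counts)
print(X_tfidf.shape)  # sparse (n_documents, n_features) matrix
```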

NotFittedError: The TF-IDF vectorizer is not fitted

NotFittedError: The TF-IDF vectorizer is not fitted Question: I’ve trained a sentiment analysis classifier on TripAdvisor’s textual review datasets. It predicts a review’s rating based on its sentiment. Everything works during training and testing. However, when I loaded the classifier in a new .ipynb file and tried to use a review …

Total answers: 1
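
A small sketch of the usual fix for this NotFittedError: persist the fitted vectorizer together with the classifier (here as a single Pipeline saved with joblib) instead of creating a fresh, unfitted TfidfVectorizer in the new notebook. The reviews, ratings, and LogisticRegression classifier below are made-up stand-ins, not the question's TripAdvisor data or model.

```python
import joblib
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the review texts and their ratings.
reviews = ["lovely hotel, great staff", "dirty room, rude staff", "great view"]
ratings = [5, 1, 4]

model = Pipeline([("tfidf", TfidfVectorizer()),
                  ("clf", LogisticRegression())])
model.fit(reviews, ratings)

# Persist the *fitted* vectorizer together with the classifier.
joblib.dump(model, "sentiment_pipeline.joblib")

# In the new notebook: load the pipeline instead of instantiating a new
# TfidfVectorizer, which would be unfitted and raise NotFittedError.
loaded = joblib.load("sentiment_pipeline.joblib")
print(loaded.predict(["awful experience, never again"]))
```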

Calculate TF-IDF using sklearn for variable-n-grams in python

Calculate TF-IDF using sklearn for variable-n-grams in python Question: Problem: using scikit-learn to find the number of hits of variable n-grams from a particular vocabulary. Explanation: I got the examples from here. Imagine I have a corpus and I want to count how many hits a vocabulary like the following one gets: myvocabulary = [(window=4, …

Total answers: 1
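
A hedged sketch of counting hits for a mixed-length vocabulary with CountVectorizer. The corpus and vocabulary below are made up (the question's own vocabulary entries are truncated in the excerpt); the essential point is that ngram_range has to span the shortest and longest entries, otherwise the longer phrases are never generated and always count as zero.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus and vocabulary with n-grams of different lengths.
corpus = ["tim tam and fresh milk", "fresh milk with chocolates",
          "biscuit pudding and tim tam"]
myvocabulary = ["tim tam", "fresh milk", "biscuit pudding", "chocolates"]

# ngram_range must cover the shortest and the longest vocabulary entry.
vec = CountVectorizer(vocabulary=myvocabulary, ngram_range=(1, 2))
counts = vec.fit_transform(corpus)

# Total number of hits per vocabulary term across the whole corpus.
for term, hits in zip(vec.get_feature_names_out(), counts.sum(axis=0).A1):
    print(term, int(hits))
```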

Why is the value of TF-IDF different from IDF_?

Why is the value of TF-IDF different from IDF_? Question: Why is the value of the vectorized corpus different from the value obtained through the idf_ attribute? Shouldn’t the idf_ attribute just return the inverse document frequency (IDF) in the same way it appears in the vectorized corpus? from sklearn.feature_extraction.text import TfidfVectorizer corpus = …

Total answers: 1
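
A short check of why the two numbers differ: with default settings, TfidfVectorizer multiplies the raw term counts by idf_ and then L2-normalises each row, so the stored values are not the bare IDFs. The corpus below is a toy stand-in for the one in the question.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import normalize

corpus = ["the cat sat", "the dog sat", "the cat ran"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)

# Reproduce X by hand: raw counts * idf_, then L2-normalise each row.
counts = CountVectorizer(vocabulary=tfidf.vocabulary_).fit_transform(corpus)
manual = normalize(counts.multiply(tfidf.idf_), norm="l2")

print(np.allclose(X.toarray(), manual.toarray()))  # True
```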

Is smooth_idf redundant?

Is smooth_idf redundant? Question: The scikit-learn documentation says: If smooth_idf=True (the default), the constant "1" is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(d, t) = log [ (1 + n) / (1 + …

Total answers: 2
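
A quick numeric check of the two formulas from the documentation, using a toy corpus. Smoothing only changes the constant inside the log, but it is what keeps the division safe when a term from a fixed vocabulary never appears in any document; on an ordinary fitted vocabulary every term has df(t) >= 1, which is why the option can look redundant.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat", "the dog sat", "the cat ran"]
n = len(corpus)

# Document frequency of each term, using the same vocabulary ordering.
tfidf = TfidfVectorizer(smooth_idf=True).fit(corpus)
counts = CountVectorizer(vocabulary=tfidf.vocabulary_).fit_transform(corpus)
df = np.asarray((counts > 0).sum(axis=0)).ravel()

# smooth_idf=True (default): idf(t) = ln((1 + n) / (1 + df(t))) + 1
print(np.allclose(tfidf.idf_, np.log((1 + n) / (1 + df)) + 1))   # True

# smooth_idf=False: idf(t) = ln(n / df(t)) + 1; same ranking here, but it
# divides by zero if a fixed-vocabulary term never occurs in the corpus.
rough = TfidfVectorizer(smooth_idf=False).fit(corpus)
print(np.allclose(rough.idf_, np.log(n / df) + 1))                # True
```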

Calculate TF-IDF using sklearn for n-grams in python

Calculate TF-IDF using sklearn for n-grams in python Question: I have a vocabulary list that includes n-grams, as follows. myvocabulary = ['tim tam', 'jam', 'fresh milk', 'chocolates', 'biscuit pudding'] I want to use these words to calculate TF-IDF values. I also have a dictionary corpus as follows (key = recipe number, value = recipe). …

Total answers: 2
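
A sketch under the question's setup (a phrase vocabulary plus a dict corpus keyed by recipe number), with made-up recipe texts. Passing vocabulary= and an ngram_range that covers the bigrams is the essential part.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

myvocabulary = ["tim tam", "jam", "fresh milk", "chocolates", "biscuit pudding"]

# Made-up recipe corpus: key = recipe number, value = recipe text.
corpus = {1: "tim tam with fresh milk", 2: "jam on toast with chocolates",
          3: "biscuit pudding and tim tam"}

# ngram_range has to include the bigrams that appear in the vocabulary.
vec = TfidfVectorizer(vocabulary=myvocabulary, ngram_range=(1, 2))
X = vec.fit_transform(corpus.values())

# One row per recipe, one column per vocabulary n-gram.
for recipe_id, row in zip(corpus.keys(), X.toarray()):
    print(recipe_id, dict(zip(vec.get_feature_names_out(), row.round(2))))
```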

TfidfVectorizer in scikit-learn: ValueError: np.nan is an invalid document

TfidfVectorizer in scikit-learn: ValueError: np.nan is an invalid document Question: I’m using TfidfVectorizer from scikit-learn to do some feature extraction from text data. I have a CSV file with a Score (can be +1 or -1) and a Review (text). I pulled this data into a DataFrame so I can run the Vectorizer. This …

Total answers: 3
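
A minimal reproduce-and-fix sketch, assuming the NaNs come from empty Review cells in the CSV: drop (or fill) the missing reviews and cast the column to str before vectorizing. The column names follow the question; the data below is made up.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the CSV: Score (+1 / -1) and Review text, one review missing.
df = pd.DataFrame({"Score": [1, -1, 1],
                   "Review": ["great product", None, "would buy again"]})

# A missing Review is read as NaN (a float), which TfidfVectorizer rejects
# with "np.nan is an invalid document". Drop or fill those rows first.
df = df.dropna(subset=["Review"])        # or: df["Review"] = df["Review"].fillna("")
df["Review"] = df["Review"].astype(str)  # make sure every entry is a string

X = TfidfVectorizer().fit_transform(df["Review"])
print(X.shape)
```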

Scikit Learn TfidfVectorizer: How to get top n terms with highest tf-idf score

Scikit Learn TfidfVectorizer: How to get top n terms with highest tf-idf score Question: I am working on a keyword extraction problem. Consider the very general case: from sklearn.feature_extraction.text import TfidfVectorizer tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english') t = """Two Travellers, walking in the noonday sun, sought the shade of a widespreading tree to rest. As they …

Total answers: 3
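
A hedged sketch of ranking terms by tf-idf across a corpus. It uses the default tokenizer (the question's tokenize function isn't shown) and takes each term's maximum score over the documents, which is one reasonable reading of "top n terms"; the documents are shortened stand-ins.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Two Travellers, walking in the noonday sun, sought the shade "
        "of a widespreading tree to rest.",
        "A Plane Tree grew by the side of the road."]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)
terms = tfidf.get_feature_names_out()

# Score each term by its highest tf-idf value anywhere in the corpus,
# then print the n best-scoring terms.
scores = X.max(axis=0).toarray().ravel()
n = 5
for i in np.argsort(scores)[::-1][:n]:
    print(terms[i], round(float(scores[i]), 3))
```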

Keep TFIDF result for predicting new content using Scikit for Python

Keep TFIDF result for predicting new content using Scikit for Python Question: I am using sklearn in Python to do some clustering. I’ve trained on 200,000 data points, and the code below works well. corpus = open("token_from_xml.txt") vectorizer = CountVectorizer(decode_error="replace") transformer = TfidfTransformer() tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus)) km = KMeans(30) kmresult = km.fit(tfidf).predict(tfidf) But when I have new testing …

Total answers: 5
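
A sketch of the usual pattern for reusing a fitted TF-IDF setup on new content: save every fitted piece (vectorizer, transformer, and the KMeans model), then call transform, not fit_transform, on the new documents so they land in the same feature space. The corpus below is a toy stand-in and n_clusters is reduced accordingly; joblib is used here instead of plain pickle.

```python
import joblib
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy stand-in for the 200,000-document training corpus.
corpus = ["first training document", "second training document",
          "another document about something else", "more text here"]

vectorizer = CountVectorizer(decode_error="replace")
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
km = KMeans(n_clusters=2, n_init=10).fit(tfidf)

# Persist every fitted piece, not just the clusterer.
joblib.dump((vectorizer, transformer, km), "tfidf_kmeans.joblib")

# Later, on new content: load and call transform() (never fit_transform()).
vectorizer, transformer, km = joblib.load("tfidf_kmeans.joblib")
new_docs = ["a brand new document"]
new_tfidf = transformer.transform(vectorizer.transform(new_docs))
print(km.predict(new_tfidf))
```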

How to see top n entries of term-document matrix after tfidf in scikit-learn

How to see top n entries of term-document matrix after tfidf in scikit-learn Question: I am new to scikit-learn, and I was using TfidfVectorizer to find the tf-idf values of terms in a set of documents. I used the following code to obtain them. vectorizer = TfidfVectorizer(stop_words=u'english', ngram_range=(1,5), lowercase=True) X = vectorizer.fit_transform(lectures) Now if I print …

Total answers: 1
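
A small sketch of listing the top n entries per document: pair each feature name with its tf-idf weight in that document's row of X and sort. The lectures list is a made-up stand-in; get_feature_names_out() is the name in recent scikit-learn releases (older versions use get_feature_names()).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for the `lectures` collection from the question.
lectures = ["machine learning and pattern recognition",
            "deep learning for natural language processing"]

vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 5), lowercase=True)
X = vectorizer.fit_transform(lectures)
terms = vectorizer.get_feature_names_out()

# For each document, pair every term with its tf-idf weight and keep the top n.
n = 3
for doc_id, row in enumerate(X.toarray()):
    ranked = sorted(zip(terms, row), key=lambda pair: pair[1], reverse=True)
    print(doc_id, ranked[:n])
```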