scikit-learn: fitting data into chunks vs fitting it all at once

Question:

I am using scikit-learn to build a classifier that works on (somewhat large) text files. I only need simple bag-of-words features at the moment, so I tried using TfidfVectorizer/HashingVectorizer/CountVectorizer to obtain the feature vectors.

However, processing the entire training data at once to obtain the feature vectors results in a memory error in numpy/scipy (depending on which vectorizer I use).

When extracting features from the raw text: if I fit the vectorizer on the data in chunks, will that be the same as fitting it on the entire data at once?

To illustrate this with code, is the following:

vectoriser = CountVectorizer() # or TfidfVectorizer/HashingVectorizer
train_vectors = vectoriser.fit_transform(train_data)

different from the following:

vectoriser = CountVectorizer() # or TfidfVectorizer/HashingVectorizer


start = 0
while start < len(train_data):
    vectoriser.fit(train_data[start:(start+500)])
    start += 500

train_vectors = vectoriser.transform(train_data)
Asked By: DarkMatter


Answers:

I’m not an expert in text feature extraction, but based on the documentation and my experience with other classifiers:

If I do several fits on chunks of the training data, will that be the
same as fitting the entire data at once?

You can’t directly merge the extracted features, because the same token/word gets a different importance (i.e. weight) in each chunk, in a different proportion to the other words of that chunk, and may even be represented by a different key.

You can use any feature extraction method; the usefulness of the result depends on the task, I think.

But you can use each chunk’s own features to classify the data of that chunk. Once you have several different outputs from features acquired with the same feature extraction method (or even with different extraction methods), you can use them as input to a “merging” mechanism like bagging, boosting, etc.
In most cases this whole process will give you a better final output than feeding the full file to a single, even simple, “full-featured” classifier.
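
A minimal sketch of that idea, assuming a HashingVectorizer so every chunk shares the same feature space; train_data/train_labels are placeholders, and a simple majority vote stands in for bagging/boosting:

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectoriser = HashingVectorizer()  # stateless, so every chunk maps to the same columns

# Train one classifier per 500-document chunk.
chunk_models = []
for start in range(0, len(train_data), 500):
    X_chunk = vectoriser.transform(train_data[start:start + 500])
    y_chunk = train_labels[start:start + 500]
    chunk_models.append(SGDClassifier().fit(X_chunk, y_chunk))

# Merge the per-chunk models by majority vote (assumes integer class labels).
def predict_majority(docs):
    X = vectoriser.transform(docs)
    votes = np.array([model.predict(X) for model in chunk_models])
    return np.array([np.bincount(col).argmax() for col in votes.T])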

Answered By: Geeocode

It depends on the vectorizer you are using.

CountVectorizer counts the occurrences of words in the documents.
For each document it outputs an (n_words, 1) vector with the number of times each word appears in that document. n_words is the total number of distinct words across the documents (aka the size of the vocabulary).
It also fits a vocabulary so that you can introspect the model (see what word is important, etc.). You can have a look at it using vectorizer.get_feature_names().

When you fit it on your first 500 documents, the vocabulary will only be made of the words from those 500 documents. Say there are 30k of them; fit_transform then outputs a 500x30k sparse matrix.
Now you fit_transform again on the next 500 documents, but they contain only 29k distinct words, so you get a 500x29k matrix…
Now, how do you align your matrices to make sure all documents have a consistent representation?
I can’t think of an easy way to do this at the moment.
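
A quick way to see the mismatch (the 30k/29k figures above are only illustrative; train_data is a placeholder for your corpus):

from sklearn.feature_extraction.text import CountVectorizer

vect_a = CountVectorizer().fit(train_data[:500])
vect_b = CountVectorizer().fit(train_data[500:1000])

# Each chunk yields its own vocabulary, so the same column index
# refers to different words in the two resulting matrices.
print(len(vect_a.vocabulary_), len(vect_b.vocabulary_))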

With TfidfVectorizer you have another issue, that is the inverse document frequency: to be able to compute document frequency you need to see all the documents at once.
However a TfidfVectorizer is just a CountVectorizer followed by a TfidfTransformer, so if you manage to get the output of the CountVectorizer right you can then apply a TfidfTransformer to the data.
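
That decomposition looks roughly like this; with default parameters the two routes below produce the same matrix:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]

# One step: TfidfVectorizer
X_direct = TfidfVectorizer().fit_transform(docs)

# Two steps: CountVectorizer followed by TfidfTransformer
counts = CountVectorizer().fit_transform(docs)
X_two_step = TfidfTransformer().fit_transform(counts)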

With HashingVectorizer things are different: there is no vocabulary here.

In [51]: hvect = HashingVectorizer() 
In [52]: hvect.fit_transform(X[:1000])       
<1000x1048576 sparse matrix of type '<class 'numpy.float64'>'
 with 156733 stored elements in Compressed Sparse Row format>   

There are not 1M+ different words in the first 1000 documents, yet the matrix we get has 1M+ columns (the default number of hashing features, 2**20 = 1048576).
The HashingVectorizer does not store the words in memory. This makes it more memory efficient and makes sure that the matrices it returns always have the same number of columns.
So you don’t have the same problem as with the CountVectorizer here.

This is probably the best solution for the batch processing you described. There are a couple of cons, namely that you cannot get the idf weighting and that you do not know the mapping between words and your features.
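
Since HashingVectorizer.transform is stateless, one way to batch the work is to transform chunk by chunk and stack the pieces (a sketch, with train_data and the chunk size as placeholders):

import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer

hvect = HashingVectorizer()

# Every chunk gets the same 2**20 columns, so the pieces stack cleanly.
chunks = [hvect.transform(train_data[start:start + 500])
          for start in range(0, len(train_data), 500)]
train_vectors = sp.vstack(chunks)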

The HashingVectorizer documentation references an example that does out-of-core classification on text data. It may be a bit messy but it does what you’d like to do.
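
The core pattern in that example is incremental learning: an estimator that supports partial_fit is fed one hashed batch at a time. Roughly (iter_batches and the class list are placeholders for your own data loading):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

hvect = HashingVectorizer()
clf = SGDClassifier()

all_classes = [0, 1]  # partial_fit needs the full set of classes up front (placeholder labels)

for batch_docs, batch_labels in iter_batches():  # your own batch generator
    X_batch = hvect.transform(batch_docs)
    clf.partial_fit(X_batch, batch_labels, classes=all_classes)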

Hope this helps.

EDIT:
If you have too much data, HashingVectorizer is the way to go.
If you still want to use CountVectorizer, a possible workaround is to fit the vocabulary yourself and pass it to your vectorizer so that you only need to call transform.

Here’s an example you can adapt:

import re
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

news = fetch_20newsgroups()
X, y = news.data, news.target

Now the approach that does not work:

# Fitting directly:
vect = CountVectorizer()
vect.fit_transform(X[:1000])
# <1000x27953 sparse matrix of type '<class 'numpy.int64'>'
#  with 156751 stored elements in Compressed Sparse Row format>

Note the size of the matrix we get.
Fitting the vocabulary ‘manually’:

def tokenizer(doc):
    # Using the default token pattern from CountVectorizer
    token_pattern = re.compile(r'(?u)\b\w\w+\b')
    return token_pattern.findall(doc)

stop_words = set() # Whatever you want to have as stop words.
vocabulary = set([word for doc in X for word in tokenizer(doc) if word not in stop_words])

vectorizer = CountVectorizer(vocabulary=vocabulary)
X_counts = vectorizer.transform(X[:1000])
# Now X_counts is:
# <1000x155448 sparse matrix of type '<class 'numpy.int64'>'
#   with 149624 stored elements in Compressed Sparse Row format>
tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X_counts)

In your example, you’ll need to first build the entire X_counts matrix (for all documents) before applying the tf-idf transform.
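
Concretely, with the fixed vocabulary you can count in chunks, stack the results, and only then fit the tf-idf weights (a sketch reusing the names from the snippet above, with the chunk size as a placeholder):

import scipy.sparse as sp

# The vocabulary is fixed, so transform can safely be called chunk by chunk.
count_chunks = [vectorizer.transform(X[start:start + 1000])
                for start in range(0, len(X), 1000)]
X_counts_full = sp.vstack(count_chunks)

X_tfidf_full = TfidfTransformer().fit_transform(X_counts_full)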

Answered By: ldirer