Scikit-Learn's Pipeline: A sparse matrix was passed, but dense data is required
Question:
I’m finding it difficult to understand how to fix a Pipeline I created (read: largely pasted from a tutorial). It’s Python 3.4.2:
df = pd.DataFrame
df = DataFrame.from_records(train)
test = [blah1, blah2, blah3]
pipeline = Pipeline([('vectorizer', CountVectorizer()), ('classifier', RandomForestClassifier())])
pipeline.fit(numpy.asarray(df[0]), numpy.asarray(df[1]))
predicted = pipeline.predict(test)
When I run it, I get:
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
This is for the line pipeline.fit(numpy.asarray(df[0]), numpy.asarray(df[1])).
I’ve experimented a lot with solutions through numpy, scipy, and so forth, but I still don’t know how to fix it. And yes, similar questions have come up before, but not inside a pipeline.
Where is it that I have to apply toarray or todense?
Answers:
You can convert a pandas Series to an array using the .values attribute:
pipeline.fit(df[0].values, df[1].values)
However, I think the issue here arises because CountVectorizer() returns a sparse matrix by default, which cannot be piped to the RF classifier. CountVectorizer() does have a dtype parameter, but it sets the element type of the returned matrix, not whether the matrix is sparse or dense. That said, you usually need some sort of dimensionality reduction to use random forests for text classification, because bag-of-words feature vectors are very long.
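The two points above can be sketched as follows. This is only an illustration under stated assumptions: the toy corpus and labels are invented, and TruncatedSVD is my own choice of dimensionality-reduction step (the answer does not name one). It accepts sparse input and returns a dense array, so the random forest never sees a sparse matrix.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Toy corpus and labels, invented purely for illustration.
docs = ["good movie", "great film", "bad movie", "awful film",
        "good great film", "bad awful movie"]
labels = [1, 1, 0, 0, 1, 0]

# dtype changes the element type only; the result is still a sparse matrix.
X_float = CountVectorizer(dtype=np.float32).fit_transform(docs)
print(sparse.issparse(X_float), X_float.dtype)

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    # Assumed reduction step: sparse counts in, dense low-dimensional array out.
    ('svd', TruncatedSVD(n_components=2, random_state=0)),
    ('classifier', RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(docs, labels)
predicted = pipeline.predict(["good film"])

# Inspect what the classifier actually receives.
counts = pipeline.named_steps['vectorizer'].transform(docs)
reduced = pipeline.named_steps['svd'].transform(counts)
print(reduced.shape)
```

On a real corpus n_components would be far larger than 2; the value here only fits the six-word toy vocabulary.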
Unfortunately those two are incompatible. A CountVectorizer produces a sparse matrix and the RandomForestClassifier requires a dense matrix. It is possible to convert using X.todense(). Doing this will substantially increase your memory footprint.
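To get a feel for the memory cost, here is a rough sketch that compares the storage of a sparse matrix with the size its dense equivalent would need. The shape and density are hypothetical numbers chosen to resemble a bag-of-words matrix, and the dense size is computed arithmetically rather than by allocating the array.

```python
import numpy as np
from scipy import sparse

# Hypothetical bag-of-words shape: 1,000 documents, 50,000 terms, 0.1% non-zero.
X_sparse = sparse.random(1000, 50000, density=0.001, format='csr',
                         dtype=np.float64, random_state=0)

# CSR storage is three arrays: the values, their column indices, and row pointers.
sparse_bytes = (X_sparse.data.nbytes
                + X_sparse.indices.nbytes
                + X_sparse.indptr.nbytes)

# A dense array stores every cell: 1000 * 50000 * 8 bytes = 400,000,000 bytes.
dense_bytes = X_sparse.shape[0] * X_sparse.shape[1] * X_sparse.dtype.itemsize

print("sparse: %d bytes" % sparse_bytes)
print("dense:  %d bytes" % dense_bytes)
```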
Below is sample code to do this, based on http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html, which allows you to call .todense() in a pipeline stage.
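from sklearn.base import BaseEstimator, TransformerMixin

class DenseTransformer(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; this step is stateless
        return self

    def transform(self, X, y=None, **fit_params):
        return X.todense()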
Once you have your DenseTransformer, you are able to add it as a pipeline step.
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('to_dense', DenseTransformer()),
    ('classifier', RandomForestClassifier())
])
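For reference, a self-contained version of the pipeline above might look like the sketch below. The toy corpus and labels are invented. It also departs from the answer in one respect, which is an assumption on my part: it calls toarray() instead of todense(), because todense() returns a numpy.matrix, which some recent scikit-learn versions refuse as input.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


class DenseTransformer(TransformerMixin, BaseEstimator):
    """Stateless pipeline step that densifies a sparse matrix."""

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        # toarray() yields a plain ndarray (assumed substitute for todense()).
        return X.toarray()


# Toy corpus and labels, invented purely for illustration.
docs = ["good movie", "great film", "bad movie", "awful film"]
labels = [1, 1, 0, 0]

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('to_dense', DenseTransformer()),
    ('classifier', RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(docs, labels)
predicted = pipeline.predict(["good film"])

# What the classifier receives after the first two steps.
dense = DenseTransformer().transform(CountVectorizer().fit_transform(docs))
print(type(dense).__name__, dense.shape)
```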
Another option would be to use a classifier meant for sparse data, like LinearSVC.
from sklearn.svm import LinearSVC
pipeline = Pipeline([('vectorizer', CountVectorizer()), ('classifier', LinearSVC())])
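A minimal runnable sketch of this option, with an invented toy corpus, shows that the sparse output of CountVectorizer flows straight into LinearSVC with no conversion step.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus and labels, invented purely for illustration.
docs = ["good movie", "great film", "bad movie", "awful film"]
labels = [1, 1, 0, 0]

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', LinearSVC()),
])
pipeline.fit(docs, labels)
predicted = pipeline.predict(["good film"])

# The matrix handed to LinearSVC is still sparse.
features = pipeline.named_steps['vectorizer'].transform(docs)
print(sparse.issparse(features))
```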
Random forests in scikit-learn 0.16-dev now accept sparse data.
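Assuming a scikit-learn release that includes this change, the pipeline from the question should run unmodified. The sketch below uses an invented toy corpus in place of the asker's data.

```python
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Toy corpus and labels, invented purely for illustration.
docs = ["good movie", "great film", "bad movie", "awful film"]
labels = [1, 1, 0, 0]

# Same two steps as in the question, with no densifying step in between.
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(docs, labels)
predicted = pipeline.predict(["bad film"])

# The forest is fitted on a sparse matrix directly.
X = pipeline.named_steps['vectorizer'].transform(docs)
print(sparse.issparse(X))
```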
The tersest solution would be to use a FunctionTransformer to convert to dense: this will automatically implement the fit, transform and fit_transform methods as in David’s answer. Additionally, if I don’t need special names for my pipeline steps, I like to use the sklearn.pipeline.make_pipeline convenience function to enable a more minimalist language for describing the model:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
pipeline = make_pipeline(
    CountVectorizer(),
    FunctionTransformer(lambda x: x.todense(), accept_sparse=True),
    RandomForestClassifier()
)
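A self-contained variant of the snippet above is sketched here with an invented toy corpus. Two choices are mine rather than the answer's: a named function (to_dense, a hypothetical helper) replaces the lambda so the pipeline can be pickled, and toarray() replaces todense() so the classifier receives a plain ndarray.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(X):
    """Hypothetical helper: convert a scipy sparse matrix to an ndarray."""
    return X.toarray()


# Toy corpus and labels, invented purely for illustration.
docs = ["good movie", "great film", "bad movie", "awful film"]
labels = [1, 1, 0, 0]

pipeline = make_pipeline(
    CountVectorizer(),
    FunctionTransformer(to_dense, accept_sparse=True),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
pipeline.fit(docs, labels)
predicted = pipeline.predict(["great movie"])

# make_pipeline names each step after its lowercased class name.
step_names = [name for name, _ in pipeline.steps]
print(step_names)

dense = to_dense(CountVectorizer().fit_transform(docs))
```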
You can also add a TfidfTransformer to this pipeline:
pipelinEx = Pipeline([('bow', vectorizer),
                      ('tfidf', TfidfTransformer()),
                      ('to_dense', DenseTransformer()),
                      ('classifier', classifier)])
The first step above gets the word counts for the documents in sparse matrix form. In practice, however, you may be computing tf-idf scores with TfidfTransformer on a set of new, unseen documents. In that case, calling transform() on the fitted TfidfTransformer with the count matrix produced by the vectorizer computes the tf-idf scores for your docs. Internally this is the tf * idf multiplication, where each term frequency is weighted by its idf value.
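The tf * idf multiplication described above can be checked by hand. In this sketch the corpus is invented, and norm=None is an assumption made only so the raw product is visible; the default setting additionally L2-normalizes each row.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus, invented purely for illustration.
docs = ["good movie", "great film", "bad movie", "awful film"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)           # sparse term counts (tf)

tfidf_transformer = TfidfTransformer(norm=None)   # norm=None: skip row normalization
tfidf = tfidf_transformer.fit_transform(counts)   # still sparse

# Each count column is scaled by the idf weight learned for that term.
manual = counts.toarray() * tfidf_transformer.idf_
print(np.allclose(tfidf.toarray(), manual))

# New, unseen documents reuse the fitted vocabulary and idf weights.
new_counts = vectorizer.transform(["good film"])
new_tfidf = tfidf_transformer.transform(new_counts)
```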
I found that FunctionTransformer and using x.toarray() rather than x.todense() worked for me.
'pipeline': Pipeline(
    [
        ('vect', TfidfVectorizer()),
        ('dense', FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)),
        ('clf', GaussianProcessClassifier())
    ]
)
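One possible reason toarray() works where todense() does not is the return type: for a scipy sparse matrix, toarray() gives a plain numpy.ndarray while todense() gives a numpy.matrix, which some scikit-learn versions reject. That explanation is my assumption, not something the answer states. The sketch below only shows the type difference on a small hand-built matrix.

```python
import numpy as np
from scipy import sparse

# Small hand-built sparse matrix, for illustration only.
X = sparse.csr_matrix(np.array([[1, 0, 2],
                                [0, 0, 3]]))

as_array = X.toarray()    # plain ndarray
as_matrix = X.todense()   # numpy.matrix subclass

print(type(as_array).__name__, type(as_matrix).__name__)

# The values are identical; only the container type differs.
same_values = np.array_equal(as_array, np.asarray(as_matrix))
print(same_values)
```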