AttributeError: 'Pipeline' object has no attribute 'partial_fit'

Question:

I am trying to train a binary classifier on a huge dataset. Previously, I could train it with sklearn’s fit method, but now I have more data than I can handle at once. I am trying to fit it in batches but can’t get rid of the errors. How can I train on my data incrementally? With my previous approach, I get an error about the Pipeline object. I have gone through the examples from Incremental Learning, but running those code samples also gives an error. I would appreciate any help.

X,y = transform_to_dataset(training_data)

clf = Pipeline([
    ('vectorizer', DictVectorizer()),
    ('classifier', LogisticRegression())])

length=len(X)/2

clf.partial_fit(X[:length],y[:length],classes=np.array([0,1]))

clf.partial_fit(X[length:],y[length:],classes=np.array([0,1]))

ERROR

AttributeError: 'Pipeline' object has no attribute 'partial_fit'

TRYING GIVEN CODE SAMPLES:

clf=SGDClassifier(alpha=.0001, loss='log', penalty='l2', n_jobs=-1,
                      #shuffle=True, n_iter=10, 
                      verbose=1)
length=len(X)/2

clf.partial_fit(X[:length],y[:length],classes=np.array([0,1]))

clf.partial_fit(X[length:],y[length:],classes=np.array([0,1]))

ERROR

File "/home/kntgu/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.py", line 573, in check_X_y
ensure_min_features, warn_on_dtype, estimator)
File "/home/kntgu/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
TypeError: float() argument must be a string or a number

My dataset consists of some sentences with their part of speech tags and dependency relations.

Thanks  NN  0   root
to  IN  3   case
all DT  1   nmod
who WP  5   nsubj
volunteered VBD 3   acl:relcl
.   .   1   punct

You PRP 3   nsubj
will    MD  3   aux
remain  VB  0   root
as  IN  5   case
alternates  NNS 3   obl
.   .   3   punct
Asked By: kntgu


Answers:

A Pipeline object from scikit-learn does not have a partial_fit method, as seen in the docs.

The reason for this is that you can add any estimator you want to that Pipeline object, and not all of them implement the partial_fit. Here is a list of the supported estimators.
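A quick way to see which objects expose partial_fit is to check for the attribute directly. This is only a small sketch; the three estimators checked are the ones mentioned in the question, and the result depends on the installed scikit-learn version.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.pipeline import Pipeline

# The pipeline from the question, plus its classifier and the SGD alternative
candidates = [
    Pipeline([("vectorizer", DictVectorizer()),
              ("classifier", LogisticRegression())]),
    LogisticRegression(),
    SGDClassifier(),
]

# Map each class name to whether the instance has a partial_fit attribute
supports = {type(est).__name__: hasattr(est, "partial_fit") for est in candidates}

for name, has_partial_fit in supports.items():
    print(name, has_partial_fit)
```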

As you can see, using SGDClassifier (without Pipeline) does not raise this "no attribute" error, because this specific estimator is supported. The TypeError you get instead is probably due to non-numeric (text) data: without the DictVectorizer step, the raw features reach the classifier unconverted. You can use LabelEncoder to encode the non-numeric columns.
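One possible way to combine both points is to vectorize each batch yourself and call partial_fit on the classifier directly. The sketch below swaps in FeatureHasher (my substitution, not something used in the question) because it is stateless and accepts dicts, so it does not need to see the whole dataset first. The toy X and y are hypothetical stand-ins for the output of transform_to_dataset. The default loss is used here; the logistic loss from the question is spelled 'log' in older scikit-learn releases and 'log_loss' in newer ones.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

# Hypothetical stand-in for transform_to_dataset(training_data)
X = [
    {"word": "Thanks", "pos": "NN", "dep": "root"},
    {"word": "to", "pos": "IN", "dep": "case"},
    {"word": "remain", "pos": "VB", "dep": "root"},
    {"word": "as", "pos": "IN", "dep": "case"},
]
y = [1, 0, 1, 0]

# Stateless vectorizer: transform() works without a prior fit()
hasher = FeatureHasher(n_features=2**10, input_type="dict")
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all labels must be declared on the first call

half = len(X) // 2  # integer division so the value can be used as a slice index
for start in (0, half):
    X_batch = hasher.transform(X[start:start + half])
    clf.partial_fit(X_batch, y[start:start + half], classes=classes)

pred = clf.predict(hasher.transform(X))
```

For a dataset that does not fit in memory, the two slices would be replaced by batches read from disk one at a time.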

Answered By: BenjaVR

I ran into the same problem: an SGDClassifier inside a Pipeline cannot be trained incrementally, because the Pipeline does not expose the partial_fit method. There is another way to do incremental learning with sklearn, not with partial_fit but with warm_start. Several estimators in sklearn, such as LogisticRegression and RandomForestClassifier, support warm_start.

Warm start is another way of doing incremental learning. Read here.
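A minimal sketch of the warm_start idea with LogisticRegression, on synthetic data made up for illustration. With warm_start=True, the second fit call starts from the coefficients of the first one instead of from scratch. Note that each fit call still only sees the data passed to it, so this is not equivalent to training once on all the data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (hypothetical data, for illustration only)
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LogisticRegression(warm_start=True)
half = len(X) // 2

# First batch: ordinary fit
clf.fit(X[:half], y[:half])
coef_first = clf.coef_.copy()

# Second batch: optimization is initialized from coef_first
clf.fit(X[half:], y[half:])
coef_second = clf.coef_.copy()
```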

Answered By: manish Prasad

Pipeline has no partial_fit attribute because many of the models that can be assigned to a pipeline do not implement partial_fit.
My solution is to store the steps in a dictionary rather than a pipeline and save it with joblib.

import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier

count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
model = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42)

# Bundle the individual steps in a plain dict instead of a Pipeline
tosave = {
    "model": model,
    "count": count_vect,
    "tfid": tfidf_transformer,
}

filename = 'package.sav'
joblib.dump(tosave, filename)

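Then load it and use:

import joblib
import numpy as np

filename = 'package.sav'
pack = joblib.load(filename)

# X must already be a numeric feature matrix. The first partial_fit call
# needs the full list of classes; np.unique(Y) assumes this batch has them all.
pack['model'].partial_fit(X, Y, classes=np.unique(Y))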
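To show the full round trip, here is a hedged sketch of the same dictionary idea. It assumes a HashingVectorizer instead of CountVectorizer and TfidfTransformer (my substitution), because the hashing vectorizer is stateless and can transform each batch without having been fitted on the whole corpus. The sentences, labels and temporary file path are made up for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**12)  # stateless, no fit needed
model = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3, random_state=42)

# First batch (hypothetical data): declare all classes on the first call
batch1 = ["thanks to all who volunteered", "you will remain as alternates"]
labels1 = [1, 0]
model.partial_fit(vectorizer.transform(batch1), labels1, classes=np.array([0, 1]))

# Save the bundle, as in the answer above, but to a temporary location
path = os.path.join(tempfile.mkdtemp(), "package.sav")
joblib.dump({"model": model, "vectorizer": vectorizer}, path)

# Later: load the bundle and continue training on a new batch
pack = joblib.load(path)
batch2 = ["thanks to everyone", "alternates will remain"]
labels2 = [1, 0]
pack["model"].partial_fit(pack["vectorizer"].transform(batch2), labels2)

pred = pack["model"].predict(pack["vectorizer"].transform(batch2))
```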

Answered By: Imran Khan