How to Get feature_importance when using sklearn2pmml

Question:

I trained a GBDT model named ‘GB’ with scikit-learn in Python, and I want to export the trained model to a PMML file. I ran into the following problem:
1. If I put the already-trained ‘GB’ model into a PMMLPipeline and export it with sklearn2pmml, like below:

GB = GradientBoostingClassifier(n_estimators=100,learning_rate=0.05)
GB.fit(train[list(x_features)],train['Target'])
GB_pipeline = PMMLPipeline([("classifier",GB)])
sklearn2pmml.sklearn2pmml(GB_pipeline,pmml='GB.pmml')
importance=GB.feature_importances_

This produces the warning ‘The ‘active_fields’ attribute is not set’, and all the feature names are lost in the exported PMML file.

  2. If I instead train the model directly inside the PMMLPipeline, GB_pipeline has no feature_importances_ attribute, so I cannot inspect the feature importances of the model. Like below:

    GB_pipeline = PMMLPipeline([("classifier",GradientBoostingClassifier(n_estimators=100,learning_rate=0.05))])
    GB_pipeline.fit(train[list(x_features)],train['Target'])
    sklearn2pmml.sklearn2pmml(GB_pipeline,pmml='GB.pmml')

What should I do so that I can both inspect the feature importances of the model and keep the feature names in the exported PMML file?
Thank you very much!

Asked By: Noah


Answers:

Important points:

  1. Instantiate the classifier outside of the pipeline.
  2. Instantiate the (PMML) pipeline and insert this classifier into it.
  3. Fit this pipeline as a whole.
  4. Print the feature importances of this classifier, and export this pipeline into a PMML document.

In your first code example, you’re fitting the classifier, but you should be fitting the pipeline as a whole – hence the warning that the internal state of the pipeline is incomplete. In your second code example, you don’t have a direct reference to the classifier (however, you could obtain it by “parsing” the last step of the fitted pipeline).
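
A minimal sketch of obtaining the classifier from the last step of a fitted pipeline, as mentioned above. It uses scikit-learn's plain Pipeline as a stand-in, on the assumption that PMMLPipeline (a Pipeline subclass) exposes the same accessors. The bundled Iris loader, the reduced n_estimators and the variable names are illustrative choices, not part of the original question.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline  # stand-in for sklearn2pmml's PMMLPipeline (assumption)

X, y = load_iris(return_X_y=True)

# The classifier is created inline, so there is no direct reference to it.
pipeline = Pipeline([
    ("classifier", GradientBoostingClassifier(n_estimators=10, random_state=0))
])
pipeline.fit(X, y)

# Three ways to reach the fitted last step of the pipeline.
by_position = pipeline.steps[-1][1]           # (name, estimator) tuple of the last step
by_name = pipeline.named_steps["classifier"]  # lookup by step name
by_index = pipeline[-1]                       # integer indexing (scikit-learn 0.21 or newer)

print(by_position.feature_importances_)
```

All three expressions return the same fitted estimator object, so any of them can be used to read feature_importances_ after the pipeline has been fitted.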

A complete example based on the Iris dataset:

import pandas
iris_df = pandas.read_csv("Iris.csv")

from sklearn.ensemble import GradientBoostingClassifier
from sklearn2pmml import sklearn2pmml, PMMLPipeline
gbt = GradientBoostingClassifier()
pipeline = PMMLPipeline([
    ("classifier", gbt)
])
pipeline.fit(iris_df[iris_df.columns.difference(["Species"])], iris_df["Species"])
print (gbt.feature_importances_)
sklearn2pmml(pipeline, "GBTIris.pmml", with_repr = True)
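
To read the importances together with the feature names, the array can be paired with the column names of the training frame. This is only a sketch under some assumptions: scikit-learn's Pipeline stands in for PMMLPipeline, the data comes from scikit-learn's bundled Iris loader rather than Iris.csv, and importance_by_feature is a hypothetical variable name. The pairing is valid only when no preprocessing step adds, drops or reorders columns before the classifier.

```python
import pandas
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline  # stand-in for sklearn2pmml's PMMLPipeline (assumption)

# as_frame=True returns pandas objects (scikit-learn 0.23 or newer).
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

gbt = GradientBoostingClassifier(n_estimators=10, random_state=0)
pipeline = Pipeline([
    ("classifier", gbt)
])
pipeline.fit(X, y)

# feature_importances_ follows the column order of the frame passed to fit().
importance_by_feature = pandas.Series(
    gbt.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(importance_by_feature)
```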
Answered By: user1808924

If you came here, like me, looking for a way to include the importances in the PMML file exported from a Python pipeline, then I have good news.

From searching the internet I learned that the importances field has to be set manually on the RF model in Python; it can then be stored inside the PMML.

TL;DR Here is the code:

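from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# Keep a reference to the model object outside the pipeline, which is the trick
RFModel = RandomForestRegressor()

# Build the pipeline as usual
column_trans = ColumnTransformer([
    ('onehot', OneHotEncoder(drop='first'), ["sex", "smoker", "region"]),
    ('Stdscaler', StandardScaler(), ["age", "bmi"]),
    ('MinMxscaler', MinMaxScaler(), ["children"])
])

pipeline = PMMLPipeline([
    ('col_transformer', column_trans),
    ('model', RFModel)
])

# Fit the pipeline (X is the training DataFrame with the columns above, y the target)
pipeline.fit(X, y)

# Store the importances in a temporary variable
importances = RFModel.feature_importances_

# Assign them to the MODEL ITSELF (the main part)
RFModel.pmml_feature_importances_ = importances

# Finally, save the model as usual
sklearn2pmml(pipeline, r"pathfile.pmml")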

Now, you will see the importances in the PMML file!!
Reference from: Openscoring

Answered By: Aayush Shah

Another way to do this is to refer to the model through the PMML pipeline. This is very similar to Aayush Shah's answer, but here we use the PMMLPipeline itself to look up the importances. See below:

model = DecisionTreeClassifier()
pmml_pipeline = PMMLPipeline([
    ("preprocessing", preprocessing_step),
    ("decisiontree", model)
])
# feature_importances_ only exists once the pipeline has been fitted
pmml_pipeline.fit(X, y)
# access your model using pmml_pipeline[1], then read its feature importances
pmml_pipeline[1].feature_importances_
Answered By: Tom