sklearn2pmml omits field names

Question:

I export an instance of sklearn.preprocessing.StandardScaler to a PMML file. The problem is that the field names do not appear in the PMML file. For example, with the iris dataset the original field names ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'] do not appear; only names like x1, x2, etc. appear instead. Is there a way to get the original field names into the PMML file?
The following code should be runnable:

from sklearn2pmml import sklearn2pmml, PMMLPipeline, make_pmml_pipeline  
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import pandas as pd
data = load_iris()
dfIris = pd.DataFrame(data=data.data, columns=data.feature_names)

ssModel = StandardScaler()
ssModel.fit(dfIris)


pipe = PMMLPipeline([("StandardScaler", ssModel)])
sklearn2pmml(pipeline=make_pmml_pipeline(pipe), pmml="ssIris.pmml")

In ssIris.pmml I see this:
[screenshot of the PMML output, not preserved]

Asked By: dba

||

Answers:

First, I believe you want to fit the PMMLPipeline after constructing it, so call pipe.fit(dfIris) rather than fitting ssModel beforehand. To preserve the column names, add a DataFrameMapper step before the scaler that maps each pandas data frame column to None (that is, no transformation); the pipeline seems to need such a preprocessing step in order to keep the column names. I am not sure whether this is the best way, but I checked it and it preserved the column names.

from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml, PMMLPipeline, make_pmml_pipeline
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import pandas as pd
data = load_iris()
dfIris = pd.DataFrame(data=data.data, columns=data.feature_names)

ssModel = StandardScaler()

# Map every column to None (pass-through) so the column names are carried along
pipe = PMMLPipeline([
    ("df_mapper", DataFrameMapper([(d, None) for d in data.feature_names], df_out=True)),
    ("StandardScaler", ssModel)
])
pipe.fit(dfIris)
sklearn2pmml(pipeline=make_pmml_pipeline(pipe), pmml="ssIris.pmml")
Answered By: Jehona Kryeziu

The only component that comes into contact with the dfIris data frame (which holds the feature name information) is the StandardScaler.fit(X) method. This method does not collect or store incoming feature names in any way.

The SkLearn2PMML package gets feature names from the value of the PMMLPipeline.active_fields attribute. Right now it’s missing, which is why SkLearn2PMML falls back to default feature names "x1", "x2", ..., "xn".

This attribute is automatically set during the PMMLPipeline.fit(X, y) method invocation. Alternatively, you may set/reset this attribute manually at any time.

If you’re constructing a PMMLPipeline object using the sklearn2pmml.make_pmml_pipeline utility method, then this method also takes active_fields and target_fields arguments. Please note that in your example code you have a manually constructed PMMLPipeline object, which you then wrap into a new PMMLPipeline object using this utility function call. This is redundant, and actually masks any feature/target names that were possibly set there.

A much better version of your example:

from pandas import DataFrame 
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn2pmml import sklearn2pmml, PMMLPipeline 

data = load_iris()

iris_X = DataFrame(data = data.data, columns = data.feature_names)
iris_y = None

pipeline = PMMLPipeline([
    ("ss", StandardScaler())
])
pipeline.fit(iris_X, iris_y)

sklearn2pmml(pipeline, "ssIris.pmml")
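
To check which names ended up in an exported file, the DataField entries can be listed with the standard library. The PMML fragment below is a hand-written stand-in for illustration, not actual SkLearn2PMML output, and pmml_field_names is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# Hand-written illustrative fragment; a real export contains more elements.
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <DataDictionary>
    <DataField name="sepal length (cm)" optype="continuous" dataType="double"/>
    <DataField name="sepal width (cm)" optype="continuous" dataType="double"/>
  </DataDictionary>
</PMML>
"""

def pmml_field_names(pmml_text):
    """Return the name attribute of every DataField element, in document order."""
    root = ET.fromstring(pmml_text)
    # Strip the XML namespace so the check works for any PMML schema version.
    return [
        element.get("name")
        for element in root.iter()
        if element.tag.rsplit("}", 1)[-1] == "DataField"
    ]

print(pmml_field_names(SAMPLE_PMML))
```

For a real file, read its contents first, for example with open("ssIris.pmml").read(), and pass the text to the helper.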
Answered By: user1808924