How to use polars dataframes with scikit-learn?

Question:

I’m unable to use polars dataframes with scikitlearn for ML training.

Currently I’m doing all the dataframe preprocessing in polars and during model training i’m converting it into a pandas one in order for it to work.

Is there any method to directly use polars dataframe as it is for ML training without changing it to pandas?

Asked By: RKCH

||

Answers:

You must call to_numpy when passing a DataFrame to sklearn. Though sometimes sklearn can work on polars Series it is still good type hygiene to transform to the type the host library expects.

import polars as pl
from sklearn.linear_model import LinearRegression

data = pl.DataFrame(
    np.random.randn(100, 5)
)

x = data.select([
    pl.all().exclude("column_0"),
])

y = data.select(pl.col("column_0").alias("y"))


x_train = x[:80]
y_train = y[:80]

x_test = x[80:]
y_test = y[80:]


m = LinearRegression()

m.fit(X=x_train.to_numpy(), y=y_train.to_numpy())
m.predict(x_test.to_numpy())
Answered By: ritchie46
encoding_transformer1 = ColumnTransformer(
    [("Normalizer", Normalizer(), ['Age', 'Fare']),
     ("One-hot encoder",
      OneHotEncoder(dtype=int, handle_unknown='infrequent_if_exist'),
      ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked'])],
    n_jobs=-1,
    verbose=True,
    verbose_feature_names_out=True)

encoding_transformer1.fit(xtrain)
train_data = encoding_transformer1.transform(xtrain).tocsr()
test_data = encoding_transformer1.transform(xtest).tocsr()

I’m getting this error:

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

what should i do?

Answered By: RKCH