setting an array element with a sequence after one-hot-encoding with scikit-learn

Question:

I have been using scikit-learn’s support vector classifier (svm.SVC) for a binary classification problem.

Example row from the dataset:

   PassengerId  Survived  Pclass  Name                     Sex   Age   SibSp  Parch  Ticket     Fare  Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris  male  22.0  1      0      A/5 21171  7.25  NaN    S

I transformed the data into numbers using the OneHotEncoder and the ColumnTransformer from scikit-learn:

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Name", "Sex", "Ticket", "Cabin", "Embarked"]
encoder = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                   encoder,
                                   categorical_features)],
                                   remainder="passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X

It returned a scipy.sparse._csr.csr_matrix, so I converted it into a DataFrame using:

import pandas as pd

transformed_X = pd.DataFrame(transformed_X)

Then I re-split the data and fit the model:

from sklearn.model_selection import train_test_split

transformed_X_train, transformed_X_test, y_train, y_test = train_test_split(transformed_X,
                                                                            y,
                                                                            test_size=0.2)

from sklearn import svm
clf = svm.SVC()
clf.fit(transformed_X_train, y_train)

Unfortunately, I got an error:

TypeError                                 Traceback (most recent call last)
TypeError: float() argument must be a string or a real number, not 'csr_matrix'

...

ValueError: setting an array element with a sequence.

I tried searching online, but I couldn’t find a good solution to my problem.
Can someone please help? I don’t know what I’m doing wrong. Any help would be appreciated 🙂

Asked By: Yusuf Saad

Answers:

I got it! I first filled in the missing data in the DataFrame before encoding it. Then I one-hot-encoded the entire training set rather than only X, like so:

transformed_X = transformer.fit_transform(train)
transformed_X
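Here is a minimal sketch of the "fill first, then encode" step, assuming a small made-up frame named df in place of the real data. The placeholder label "missing", the median fill, and sparse_threshold=0 are illustrative choices, not necessarily what was used here. sparse_threshold=0 asks ColumnTransformer for a dense array, so pd.DataFrame receives plain numbers.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows in the spirit of the dataset above (not the real data)
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
    "Age": [22.0, 38.0, np.nan, 35.0],
})

# Fill missing values before encoding:
# a placeholder label for the categorical column, the median for the numeric one
df["Embarked"] = df["Embarked"].fillna("missing")
df["Age"] = df["Age"].fillna(df["Age"].median())

# sparse_threshold=0 makes the transformer return a dense NumPy array
transformer = ColumnTransformer(
    [("one_hot", OneHotEncoder(), ["Sex", "Embarked"])],
    remainder="passthrough",
    sparse_threshold=0,
)

encoded = transformer.fit_transform(df)

# A dense array converts into an ordinary numeric DataFrame
encoded_df = pd.DataFrame(encoded)
print(encoded_df.shape)
```

With a dense, fully numeric frame like encoded_df, each cell holds a single number instead of a sparse row object.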

The difference between X and the full training set is that X is the training set without the target values (in this case, whether the passenger survived).
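To make the relationship between the two concrete, here is a small sketch assuming a made-up train frame with a "Survived" target column. The rows are invented for illustration.

```python
import pandas as pd

# Made-up stand-in for the full training set (features plus target)
train = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Age": [22.0, 38.0, 26.0],
    "Survived": [0, 1, 1],
})

# X is the training set without the target column
X = train.drop(columns="Survived")

# y holds only the target values
y = train["Survived"]

print(list(X.columns))
print(list(y))
```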

Thanks! 🙂

Answered By: Yusuf Saad