X has 232 features, but StandardScaler is expecting 241 features as input

Question:

I want to make a prediction using KNN, and I have the following code:

def knn(trainImages, trainLabels, testImages, testLabels):
    max = 0
    for i in range(len(trainImages)):
        if len(trainImages[i]) > max:
            max = len(trainImages[i])

    for i in range(len(trainImages)):
        aux = np.array(trainImages[i])
        aux.resize(max)
        trainImages[i] = aux

    max = 0
    for i in range(len(testImages)):
        if len(testImages[i]) > max:
            max = len(testImages[i])

    for i in range(len(testImages)):
        aux = np.array(testImages[i])
        aux.resize(max)
        testImages[i] = aux

    scaler = StandardScaler()
    scaler.fit(list(trainImages))

    trainImages = scaler.transform(list(trainImages))
    testImages = scaler.transform(list(testImages))

    classifier = KNeighborsClassifier(n_neighbors=5)
    classifier.fit(trainImages, trainLabels)

    pred = classifier.predict(testImages)

    print(classification_report(testLabels, pred))

I get the error at testImages = scaler.transform(list(testImages)). I understand that the train and test arrays have a different number of features. How can I solve it?

Asked By: Fane Spoitoru

||

Answers:

A scaler in scikit-learn expects input of shape (n_samples, n_features).
If the second dimension differs between your train and test sets, scikit-learn raises an error, and the mismatch does not make sense in theory either. The n_features dimension of the train and test sets must be equal. The first dimension can differ, since it is the number of samples, and you can have any number of samples in each set.

When you call scaler.transform(test), it expects test to have the same number of features as the data you passed to scaler.fit(train). So all your images should be the same size.
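In the question's code, the padding length is computed separately for the train and test sets, which is how 241 and 232 end up different. A minimal sketch of one way around that, assuming you keep the zero-padding approach: take the length from the training set only and apply it to both sets. The helper name to_fixed_length and the toy vectors are made up for illustration.

```python
import numpy as np

def to_fixed_length(vectors, length):
    """Zero-pad or truncate each 1-D vector so that every row has `length` entries."""
    out = np.zeros((len(vectors), length), dtype=float)
    for row, vec in enumerate(vectors):
        vec = np.asarray(vec, dtype=float).ravel()[:length]  # truncate if too long
        out[row, :vec.size] = vec                            # remaining entries stay zero
    return out

# Toy data standing in for the variable-length feature vectors in the question.
train_vectors = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
test_vectors = [[1, 2], [3, 4, 5, 6, 7, 8]]

# Take the target length from the training set only, then reuse it for the test set.
n_features = max(len(v) for v in train_vectors)
X_train = to_fixed_length(train_vectors, n_features)
X_test = to_fixed_length(test_vectors, n_features)

print(X_train.shape, X_test.shape)  # (3, 4) (2, 4)
```

Padding and truncating make the shapes agree, but they also shift or cut pixel positions, so resizing every image to one fixed size is usually the cleaner fix.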

For example, if you have 100 images, train_images should have a shape like (90, 224, 224, 3) and test_images a shape like (10, 224, 224, 3); only the first dimension differs.
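Note that StandardScaler and KNeighborsClassifier take 2-D input, so a 4-D image batch like the one above still has to be flattened to (n_samples, n_features) before scaling. A small sketch with placeholder arrays (the 8x8 size is only there to keep the example light):

```python
import numpy as np

# Placeholder batches; real data would be e.g. (90, 224, 224, 3) and (10, 224, 224, 3).
train_images = np.zeros((9, 8, 8, 3), dtype=np.uint8)
test_images = np.zeros((1, 8, 8, 3), dtype=np.uint8)

# Keep the sample axis and collapse height, width and channels into one feature axis.
X_train = train_images.reshape(len(train_images), -1)
X_test = test_images.reshape(len(test_images), -1)

print(X_train.shape, X_test.shape)  # (9, 192) (1, 192)
```

Because every image has the same height, width and channel count, both sets end up with the same number of columns.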

So, try to resize your images like this:

import cv2
resized_image = cv2.resize(image, (224, 224))  # dsize is (width, height); don't include the channel dimension
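Once every image has the same size, the rest of the question's code can stay roughly as it is. The following is a rough sketch rather than a drop-in replacement: it assumes the images are already resized, uses random arrays and made-up labels in place of real data, and wraps the scaler and classifier in a pipeline so the scaler is fitted on the training set only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for already-resized images: 40 train and 10 test, each 16x16x3.
train_images = rng.random((40, 16, 16, 3))
test_images = rng.random((10, 16, 16, 3))
train_labels = np.arange(40) % 2  # made-up binary labels

# Flatten each image into one row of features.
X_train = train_images.reshape(len(train_images), -1)
X_test = test_images.reshape(len(test_images), -1)

# fit() scales and trains on the training set; predict() reuses the fitted scaler.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, train_labels)
pred = model.predict(X_test)

print(pred.shape)  # (10,)
```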
Answered By: Kaveh