Disagreement between confusion matrix and accuracy when using a data generator

Question:

I was working on a model based on the following code:

epoch = 100
model_history = model.fit(train_generator,
                          epochs=epoch,
                          validation_data=test_generator,
                          callbacks=[model_es, model_rlr, model_mcp])

After training, when I evaluated the model with the following code, I got an accuracy of about 98.9%:

model.evaluate(test_generator)

41/41 [==============================] - 3s 68ms/step - loss: 0.0396 - accuracy: 0.9893
[0.039571091532707214, 0.9893211126327515]

To analyse the result, I tried to obtain a confusion matrix for the test_generator using the following code:

import numpy as np
from sklearn.metrics import confusion_matrix

y_pred = model.predict(test_generator)
y_pred = np.argmax(y_pred, axis=1)
print(confusion_matrix(test_generator.classes, y_pred))

However, the output is

[[ 68  66  93  73]
 [ 64  65  93  84]
 [ 91 102 126  86]
 [ 69  75  96  60]]

which strongly disagrees with the result of model.evaluate.

Can anyone help me obtain the actual confusion matrix for the model?

[Plot: history of model accuracy]

Entire code: https://colab.research.google.com/drive/1wpoPjnSoCqVaA–N04dcUG6A5NEVcufk?usp=sharing

Asked By: Raja Singh


Answers:

Here is code that computes the accuracy, confusion matrix, and classification report. It assumes the test generator was created with shuffle=False, so that test_gen.labels lines up with the order of the predictions:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report, f1_score

def predictor(test_gen):
    y_pred= []
    error_list=[]
    error_pred_list = []
    y_true=test_gen.labels
    classes=list(test_gen.class_indices.keys())
    class_count=len(classes)
    errors=0
    preds=model.predict(test_gen, verbose=1)
    tests=len(preds)    
    for i, p in enumerate(preds):        
        pred_index=np.argmax(p)         
        true_index=test_gen.labels[i]  # labels are integer values        
        if pred_index != true_index: # a misclassification has occurred                                           
            errors=errors + 1
            file=test_gen.filenames[i]
            error_list.append(file)
            error_class=classes[pred_index]
            error_pred_list.append(error_class)
        y_pred.append(pred_index)
            
    acc=( 1-errors/tests) * 100
    msg=f'there were {errors} errors in {tests} tests for an accuracy of {acc:6.2f}'
    print(msg)
    ypred=np.array(y_pred)
    ytrue=np.array(y_true)
    f1score=f1_score(ytrue, ypred, average='weighted')* 100
    if class_count <=30:
        cm = confusion_matrix(ytrue, ypred )
        # plot the confusion matrix
        plt.figure(figsize=(12, 8))
        sns.heatmap(cm, annot=True, vmin=0, fmt='g', cmap='Blues', cbar=False)       
        plt.xticks(np.arange(class_count)+.5, classes, rotation=90)
        plt.yticks(np.arange(class_count)+.5, classes, rotation=0)
        plt.xlabel("Predicted")
        plt.ylabel("Actual")
        plt.title("Confusion Matrix")
        plt.show()
    clr = classification_report(y_true, y_pred, target_names=classes, digits= 4) # create classification report
    print("Classification Report:n----------------------n", clr)
    return errors, tests, error_list, error_pred_list, f1score

errors, tests, error_list, error_pred_list, f1score = predictor(test_generator)

# print out list of test files misclassified if less than 50 errors

if len(error_list) > 0 and len(error_list)<50:
    print ('Below is a list of test files that were misclassified\n')
    print ('{0:^30s}{1:^30s}'.format('Test File', ' Predicted as'))
    sorted_list=sorted(error_list)
    for i in range(len(sorted_list)):
        fpath=sorted_list[i]        
        split=fpath.split('\\')  # assumes Windows-style path separators; use '/' on Linux/Colab
        f=split[4]+ '-' + split[5]  # indices depend on the directory depth of your file paths
        print(f'{f:^30s}{error_pred_list[i]:^30s}')
Answered By: Gerry P

From your code, change:

test_generator=train_datagen.flow_from_directory(
    locat_testing,
    class_mode='binary',
    color_mode='grayscale',
    batch_size=32,
    target_size=(img_size,img_size)
)

To include the shuffle parameter:

test_generator=train_datagen.flow_from_directory(
    locat_testing,
    class_mode='binary',
    color_mode='grayscale',
    batch_size=32,
    target_size=(img_size,img_size),
    shuffle=False
)

With shuffle=False the generator yields the test samples in a fixed order, so the predictions line up with test_generator.classes and the confusion matrix will reflect the model's actual performance instead of looking like random guessing.
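For completeness, here is a minimal sketch of rebuilding the confusion matrix and classification report once the generator has been created with shuffle=False (it assumes the same model and test_generator names used above):

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# With shuffle=False the sample order is fixed, so test_generator.classes
# is aligned index-for-index with the rows returned by model.predict.
y_pred = np.argmax(model.predict(test_generator), axis=1)
y_true = test_generator.classes

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred,
                            target_names=list(test_generator.class_indices.keys())))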

Answered By: Djinn