Accuracy of same validation dataset differs between last epoch and after fit

Question

The following code gives a log ending with

Epoch 19/20
1/1 [==============================] - 0s 473ms/step - loss: 1.4018 - accuracy: 0.8750 - val_loss: 1.8656 - val_accuracy: 0.8900
Epoch 20/20
1/1 [==============================] - 0s 444ms/step - loss: 0.5904 - accuracy: 0.8750 - val_loss: 2.1255 - val_accuracy: 0.8700
get_dataset: validation
Found 1000 files belonging to 2 classes.
Using 100 files for validation.
4/4 [==============================] - 1s 81ms/step
eval acc: 0.81

My question is:

Why is the val_accuracy after the last epoch (0.87) different from the eval acc (0.81) after the fit?

In my code, I try to use the same dataset for the validation of each epoch during fit and the additional validation afterwards.

[Update 1, 2022-07-19:

Obviously, the two accuracy calculations don’t really use the same data. How can I debug which data is actually used?
[Update 3, 2022-07-20: I have followed the data into TensorFlow. The last thing I can see is that the x.filenames are equal in Model.evaluate (during fit) and in Model.predict. I could not debug much further, because shortly afterwards, in quick_execute, __inference_test_function_248219 and __inference_predict_function_231438, respectively, are evaluated outside Python, and their arguments are tensors with dtype=resource, whose contents I cannot inspect.]
I have deliberately removed my class balancing code to keep my example small. I know that this makes the accuracies less useful, but I don’t care about that for now.
Note that get_dataset('validation') is only called once at the beginning of the fit, not at each epoch.
I have now also set max_queue_size=0, use_multiprocessing=False, workers=0 (as seen here, found via this related SO question about TensorFlow 1), but this did not make the accuracies equal.

]

Code:

import tensorflow as tf
from sklearn.metrics import accuracy_score
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.preprocessing import image_dataset_from_directory
    
inputs = tf.keras.Input(shape=(224, 224, 3))
base_model = tf.keras.applications.ResNet50(weights='imagenet', include_top=False)
base_output = base_model(inputs)
base_model.trainable = False
out = Flatten(name='flat')(base_output)
out = Dense(1, activation='sigmoid')(out)
model = Model(inputs=inputs, outputs=out)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

def get_dataset(subset):
    print('get_dataset:', subset)
    return image_dataset_from_directory(
        'data-nodup-1000',
        labels="inferred",
        label_mode='binary',
        color_mode="rgb",
        image_size=(224, 224),
        shuffle=True,
        seed=1,
        validation_split=0.1,
        subset=subset,
        crop_to_aspect_ratio=False,
    )

model.fit(
    get_dataset('training'),
    steps_per_epoch=1,
    epochs=20,
    validation_data=get_dataset('validation'),
    max_queue_size=0,
    use_multiprocessing=False,
    workers=0,
)

val_dataset = get_dataset('validation')
true_class = tf.concat([y for x, y in val_dataset], axis=0)
pred = model.predict(val_dataset)
pred_class = pred >= .5
print('eval acc:', accuracy_score(true_class, pred_class))

[Update 2, 2022-07-19:
I can also reproduce the behavior with the deprecated ImageDataGenerator, using

from tensorflow.keras.applications.resnet50 import preprocess_input
from keras_preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    validation_split=0.1,
)

def get_dataset(subset):
    print('get_dataset:', subset)
    return datagen.flow_from_directory(
        'data-nodup-1000',
        class_mode='binary',
        target_size=(224, 224),
        shuffle=True,
        seed=1,
        subset=subset,
    )

and

true_class = val_dataset.labels

]

[Update 4, 2022-07-21: Note that deactivating shuffling of validation data by setting shuffle=(subset == 'training') makes the two validation accuracies equal. This is not a workaround, however, because the validation set then consists only of class 1, since flow_from_directory doesn’t do stratification.
]
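
For reference, a stratified split can be sketched without TensorFlow. This is only a minimal illustration under stated assumptions: stratified_split is a hypothetical helper, not part of Keras or scikit-learn, and the label list is built from the class counts given below (113 files of class 0, 887 of class 1). How the resulting indices would be fed into an input pipeline is left open.

```python
import random
from collections import defaultdict


def stratified_split(labels, validation_fraction=0.1, seed=1):
    """Return (train_indices, val_indices), keeping per-class proportions.

    Hypothetical helper for illustration; not a Keras or scikit-learn API.
    """
    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_indices, val_indices = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction from every class.
        n_val = round(len(indices) * validation_fraction)
        val_indices.extend(indices[:n_val])
        train_indices.extend(indices[n_val:])
    return sorted(train_indices), sorted(val_indices)


# Class counts from the question: 113 files of class 0, 887 files of class 1.
labels = [0] * 113 + [1] * 887
train_idx, val_idx = stratified_split(labels)
val_labels = [labels[i] for i in val_idx]
print(len(val_idx), val_labels.count(0), val_labels.count(1))  # 100 11 89
```

With this kind of split, both classes appear in the validation set even when the validation data is not shuffled afterwards.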

My environment:

I am using all up-to-date libraries, like tensorflow 2.9.1 and sklearn 1.1.1 (via pip-compile -U).
The folder data-nodup-1000 contains one subfolder with 113 files of class 0, and one subfolder with 887 files of class 1.

Asked By: Robert Pollak

Answer 1

A few properties of your data cause this:

First, your data is highly imbalanced (roughly an 8 to 1 label ratio), which makes the model prone to overfitting and the validation estimate unreliable.
Second, shuffle is set to True in the get_dataset function, so every call to get_dataset() shuffles your data. Because (1) your validation set is very small and (2) your train/val split is not stratified over your labels, the validation metrics vary a lot due to this shuffling.
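
To put numbers on these two points, here is a small sketch that uses only figures from the question (113 and 887 files per class, 100 validation files, last-epoch val_accuracy of 0.87). The standard error treats the 100 validation samples as independent draws, which is an assumption made for illustration.

```python
import math

# Class counts as reported in the question.
class_counts = {0: 113, 1: 887}
total = sum(class_counts.values())

# Accuracy of a classifier that always predicts the majority class.
majority_baseline = max(class_counts.values()) / total


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy estimate on n_samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


val_accuracy = 0.87  # last-epoch val_accuracy from the log
n_validation = 100   # "Using 100 files for validation."
std_error = accuracy_standard_error(val_accuracy, n_validation)

print(f"majority-class baseline: {majority_baseline:.3f}")          # 0.887
print(f"standard error at n={n_validation}: {std_error:.3f}")       # 0.034
```

A val_accuracy of 0.87 is therefore about what always predicting class 1 would achieve on the full dataset, and with only 100 validation samples a difference of a few percentage points is within the expected noise.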

Suggestions to solve this:

Call get_dataset() only once each for the training and validation datasets before fitting the model, and store the results in variables. If there is no sequential order in your data, consider setting shuffle=False.
(Optional) If possible, make your dataset more balanced with techniques such as data augmentation or over-/under-sampling.


# Assumes the imports and the compiled `model` from the question.
def get_dataset(subset):
    return image_dataset_from_directory(
        'data-nodup-1000',
        labels="inferred",
        label_mode='binary',
        color_mode="rgb",
        image_size=(224, 224),
        shuffle=False,
        seed=0,
        validation_split=0.1,
        subset=subset,
        crop_to_aspect_ratio=False,
    )

train_dataset = get_dataset('training')
val_dataset = get_dataset('validation')

model.fit(
    train_dataset,
    steps_per_epoch=1,
    epochs=20,
    validation_data=val_dataset,
)


true_class = tf.concat([y for x, y in val_dataset], axis=0)
pred = model.predict(val_dataset)
pred_class = pred >= .5
print('eval acc:', accuracy_score(true_class, pred_class))

Answered By: Ebrahim Pichka

Answer 2

I have now found out that in TensorFlow 2.9.1 model.predict uses the second iteration of the dataset, which is shuffled differently than the first iteration!
It even uses the second iteration when I directly call model.predict(get_dataset('validation'))!

Therefore, the entries of true_class and pred do not match.
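
One way to avoid the mismatch regardless of shuffling is to collect labels and predictions in a single pass over the dataset. The following is a minimal sketch, not the fix I ended up using: ReshufflingBatches is a hypothetical toy stand-in for a dataset that reshuffles on every iteration, collect_labels_and_predictions is a hypothetical helper, and the "model" is a plain function.

```python
import numpy as np


class ReshufflingBatches:
    """Toy stand-in for a dataset that reshuffles on every iteration."""

    def __init__(self, features, labels, batch_size, seed):
        self.features = np.asarray(features)
        self.labels = np.asarray(labels)
        self.batch_size = batch_size
        self.rng = np.random.default_rng(seed)

    def __iter__(self):
        # A new permutation is drawn each time iteration starts.
        order = self.rng.permutation(len(self.labels))
        for start in range(0, len(order), self.batch_size):
            batch = order[start:start + self.batch_size]
            yield self.features[batch], self.labels[batch]


def collect_labels_and_predictions(batches, predict_on_batch):
    """Iterate once, so each prediction stays paired with its label."""
    labels, preds = [], []
    for x, y in batches:
        labels.append(np.asarray(y))
        preds.append(np.asarray(predict_on_batch(x)))
    return np.concatenate(labels), np.concatenate(preds)


def perfect_model(x):
    """Toy predictor that is always right, so aligned accuracy is 1.0."""
    return (x >= 11).astype(float)


features = np.arange(100, dtype=float)
labels = (features >= 11).astype(int)  # 11 zeros, 89 ones
toy_dataset = ReshufflingBatches(features, labels, batch_size=32, seed=1)

# Two separate passes (the pattern from the question): the orders differ.
two_pass_labels = np.concatenate([y for _, y in toy_dataset])
two_pass_preds = np.concatenate([perfect_model(x) for x, _ in toy_dataset])
two_pass_acc = np.mean(two_pass_labels == (two_pass_preds >= 0.5))

# One pass: labels and predictions come from the same iteration.
one_pass_labels, one_pass_preds = collect_labels_and_predictions(
    toy_dataset, perfect_model)
one_pass_acc = np.mean(one_pass_labels == (one_pass_preds >= 0.5))

print("two passes:", two_pass_acc)
print("one pass:  ", one_pass_acc)
```

With a Keras model, the same helper could presumably be called as collect_labels_and_predictions(val_dataset, model.predict_on_batch); I have not benchmarked this against model.predict.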

Switching to TensorFlow 2.10.0-rc3 and its tf.keras.utils.split_dataset makes the accuracies equal.

Here’s the updated code:

import tensorflow as tf
from sklearn.metrics import accuracy_score
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.preprocessing import image_dataset_from_directory
    
inputs = tf.keras.Input(shape=(224, 224, 3))
base_model = tf.keras.applications.ResNet50(weights='imagenet', include_top=False)
base_output = base_model(inputs)
base_model.trainable = False
out = Flatten(name='flat')(base_output)
out = Dense(1, activation='sigmoid')(out)
model = Model(inputs=inputs, outputs=out)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

dataset = image_dataset_from_directory(
    'data-synthetic',
    labels="inferred",
    label_mode='binary',
    color_mode="rgb",
    image_size=(224, 224),
    shuffle=True,
    seed=1,
    crop_to_aspect_ratio=False,
)
train_dataset, val_dataset = tf.keras.utils.split_dataset(dataset, right_size=0.1)

model.fit(
    train_dataset,
    steps_per_epoch=1,
    epochs=20,
    validation_data=val_dataset,
    max_queue_size=0,
    use_multiprocessing=False,
    workers=0,
)

true_class = tf.concat([y for x, y in val_dataset], axis=0)
pred = model.predict(val_dataset)
pred_class = pred >= .5
print('eval acc:', accuracy_score(true_class, pred_class))

which correctly yields:

Epoch 19/20
1/1 [==============================] - 0s 438ms/step - loss: 0.4426 - accuracy: 0.9062 - val_loss: 0.4658 - val_accuracy: 0.8800
Epoch 20/20
1/1 [==============================] - 0s 444ms/step - loss: 2.1619 - accuracy: 0.8438 - val_loss: 0.5886 - val_accuracy: 0.8900
4/4 [==============================] - 1s 87ms/step
eval acc: 0.89

Answered By: Robert Pollak
