Accuracy of same validation dataset differs between last epoch and after fit
Question:
The following code gives a log ending with
Epoch 19/20
1/1 [==============================] - 0s 473ms/step - loss: 1.4018 - accuracy: 0.8750 - val_loss: 1.8656 - val_accuracy: 0.8900
Epoch 20/20
1/1 [==============================] - 0s 444ms/step - loss: 0.5904 - accuracy: 0.8750 - val_loss: 2.1255 - val_accuracy: 0.8700
get_dataset: validation
Found 1000 files belonging to 2 classes.
Using 100 files for validation.
4/4 [==============================] - 1s 81ms/step
eval acc: 0.81
My question is:
Why is the val_accuracy after the last epoch (0.87) different from the eval acc (0.81) after the fit?
In my code, I try to use the same dataset for the validation of each epoch during fit and the additional validation afterwards.
[Update 1, 2022-07-19:
- Obviously, the two accuracy calculations don’t really use the same data. How can I debug which data is actually used?
  [Update 3, 2022-07-20: I have followed the data into TensorFlow. The last thing I see is that in Model.evaluate (during fit) and Model.predict the x.filenames are equal. I did not manage to debug much further, because soon afterwards, in quick_execute, the __inference_test_function_248219 resp. the __inference_predict_function_231438 are evaluated outside Python, and the arguments are tensors with dtype=resource, whose contents I cannot see.]
- I have deliberately removed my class balancing code to keep my example small. I know that this makes the accuracies less useful, but I don’t care about that for now.
- Note that get_dataset('validation') is only called once at the beginning of the fit, not at each epoch.
- I have now also set max_queue_size=0, use_multiprocessing=False, workers=0 (as seen here, found via this related SO question about TensorFlow 1), but this did not make the accuracies equal.
]
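One way to approach the "which data is actually used?" question without stepping into TensorFlow internals is to iterate the dataset twice and compare the order of what comes out. The sketch below is only an illustration in plain Python: ReshufflingBatches is a hypothetical stand-in for a dataset that reshuffles on every pass, and labels_in_order and passes_agree are made-up helper names, not TensorFlow APIs.

```python
import random


class ReshufflingBatches:
    """Hypothetical stand-in for a dataset that reshuffles on every pass."""

    def __init__(self, labels, batch_size, seed):
        self._labels = list(labels)
        self._batch_size = batch_size
        self._rng = random.Random(seed)

    def __iter__(self):
        order = self._labels[:]
        self._rng.shuffle(order)  # a new order on every call to iter()
        for i in range(0, len(order), self._batch_size):
            yield order[i:i + self._batch_size]


def labels_in_order(batches):
    """Flatten one full pass over an iterable of label batches."""
    return [label for batch in batches for label in batch]


def passes_agree(batches):
    """Return True if two consecutive passes yield the same order."""
    return labels_in_order(batches) == labels_in_order(batches)


# Unique ids instead of class labels, so any reordering is visible.
reshuffling = ReshufflingBatches(range(100), batch_size=32, seed=1)
static = [list(range(0, 50)), list(range(50, 100))]

reshuffling_agrees = passes_agree(reshuffling)
static_agrees = passes_agree(static)
print(reshuffling_agrees, static_agrees)
```

With a tf.data dataset the same comparison would presumably be done on the label tensors after converting them to Python lists or NumPy arrays. If two passes disagree, anything collected in separate passes (labels in one, predictions in another) cannot be compared element by element.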
Code:
import tensorflow as tf
from sklearn.metrics import accuracy_score
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.preprocessing import image_dataset_from_directory
inputs = tf.keras.Input(shape=(224, 224, 3))
base_model = tf.keras.applications.ResNet50(weights='imagenet', include_top=False)
base_output = base_model(inputs)
base_model.trainable = False
out = Flatten(name='flat')(base_output)
out = Dense(1, activation='sigmoid')(out)
model = Model(inputs=inputs, outputs=out)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
def get_dataset(subset):
    print('get_dataset:', subset)
    return image_dataset_from_directory(
        'data-nodup-1000',
        labels="inferred",
        label_mode='binary',
        color_mode="rgb",
        image_size=(224, 224),
        shuffle=True,
        seed=1,
        validation_split=0.1,
        subset=subset,
        crop_to_aspect_ratio=False,
    )
model.fit(
    get_dataset('training'),
    steps_per_epoch=1,
    epochs=20,
    validation_data=get_dataset('validation'),
    max_queue_size=0,
    use_multiprocessing=False,
    workers=0,
)
val_dataset = get_dataset('validation')
true_class = tf.concat([y for x, y in val_dataset], axis=0)
pred = model.predict(val_dataset)
pred_class = pred >= .5
print('eval acc:', accuracy_score(true_class, pred_class))
[Update 2, 2022-07-19:
I can also reproduce the behavior with the deprecated ImageDataGenerator, using
from tensorflow.keras.applications.resnet50 import preprocess_input
from keras_preprocessing.image import ImageDataGenerator
datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    validation_split=0.1,
)
def get_dataset(subset):
    print('get_dataset:', subset)
    return datagen.flow_from_directory(
        'data-nodup-1000',
        class_mode='binary',
        target_size=(224, 224),
        shuffle=True,
        seed=1,
        subset=subset,
    )
and
true_class = val_dataset.labels
]
[Update 4, 2022-07-21: Note that deactivating shuffling of validation data by setting shuffle=(subset == 'training') makes the two validation accuracies equal. This is not a workaround, however, because the validation set then consists only of class 1, since flow_from_directory doesn’t do stratification.
]
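The "only class 1" effect in Update 4 can be checked with a little arithmetic. This is a sketch under two assumptions that are not confirmed by the post: that the unshuffled file list is ordered class by class (class 0 first), and that the validation subset is simply the last 10% of that list. The class counts are the ones given for data-nodup-1000.

```python
# Counts from the question: 113 files of class 0, 887 files of class 1.
# Assumption: without shuffling, files are listed class by class.
labels = [0] * 113 + [1] * 887

val_fraction = 0.1
num_val = int(val_fraction * len(labels))

# Assumption: the validation subset is the tail of the unshuffled list.
train_labels = labels[:-num_val]
val_labels = labels[-num_val:]

val_counts = {c: val_labels.count(c) for c in (0, 1)}
print(val_counts)  # {0: 0, 1: 100}
```

Under these assumptions all 113 class-0 files land in the training part, so a model that always answers "class 1" would reach 100% validation accuracy, which is why the unshuffled split is of little use here.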
My environment:
- I am using all up-to-date libraries, like tensorflow 2.9.1 and sklearn 1.1.1 (via pip-compile -U).
- The folder data-nodup-1000 contains one subfolder with 113 files of class 0, and one subfolder with 887 files of class 1.
Answers:
There are a few points about your data that cause this:
- First, your data is highly imbalanced (roughly an 8 to 1 label ratio), which makes the model prone to overfitting and the CV estimate inaccurate.
- Second, in the get_dataset function, shuffle is set to True, so every time you call get_dataset(), it shuffles your data. Because (1) your validation set is very small and (2) your train/val split is not stratified over your labels, the validation metrics vary a lot due to this shuffling.
Suggestions to solve this:
- Call get_dataset() only once each for the train and val datasets before fitting the model, and save the results as variables. If there is no sequential order in your data, maybe set shuffle=False.
- (Optional) If possible, make your dataset more balanced with techniques such as data augmentation or over-/under-sampling.
def get_dataset(subset):
    return image_dataset_from_directory(
        'data-nodup-1000',
        labels="inferred",
        label_mode='binary',
        color_mode="rgb",
        image_size=(224, 224),
        shuffle=False,
        seed=0,
        validation_split=0.1,
        subset=subset,
        crop_to_aspect_ratio=False,
    )
train_dataset = get_dataset('training')
val_dataset = get_dataset('validation')
model.fit(
    train_dataset,
    steps_per_epoch=1,
    epochs=20,
    validation_data=val_dataset,
)
true_class = tf.concat([y for x, y in val_dataset], axis=0)
pred = model.predict(val_dataset)
pred_class = pred >= .5
print('eval acc:', accuracy_score(true_class, pred_class))
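Since this answer points at the missing stratification, here is a rough sketch of what a stratified train/val split over a list of (path, label) pairs could look like. stratified_split is a hypothetical helper written for illustration, not a TensorFlow or Keras function, and the file names are made up; only the class counts come from the question.

```python
import random
from collections import defaultdict


def stratified_split(samples, val_fraction, seed):
    """Split (path, label) pairs so each label keeps roughly the same share.

    Hypothetical helper, not part of TensorFlow or Keras.
    """
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append((path, label))

    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)  # shuffle within the class only
        num_val = round(len(group) * val_fraction)
        val.extend(group[:num_val])
        train.extend(group[num_val:])
    return train, val


# Made-up file names with the class counts from the question.
samples = [(f"class0/img{i}.jpg", 0) for i in range(113)]
samples += [(f"class1/img{i}.jpg", 1) for i in range(887)]

train, val = stratified_split(samples, val_fraction=0.1, seed=1)
val_class0 = sum(1 for _, label in val if label == 0)
print(len(train), len(val), val_class0)
```

The split is computed once and stored, so every later use sees the same validation files. Turning the two lists into image datasets is left out of the sketch.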
I have now found out that in TensorFlow 2.9.1 model.predict uses the second iteration of the dataset, which is shuffled differently than the first iteration!
It even uses the second iteration when I directly call model.predict(get_dataset('validation'))!
Therefore, the entries of true_class and pred do not match.
Switching to TensorFlow 2.10.0-rc3 and its tf.keras.utils.split_dataset makes the accuracies equal.
Here’s the updated code:
import tensorflow as tf
from sklearn.metrics import accuracy_score
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.preprocessing import image_dataset_from_directory
inputs = tf.keras.Input(shape=(224, 224, 3))
base_model = tf.keras.applications.ResNet50(weights='imagenet', include_top=False)
base_output = base_model(inputs)
base_model.trainable = False
out = Flatten(name='flat')(base_output)
out = Dense(1, activation='sigmoid')(out)
model = Model(inputs=inputs, outputs=out)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
dataset = image_dataset_from_directory(
    'data-synthetic',
    labels="inferred",
    label_mode='binary',
    color_mode="rgb",
    image_size=(224, 224),
    shuffle=True,
    seed=1,
    crop_to_aspect_ratio=False,
)
train_dataset, val_dataset = tf.keras.utils.split_dataset(dataset, right_size=0.1)
model.fit(
    train_dataset,
    steps_per_epoch=1,
    epochs=20,
    validation_data=val_dataset,
    max_queue_size=0,
    use_multiprocessing=False,
    workers=0,
)
true_class = tf.concat([y for x, y in val_dataset], axis=0)
pred = model.predict(val_dataset)
pred_class = pred >= .5
print('eval acc:', accuracy_score(true_class, pred_class))
which correctly yields:
Epoch 19/20
1/1 [==============================] - 0s 438ms/step - loss: 0.4426 - accuracy: 0.9062 - val_loss: 0.4658 - val_accuracy: 0.8800
Epoch 20/20
1/1 [==============================] - 0s 444ms/step - loss: 2.1619 - accuracy: 0.8438 - val_loss: 0.5886 - val_accuracy: 0.8900
4/4 [==============================] - 1s 87ms/step
eval acc: 0.89
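The mismatch described in this answer (labels read in one pass over the dataset, predictions made in another) can be reproduced without TensorFlow. The following is a simulation only: ReshufflingDataset is a hypothetical stand-in for a dataset that reshuffles on each iteration, and predict is a toy "model" that is always right, so any accuracy below 1.0 comes purely from misaligned labels and predictions.

```python
import random


class ReshufflingDataset:
    """Hypothetical stand-in for a dataset that reshuffles on every pass."""

    def __init__(self, examples, seed):
        self._examples = list(examples)
        self._rng = random.Random(seed)

    def __iter__(self):
        order = self._examples[:]
        self._rng.shuffle(order)  # a different order on each pass
        return iter(order)


def predict(x):
    """Toy 'model' that is always right: the label is encoded in x."""
    return x % 2


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Each example is (x, y) with y = x % 2, so a perfect model exists.
data = ReshufflingDataset([(x, x % 2) for x in range(1000)], seed=1)

# Two passes: labels come from one shuffle, predictions from another.
true_two_pass = [y for _, y in data]
pred_two_pass = [predict(x) for x, _ in data]
acc_two_pass = accuracy(true_two_pass, pred_two_pass)

# One pass: labels and predictions are collected from the same examples.
true_one_pass, pred_one_pass = [], []
for x, y in data:
    true_one_pass.append(y)
    pred_one_pass.append(predict(x))
acc_one_pass = accuracy(true_one_pass, pred_one_pass)

print(acc_two_pass, acc_one_pass)
```

The one-pass loop stays aligned no matter how the dataset shuffles. With a Keras model, the same pattern would presumably call model.predict_on_batch(x) inside the loop over the batches.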