UnidentifiedImageError: cannot identify image file

Question

Hello, I am training a model with TensorFlow and Keras. The dataset was downloaded from https://www.microsoft.com/en-us/download/confirmation.aspx?id=54765

It is a zip archive that I split into the following directories:

.
├── test
│   ├── Cat
│   └── Dog
└── train
    ├── Cat
    └── Dog

test/Cat and test/Dog each contain 1000 JPG photos, and train/Cat and train/Dog each contain 11500 JPG photos.

The data is loaded with this code:

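# Import assumed; it was not shown in the original post
from tensorflow.keras.preprocessing.image import ImageDataGenerator

batch_size = 16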

# Data augmentation and preprocess
train_datagen = ImageDataGenerator(rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    validation_split=0.20) # set validation split

# Train dataset
train_generator = train_datagen.flow_from_directory(
    'PetImages/train',
    target_size=(244, 244),
    batch_size=batch_size,
    class_mode='binary',
    subset='training') # set as training data

# Validation dataset
validation_generator = train_datagen.flow_from_directory(
    'PetImages/train',
    target_size=(244, 244),
    batch_size=batch_size,
    class_mode='binary',
    subset='validation') # set as validation data

test_datagen = ImageDataGenerator(rescale=1./255)
# Test dataset
test_datagen = test_datagen.flow_from_directory(
    'PetImages/test')

The model is trained with the following code:

history = model.fit(train_generator,
                    validation_data=validation_generator,
                    epochs=5)

And I get the following output:

Epoch 1/5
1150/1150 [==============================] - ETA: 0s - loss: 0.0505 - accuracy: 0.9906

But when the epoch reaches this point, I get the following error:

UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7f9e185347d0>

How can I solve this so that the training can finish?

Thanks

Asked By: Tlaloc-ES

Answer 1

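Try this script to check whether all the images are in a readable format. It stops with an exception at the first file that cannot be opened, and the last path printed is the offending file.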

import os
from PIL import Image
folder_path = 'dataimg'
extensions = []
for fldr in os.listdir(folder_path):
    sub_folder_path = os.path.join(folder_path, fldr)
    for filee in os.listdir(sub_folder_path):
        file_path = os.path.join(sub_folder_path, filee)
        # "\r" returns to the start of the line so each path overwrites the previous one
        print('** Path: {}  **'.format(file_path), end="\r", flush=True)
        im = Image.open(file_path)
        rgb_im = im.convert('RGB')
        # splitext also handles names with no dot or with several dots
        ext = os.path.splitext(filee)[1].lstrip('.')
        if ext not in extensions:
            extensions.append(ext)

Answered By: Aniket Bote

Answer 2

I have run into this problem before, so I wrote a Python script that checks the training and test directories for valid image files. It first checks the file extension, which must be one of jpg, png, bmp or gif. It then tries to read the image with cv2; if the file is not a valid image, the read fails and the file is reported. In each case the bad file name is printed. When the script finishes, the list bad_list contains the paths of all bad files. Note that the directories must be named ‘test’ and ‘train’.

import os
import cv2
bad_list=[]
dir=r'c:\PetImages'
subdir_list=os.listdir(dir) # create a list of the sub directories in the directory ie train or test
for d in subdir_list:  # iterate through the sub directories train and test
    dpath=os.path.join (dir, d) # create path to sub directory
    if d in ['test', 'train']:
        class_list=os.listdir(dpath) # list of classes ie dog or cat
       # print (class_list)
        for klass in class_list: # iterate through the two classes
            class_path=os.path.join(dpath, klass) # path to class directory
            #print(class_path)
            file_list=os.listdir(class_path) # create list of files in class directory
            for f in file_list: # iterate through the files
                fpath=os.path.join (class_path,f)
                index=f.rfind('.') # find index of period infilename
                ext=f[index+1:] # get the files extension
                if ext  not in ['jpg', 'png', 'bmp', 'gif']:
                    print(f'file {fpath}  has an invalid extension {ext}')
                    bad_list.append(fpath)                    
                else:
                    # cv2.imread does not raise on an unreadable file; it returns None
                    img=cv2.imread(fpath)
                    if img is None:
                        print(f'file {fpath} is not a valid image file ')
                        bad_list.append(fpath)

print (bad_list)

Answered By: Gerry P

Answer 3

You may have a corrupt image. In the data preprocessing step, try opening every image with Image.open() to see whether they can all be opened.
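A minimal sketch of that check, under a few assumptions: Pillow is installed, the images sit in subfolders as in the question, and the helper name find_unreadable_images is made up for this illustration (it is not part of Pillow or Keras). Because Image.open() reads lazily, the sketch also calls load() to force a full decode, so truncated files should be reported as well.

```python
import os
from PIL import Image


def find_unreadable_images(root_dir):
    """Return sorted paths under root_dir that Pillow cannot open and decode.

    Hypothetical helper for illustration; not part of Pillow or Keras.
    """
    bad_paths = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Image.open() is lazy, so load() forces the pixel data to be decoded.
                with Image.open(path) as im:
                    im.load()
            except Exception:  # e.g. UnidentifiedImageError or truncated data
                bad_paths.append(path)
    return sorted(bad_paths)


if __name__ == "__main__":
    # 'PetImages/train' is the training directory from the question.
    for path in find_unreadable_images("PetImages/train"):
        print(path)
```

The function only reports paths and deletes nothing, so the list can be reviewed before any file is removed.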

Answered By: Laur

Answer 4

I don’t know if this is still relevant, but for people who encounter the same problem in the future:

In this specific situation, there are two corrupted files in the dog_cat dataset:

cats/666.jpg
dogs/11702.jpg

Just remove them and it will work.
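If you would rather remove them from a script than by hand, something like the following might do. It is only a sketch: the function name remove_known_bad_files and the dataset_root argument are assumptions for illustration, and the relative paths are copied from the list above, so the folder names may need adjusting (for example to Cat and Dog) to match your copy of the dataset.

```python
from pathlib import Path

# Relative paths as listed in the answer; adjust the folder names
# to match your own directory layout.
KNOWN_BAD_FILES = ["cats/666.jpg", "dogs/11702.jpg"]


def remove_known_bad_files(dataset_root, relative_paths=KNOWN_BAD_FILES):
    """Delete each listed file if it exists; return the paths actually removed.

    Hypothetical helper for illustration only.
    """
    removed = []
    for rel in relative_paths:
        path = Path(dataset_root) / rel
        if path.is_file():
            path.unlink()  # permanently deletes the file
            removed.append(str(path))
    return removed
```

Files that are not present are skipped silently, so running it twice is harmless.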

Answered By: I. Ali

Answer 5

Instead of collecting the corrupted files in a list, we can also delete each one as soon as the error occurs:

import os
from PIL import Image
folder_path = r"C:\Users\ImageDatasets"
extensions = []
corupt_img_paths=[]
for fldr in os.listdir(folder_path):
    sub_folder_path = os.path.join(folder_path, fldr)
    for filee in os.listdir(sub_folder_path):
        file_path = os.path.join(sub_folder_path, filee)
        # "\r" returns to the start of the line so each path overwrites the previous one
        print('** Path: {}  **'.format(file_path), end="\r", flush=True)
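        try:
            # the with-block closes the file so it can be deleted afterwards
            with Image.open(file_path) as im:
                rgb_im = im.convert('RGB')  # forces a full decode; Image.open alone is lazy
        except Exception:
            print(file_path)
            os.remove(file_path)  # permanently deletes the unreadable file
            continue
        # splitext also handles names with no dot or with several dots
        ext = os.path.splitext(filee)[1].lstrip('.')
        if ext not in extensions:
            extensions.append(ext)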

Answered By: KRISHNENDU
