Removing all duplicate images with different filenames from a directory

Question

I am trying to iterate through a folder and delete any file that is a duplicate image (but has a different name). After running this script, all files get deleted except for one, even though at least a dozen of the roughly 5,000 files are unique. Any help understanding why this is happening would be appreciated.

import os
import cv2 

directory = r'C:UsersGridscratch'
 
for filename in os.listdir(directory):
    a=directory+'\\'+filename
    n=(cv2.imread(a))
    q=0
    for filename in os.listdir(directory):
        b=directory+'\\'+filename
        m=(cv2.imread(b))
        comparison = n == m
        equal_arrays = comparison.all()
        if equal_arrays==True and q==1:
            os.remove(b)
        q=1

Asked By: Kuba

Answer 1

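There are a few issues with your code. Comparing two NumPy arrays of the same shape with == gives a boolean array, so comparison.all() does work, but the q flag only skips the first file in the listing. Every other image is eventually compared with itself in the inner loop, found equal, and deleted, which is why a single file survives.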
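One way to see why only a single file survives is to simulate the question's nested loops without OpenCV. This is only a sketch: a plain dict stands in for the folder, and the function name, file names, and byte strings are made up for illustration.

```python
def run_question_logic(files):
    """Simulate the question's nested loops on a dict of name -> content."""
    files = dict(files)              # work on a copy of the "folder"
    for name_a in list(files):       # outer listing is taken once, like os.listdir
        n = files.get(name_a)        # None if already deleted (cv2.imread also gives None)
        q = 0
        for name_b in list(files):   # inner listing is refreshed on every pass
            m = files[name_b]
            if n == m and q == 1:    # q only protects the first entry of the listing
                del files[name_b]    # so a file that matches itself is deleted too
            q = 1
    return files


folder = {"a.png": b"cat", "b.png": b"dog", "c.png": b"cat", "d.png": b"bird"}
print(run_question_logic(folder))    # only "a.png" survives
```

Here "b.png" and "d.png" are unique, yet both are removed as soon as the inner loop reaches the same name as the outer loop.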

A few pointers: you only need to list the directory contents once. It would also be much simpler to compute an MD5 or SHA-1 hash of each file while iterating over the directory and remove any file whose hash has already been seen.

For example:

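import hashlib
import os

# `directory` is the folder path defined in the question
hashes = set()

# list the directory once and hash each file's raw bytes
for filename in os.listdir(directory):
    path = os.path.join(directory, filename)
    if not os.path.isfile(path):
        continue  # skip subdirectories
    with open(path, 'rb') as f:
        digest = hashlib.sha1(f.read()).digest()
    if digest not in hashes:
        hashes.add(digest)  # first time this content is seen: keep the file
    else:
        os.remove(path)  # same bytes as an earlier file: delete the duplicate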

You can use a stronger hash such as SHA-256 if you would like, but for accidental duplicates the chances of encountering a SHA-1 collision are astronomically low.
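If you do want the stronger hash, the same idea might look like the sketch below. The function name remove_duplicate_files and its dry_run parameter are hypothetical, added here so the deletions can be previewed first; files are read in chunks so large images are not loaded into memory at once. Note that this compares raw file bytes, so the same picture saved with different encoding or metadata would not count as a duplicate.

```python
import hashlib
import os


def remove_duplicate_files(directory, dry_run=True):
    """Return paths whose bytes duplicate an earlier file; delete them unless dry_run."""
    seen = set()
    duplicates = []
    # sorted() makes it predictable which copy is kept (the first by name)
    for filename in sorted(os.listdir(directory)):
        path = os.path.join(directory, filename)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        hasher = hashlib.sha256()
        with open(path, "rb") as f:
            # read 1 MiB at a time until read() returns an empty bytes object
            for chunk in iter(lambda: f.read(1 << 20), b""):
                hasher.update(chunk)
        digest = hasher.digest()
        if digest in seen:
            duplicates.append(path)
            if not dry_run:
                os.remove(path)
        else:
            seen.add(digest)
    return duplicates
```

Calling remove_duplicate_files(directory) only reports what would be removed; pass dry_run=False once the list looks right.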

Answered By: Alexander
