Removing all duplicate images with different filenames from a directory

Question:

I am trying to iterate through a folder and delete any file that is a duplicate image (but different name). After running this script all files get deleted except for one. There are at least a dozen unique ones out of about 5,000. Any help understanding why this is happening would be appreciated.

import os
import cv2 

directory = r'C:\Users\Grid\scratch'
 
for filename in os.listdir(directory):
    a=directory+'\\'+filename
    n=(cv2.imread(a))
    q=0
    for filename in os.listdir(directory):
        b=directory+'\\'+filename
        m=(cv2.imread(b))
        comparison = n == m
        equal_arrays = comparison.all()
        if equal_arrays==True and q==1:
            os.remove(b)
        q=1
Asked By: Kuba

Answers:

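There are a few issues with your code. The inner loop compares each image against every file in the directory, including the image itself, and q only protects the first file in the listing, so every other file matches itself and is removed. The comparison is also fragile: n == m is an element-wise NumPy array only when both files load as images. cv2.imread returns None for anything it cannot read, so if both reads fail the result is a plain bool and comparison.all() raises an AttributeError.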
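The loop logic from the question can be reproduced without any images, which makes it easier to see why only one file survives. This is a minimal sketch, not the original script: the dict of filename to content stands in for the directory, and the names buggy_dedupe and pairwise_dedupe are hypothetical.

```python
def buggy_dedupe(files):
    """Mimic the question's nested loops on a {filename: content} dict."""
    files = dict(files)
    for outer in list(files):           # outer listing is taken once
        n = files.get(outer)            # None if already removed
        q = 0
        for inner in list(files):       # inner listing is re-read each time
            m = files[inner]
            # q only shields the first file in the listing, so any other
            # file is "equal" to itself here and gets removed.
            if n == m and q == 1:
                del files[inner]
            q = 1
    return sorted(files)


def pairwise_dedupe(files):
    """Compare each file only against the files that come after it."""
    files = dict(files)
    names = list(files)
    for i, outer in enumerate(names):
        if outer not in files:
            continue                    # already removed as a duplicate
        for inner in names[i + 1:]:
            if inner in files and files[inner] == files[outer]:
                del files[inner]
    return sorted(files)
```

With three files that are all different, buggy_dedupe keeps only the first one, while pairwise_dedupe keeps all three. The pairwise version is still quadratic in the number of files, which is one reason hashing is attractive for about 5,000 images.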

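A few pointers: you only need to read the directory contents once. It would also be much simpler to compute an MD5 or SHA-1 hash of each file while iterating over the directory and remove duplicates along the way. Note that this detects byte-identical files, not images that look the same but were encoded differently.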

For example:

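import hashlib
import os

directory = r'C:\Users\Grid\scratch'  # folder from the question

hashes = set()  # digests of the files seen so far

# List the directory once; the first file with a given digest is kept.
for filename in os.listdir(directory):
    path = os.path.join(directory, filename)
    if not os.path.isfile(path):
        continue  # skip subdirectories
    with open(path, 'rb') as f:
        digest = hashlib.sha1(f.read()).digest()
    if digest not in hashes:
        hashes.add(digest)
    else:
        os.remove(path)  # same bytes as an earlier file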

You can use a stronger hash such as SHA-256 if you like, but the chance of two different files colliding by accident with SHA-1 is astronomically low.
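If deleting files in the same pass feels risky, a more cautious variant is to collect the duplicates first and review them before removing anything. The sketch below assumes SHA-256 and reads each file in chunks so large images are not loaded whole. The helper names file_digest and find_duplicates and the 1 MiB chunk size are illustrative choices, not part of the approach above.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.digest()


def find_duplicates(directory):
    """Return paths whose bytes match an earlier file (in sorted name order)."""
    seen = {}        # digest -> first path with that content
    duplicates = []  # later paths with the same content
    for filename in sorted(os.listdir(directory)):
        path = os.path.join(directory, filename)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        digest = file_digest(path)
        if digest in seen:
            duplicates.append(path)
        else:
            seen[digest] = path
    return duplicates
```

After checking the returned list, each path in it can be passed to os.remove. Sorting the listing makes it predictable which copy is kept.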

Answered By: Alexander