Find duplicate images in the fastest way

Question:

I have two image folders containing 10k and 35k images. Each image is approximately 2k × 2k pixels.
I want to remove the images which are exact duplicates.
The variation between different images is just a change in some pixels.
I have tried DHashing, PHashing and AHashing, but as they are lossy image hashing techniques they give the same hash for non-duplicate images too.
I also tried writing code in Python which simply subtracts images; the pairs for which the resulting array is zero everywhere are duplicates of each other.
But a single comparison takes 0.29 seconds, and for the total of 350 million combinations the runtime is really huge.
Is there a way to do it faster without also flagging non-duplicate images?
I am open to doing it in any language (C, C++) and any approach (distributed computing, multithreading) which can solve my problem accurately.
Apologies if I added some irrelevant approaches, as I am not from a computer science background.
Below is the code I used for python approach –

import os
import timeit

import numpy as np
from skimage import io

start = timeit.default_timer()
duplicates = {}  # renamed to avoid shadowing the built-in `dict`
for i in path1:
    img1 = io.imread(i)
    base1 = os.path.basename(i)
    for j in path2:
        img2 = io.imread(j)
        base2 = os.path.basename(j)
        # np.array_equal already checks shape and every pixel, so the
        # extra subtract-and-test step was redundant and has been removed.
        if np.array_equal(img1, img2):
            duplicates[base1] = base2
stop = timeit.default_timer()
print('Time: ', stop - start)
Asked By: Bing


Answers:

You should look up how to delete duplicate files (not only images). Then you can use, for example, fdupes, or find alternative software: https://alternativeto.net/software/fdupes/
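fdupes finds byte-identical files. The same idea can be sketched in Python (illustrative, not fdupes itself); note that byte-identical is stricter than pixel-identical, so the same image saved with different compression settings would be missed:

```python
import hashlib
from collections import defaultdict

def find_byte_identical(paths):
    """Group files whose raw bytes hash to the same SHA-256 digest."""
    groups = defaultdict(list)
    for path in paths:
        with open(path, 'rb') as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        groups[digest].append(path)
    # Keep only groups that actually contain duplicates.
    return [group for group in groups.values() if len(group) > 1]
```

Hashing is linear in the total data size, so 45k files are processed in a single pass rather than 350 million pairwise comparisons.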

Answered By: Ihor Drachuk

Use lossy hashing as a prefiltering step before a complete comparison. You can also generate thumbnail images (say, 12 × 8 pixels) and compare those for similarity.

The idea is to quickly reject very different images.
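A minimal sketch of this two-stage idea in Python (using Pillow and NumPy; the 12 × 8 thumbnail size follows the suggestion above, and the function names are illustrative). Because resizing is deterministic, pixel-identical images always produce identical thumbnails, so the prefilter cannot miss a true duplicate; it only narrows down which pairs need the expensive full comparison:

```python
from collections import defaultdict

import numpy as np
from PIL import Image

def thumb_key(path, size=(12, 8)):
    # Tiny grayscale thumbnail used as a coarse signature.
    img = Image.open(path).convert('L').resize(size)
    return np.asarray(img, dtype=np.uint8).tobytes()

def group_candidates(paths):
    # Bucket images by thumbnail; only buckets with more than one
    # member need a full-resolution comparison.
    groups = defaultdict(list)
    for p in paths:
        groups[thumb_key(p)].append(p)
    return {k: v for k, v in groups.items() if len(v) > 1}

def exact_duplicates(paths):
    # Resolve each candidate bucket with a full pixel comparison.
    pairs = []
    for bucket in group_candidates(paths).values():
        for i in range(len(bucket)):
            a = np.asarray(Image.open(bucket[i]))
            for j in range(i + 1, len(bucket)):
                b = np.asarray(Image.open(bucket[j]))
                if a.shape == b.shape and np.array_equal(a, b):
                    pairs.append((bucket[i], bucket[j]))
    return pairs
```

With mostly distinct images, almost every pair is rejected at the thumbnail stage, so very few of the 350 million combinations ever reach the full 2k × 2k comparison.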

Answered By: Yves Daoust

This code checks whether there are any duplicates in a folder (it is a bit slow, though):

    import os
    import time

    import cv2
    from sewar.full_ref import rmse


    def check(path_original, path_new):
        # An RMSE of 0.0 means the two images are pixel-identical.
        original = cv2.imread(path_original)
        new = cv2.imread(path_new)
        return rmse(original, new)


    def folder_check(folder_path):
        file_list = os.listdir(folder_path)
        duplicate_dict = {}
        # Compare each unordered pair exactly once, instead of mutating
        # file_list while iterating over it.
        for i, file in enumerate(file_list):
            file_path = os.path.join(folder_path, file)
            for file_compare in file_list[i + 1:]:
                file_compare_path = os.path.join(folder_path, file_compare)
                similarity_score = check(file_path, file_compare_path)
                if similarity_score == 0.0:
                    print(file, file_compare)
                    duplicate_dict[file] = file_compare
        return duplicate_dict


    start_time = time.time()
    print(folder_check(r"C:\Users\Admin\Linear-Regression-1\image-similarity-measures\input1"))
    end_time = time.time()
    print(end_time - start_time)
Answered By: Balagopal