Slow performance in hashing images in Python

Question:

I have a function that creates a difference hash for each image and stores the results in a dict in Python:

import glob
import dhash
from alive_progress import alive_bar
from wand.image import Image

def get_photo_hashes(dir='./'):
    file_list = {}
    imgs = glob.glob(dir + '*.jpg')
    total = len(imgs)  # glob.glob already returns a list
    with alive_bar(total) as bar:
        for i in imgs:
            with Image(filename=i) as image:
                row, col = dhash.dhash_row_col(image)
            hash_val = dhash.format_hex(row, col)
            file_list[i] = hash_val
            bar()
    return file_list

The performance of hashing a folder of 10,000 JPEG images of 500 kB – 1 MB each is surprisingly slow: around 2 hashes per second. How can I improve the performance of this function? Thanks.

Asked By: Raptor


Answers:

Multiprocessing would be ideal for this, as the hashing is CPU-intensive.

Here’s what I suggest:

import dhash
import glob
from wand.image import Image
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager

PATH = '*.jpg'

def makehash(t):
    filename, d = t
    with Image(filename=filename) as image:
        row, col = dhash.dhash_row_col(image)
        d[filename] = dhash.format_hex(row, col)

def main():
    with Manager() as manager:
        d = manager.dict()
        with ProcessPoolExecutor() as executor:
            executor.map(makehash, [(jpg, d) for jpg in glob.glob(PATH)])
        print(d)

if __name__ == '__main__':
    main()
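
Since executor.map already returns the workers' results, the Manager dict can optionally be dropped in favour of building the mapping from the return values. A sketch of that variant is below — note that hashlib.md5 of the raw file bytes is only a stand-in for the real dhash/Wand call, so the snippet runs without ImageMagick installed:

```python
# Sketch: build the {filename: hash} dict from executor.map() return values
# instead of sharing a Manager dict between processes.
import glob
import hashlib
from concurrent.futures import ProcessPoolExecutor

def makehash(filename):
    # Stand-in for the real dhash computation: hash the raw file bytes.
    with open(filename, 'rb') as f:
        digest = hashlib.md5(f.read()).hexdigest()
    return filename, digest

def hash_all(pattern='*.jpg'):
    # Each worker returns a (filename, hash) pair; dict() collects them.
    with ProcessPoolExecutor() as executor:
        return dict(executor.map(makehash, glob.glob(pattern)))

if __name__ == '__main__':
    print(hash_all())
```

This avoids the proxy-object round-trips of a Manager dict: each worker just returns its pair, and the parent assembles the result.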

Some stats:

I have a folder containing 129 JPGs with an average size of over 12 MB each. The net processing time is ~19 s.

Answered By: Vlad

I like @JCaesar's answer a lot and decided to have a play with it. I created 1,000 JPEGs of around 500 kB each with:

parallel magick -size 640x480 xc: +noise random {}.jpg ::: {1..1000}
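
If ImageMagick and GNU parallel are not to hand, a comparable batch of noise JPEGs can be generated with PIL alone. A sketch (the 640×480 size matches the command above; os.urandom supplies the noise, and the helper name is mine):

```python
# Sketch: generate random-noise test JPEGs with PIL instead of ImageMagick.
import os
from PIL import Image

def make_noise_jpegs(n, outdir='.', size=(640, 480)):
    for i in range(1, n + 1):
        # An RGB image needs width * height * 3 bytes of pixel data.
        noise = Image.frombytes('RGB', size, os.urandom(size[0] * size[1] * 3))
        noise.save(os.path.join(outdir, f'{i}.jpg'), quality=95)

if __name__ == '__main__':
    make_noise_jpegs(10)  # 1,000 in the answer; 10 keeps the demo quick
```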

Then I tried the code from his answer using Wand and got 21.3 s for 1,000 images. I then switched to PIL and, using the same images, the time dropped to 9.6 s.

I then had a think and realised that the perceptual-hash algorithm converts the image to greyscale and shrinks it to 8×8 pixels – and that the JPEG library has a "shrink-on-load" feature which you can use in PIL by calling Image.draft(newMode, newSize). That reduces the time to load and also the amount of I/O needed. Enabling that feature further reduces the time to 6 s for the same images. The code looks like this:

#!/usr/bin/env python3

# https://stackoverflow.com/a/70709538/2836621

import dhash
import glob
from PIL import Image
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager

PATH = '*.jpg'

def makehash(t):
    filename, d = t
    with Image.open(filename) as image:
        image.draft('L', (32, 32))
        row, col = dhash.dhash_row_col(image)
        d[filename] = dhash.format_hex(row, col)

def main():
    with Manager() as manager:
        d = manager.dict()
        with ProcessPoolExecutor() as executor:
            executor.map(makehash, [(jpg, d) for jpg in glob.glob(PATH)])
        print(d)

if __name__ == '__main__':
    main()
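
To see what draft() actually does, here is a small self-contained check (an illustrative sketch: it writes a black 640×480 JPEG to memory and asks the decoder for a greyscale image near 32×32; the JPEG decoder can only shrink by factors of 2, 4, or 8 on load, so the image comes back greyscale at a reduced size rather than exactly 32×32):

```python
# Sketch: demonstrate PIL's JPEG "shrink-on-load" via Image.draft().
from io import BytesIO
from PIL import Image

buf = BytesIO()
Image.new('RGB', (640, 480)).save(buf, format='JPEG')
buf.seek(0)

with Image.open(buf) as im:
    im.draft('L', (32, 32))  # ask the decoder for greyscale, about 32x32
    im.load()                # decode now, at the reduced scale
    print(im.mode, im.size)  # greyscale, shrunk by the JPEG decoder
```

Because the shrinking happens inside the JPEG decoder, far fewer pixels are ever decoded, which is where the time saving comes from.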
Answered By: Mark Setchell