Slow performance when hashing images in Python
Question:
I have a function in Python that creates an image difference hash for each file and stores the results in a dict:
import glob
import dhash
from alive_progress import alive_bar
from wand.image import Image

def get_photo_hashes(dir='./'):
    file_list = {}
    imgs = glob.glob(dir + '*.jpg')
    total = len(imgs)
    with alive_bar(total) as bar:
        for i in imgs:
            with Image(filename=i) as image:
                row, col = dhash.dhash_row_col(image)
                hash_val = dhash.format_hex(row, col)
                file_list[i] = hash_val
            bar()
    return file_list
Hashing a folder of 10,000 JPEG images of 500 kB – 1 MB each is surprisingly slow, around 2 hashes per second. How can I improve the performance of this function? Thanks.
Answers:
Multiprocessing would be ideal for this, as the hashing is CPU-intensive.
Here’s what I suggest:
import dhash
import glob
from wand.image import Image
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager

PATH = '*.jpg'

def makehash(t):
    filename, d = t
    with Image(filename=filename) as image:
        row, col = dhash.dhash_row_col(image)
        d[filename] = dhash.format_hex(row, col)

def main():
    with Manager() as manager:
        d = manager.dict()
        with ProcessPoolExecutor() as executor:
            executor.map(makehash, [(jpg, d) for jpg in glob.glob(PATH)])
        print(d)

if __name__ == '__main__':
    main()
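Not part of the original answer, but a possible simplification: ProcessPoolExecutor.map already ships each worker's return value back to the parent process, so you can drop the Manager dict entirely and have each worker return a (filename, hash) pair instead. A minimal sketch of that variant:

import dhash
import glob
from wand.image import Image
from concurrent.futures import ProcessPoolExecutor

PATH = '*.jpg'

def makehash(filename):
    # Return the result instead of writing to shared state
    with Image(filename=filename) as image:
        row, col = dhash.dhash_row_col(image)
    return filename, dhash.format_hex(row, col)

def main():
    with ProcessPoolExecutor() as executor:
        # Collect the (filename, hash) pairs into a dict in the parent
        results = dict(executor.map(makehash, glob.glob(PATH)))
    print(results)

if __name__ == '__main__':
    main()

This avoids funnelling every write through the manager process, which can add up when hashing tens of thousands of files.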
Some stats:
I have a folder containing 129 JPEGs, averaging over 12 MB each. The net processing time is ~19 s.
I like @JCaesar's answer a lot, and decided to have a play with it. I created 1,000 JPEGs of around 500 kB each with:
parallel magick -size 640x480 xc: +noise random {}.jpg ::: {1..1000}
Then I tried the code from his answer using Wand, and got 21.3 s for 1,000 images. I then switched to PIL and, using the same images, the time dropped to 9.6 s. I then had a think and realised that the difference-hash algorithm converts the image to greyscale and shrinks it to roughly 8×8 pixels, and that the JPEG library has a "shrink-on-load" feature which you can use in PIL by calling Image.draft(mode, size). That reduces the time to load and also the amount of I/O needed. Enabling that feature further reduces the time to 6 s for the same images. The code looks like this:
#!/usr/bin/env python3
# https://stackoverflow.com/a/70709538/2836621

import dhash
import glob
from PIL import Image
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager

PATH = '*.jpg'

def makehash(t):
    filename, d = t
    with Image.open(filename) as image:
        # Ask the JPEG decoder for a small greyscale version up front
        image.draft('L', (32, 32))
        row, col = dhash.dhash_row_col(image)
        d[filename] = dhash.format_hex(row, col)

def main():
    with Manager() as manager:
        d = manager.dict()
        with ProcessPoolExecutor() as executor:
            executor.map(makehash, [(jpg, d) for jpg in glob.glob(PATH)])
        print(d)

if __name__ == '__main__':
    main()
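If you want to see the shrink-on-load effect in isolation, you can check what draft() does to a single image before any pixels are decoded. The filename below is just a placeholder for one of the generated test images:

from PIL import Image

# '1.jpg' stands in for any 640x480 JPEG from the test set
with Image.open('1.jpg') as im:
    print(im.mode, im.size)   # e.g. RGB (640, 480)
    im.draft('L', (32, 32))
    print(im.mode, im.size)   # e.g. L (80, 60): the decoder will work at 1/8 scale
    im.load()                 # pixels are decoded here, already greyscale and small

Because the JPEG decoder can only scale by 1/2, 1/4 or 1/8, the loaded image is still larger than 32×32, but dhash's own downscaling handles the rest.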