Safest way to generate a unique hash in Python

Question:

I need to produce unique identifiers that can be used in filenames and can be reproduced given the same input values. I need to produce millions of these identifiers as the source input has millions of combinations.

For simplicity’s sake, I will use a small set in the example, but the actual sets can be rather large (hundreds, maybe thousands, of items); larger than could be manually encoded into a filename.

I noticed that the 5th method of generating UUID’s allows you to provide a string input.

> input_set = {'apple', 'banana', 'orange'}
> uuid.uuid5(uuid.NAMESPACE_URL, pickle.dumps(input_set)).hex
'f39926529ad45997984643816c1bc403'

The documentation says it uses SHA1 under the hood. Is the risk of a collision too high? Is there a better way of reliably hashing unique identifiers?

Asked By: Brendan Abel

||

Answers:

The odds that you’d get an SHA1 collision from strings is astoundingly low. Currently there are less than 63 known collisions for SHA1.

First ever SHA1 collision found

First ever’ SHA-1 hash collision calculated. All it took were five clever brains… and 6,610 years of processor time

SHA1 is no longer considered secure in the cryptography world, but certainly exceeds your expectations here.

Cryptographic hashing functions are designed to be one way functions.This means the functions inverse is "hard" to calculate. (i.e. knowing the output in no way helps you determine the input) As Blender pointed out in the comments this has nothing to do with the chance of collisions.

Take a look at the Birthday Paradox for some basic information on how the probability of a collision is calculated.

This question addresses the likely hood of a SHA1 collision. This article states

A cryptographic hash function has provable security against collision attacks if finding collisions is provably polynomial-time reducible from problem P which is supposed to be unsolvable in polynomial time. The function is then called provably secure, or just provable.

Here is a list of "secure" hash algorithms.

UPDATE
You stated in the comments your input is much larger than the 160 bit limit for SHA1. I recommend you use SHA3 in this case as there is no limit on the size of your input. Check out the Python documentation for more information.

Here is a basic example:

import sha3
k = sha3.keccak_512()
k.update(b"data")
k.hexdigest()
'1065aceeded3a5e4412e2187e919bffeadf815f5bd73d37fe00d384fe29f55f08462fdabe1007b993ce5b8119630e7db93101d9425d6e352e22ffe3dcb56b825'
Answered By: DoesData

Instead of using pysha3 (see DoesData‘s answer), you could also use the built-in hashlib:

import hashlib

h = hashlib.sha3_512() # Python 3.6+
h.update(b"Hello World")
h.hexdigest()

Output:

'3d58a719c6866b0214f96b0a67b37e51a91e233ce0be126a08f35fdf4c043c6126f40139bfbc338d44eb2a03de9f7bb8eff0ac260b3629811e389a5fbee8a894'
Answered By: Thomas

If the smaller base64.urlsafe_b64encode output would be preferable:

> import base64, hashlib

> base64.urlsafe_b64encode(hashlib.sha3_512('asdf'.encode()).digest())
b'jYjPWyD1Os164UebWzbcICF1OwSZAsdyR7snsTGzAL08qL7vKHVtzie4mQhnxFd6JTXn47dRQTmcoalMyEsOuQ=='

The above output is of length 88 whereas the corresponding hex would be of length 128.

Answered By: Asclepius
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.