Deterministic hashing in Python 3
Question:
I’m using hashing of strings for seeding random states in the following way:
import numpy as np
context = "string"
seed = hash(context) % 4294967295 # This is necessary to keep the hash within allowed seed values
np.random.seed(seed)
This is unfortunately (for my usage) non-deterministic between runs in Python 3.3 and up. I know I could set the PYTHONHASHSEED environment variable to an integer value to regain determinism, but I would prefer something that feels less hacky and doesn’t entirely disregard the extra security added by random hashing. Suggestions?
Answers:
Use a purpose-built hash function. zlib.adler32() is an excellent choice (in Python 3 it takes bytes, so encode the string first); alternatively, check out the hashlib module for more options.
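A minimal sketch of the zlib.adler32() suggestion, assuming the context string is encoded as UTF-8. The helper name adler32_seed is made up for illustration and is not part of any library:

```python
import zlib

def adler32_seed(context: str) -> int:
    """Return a run-independent 32-bit seed derived from a string (hypothetical helper)."""
    # zlib.adler32 works on bytes in Python 3, so encode the text first.
    checksum = zlib.adler32(context.encode("utf-8"))
    # Mask to an unsigned 32-bit value; this is a no-op on Python 3,
    # where the result is already unsigned.
    return checksum & 0xFFFFFFFF

seed = adler32_seed("string")
print(seed)  # same value on every run, regardless of PYTHONHASHSEED
```

The result always lies in the range 0 to 2**32 - 1, so it can be passed to np.random.seed(seed) without a further modulo.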
Forcing Python’s built-in hash to be deterministic is intrinsically hacky. If you want to avoid hackitude, use a different hashing function, e.g. one from hashlib (Python 2: https://docs.python.org/2/library/hashlib.html, Python 3: https://docs.python.org/3/library/hashlib.html).
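One way the hashlib route might look, as a sketch only: the choice of SHA-256, the UTF-8 encoding, and the helper name sha256_seed are assumptions made here for illustration, not something the answer prescribes. A cryptographic digest also tends to spread short inputs more evenly over the 32-bit range than a simple checksum does.

```python
import hashlib

def sha256_seed(context: str) -> int:
    """Derive a stable 32-bit seed from a string (hypothetical helper)."""
    # hashlib digests do not depend on PYTHONHASHSEED, so this is
    # reproducible across runs and machines.
    digest = hashlib.sha256(context.encode("utf-8")).digest()
    # Interpret the first 4 bytes as an unsigned big-endian integer,
    # which gives a value in the range 0 .. 2**32 - 1.
    return int.from_bytes(digest[:4], "big")

seed = sha256_seed("string")
print(seed)
```

With NumPy installed, the value can then be handed to np.random.seed(seed), which accepts integers from 0 to 2**32 - 1.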
You can actually use a string as seed for random.Random:
>>> import random
>>> r = random.Random('string'); [r.randrange(10) for _ in range(20)]
[0, 6, 3, 6, 4, 4, 6, 9, 9, 9, 9, 9, 5, 7, 5, 3, 0, 4, 8, 1]
>>> r = random.Random('string'); [r.randrange(10) for _ in range(20)]
[0, 6, 3, 6, 4, 4, 6, 9, 9, 9, 9, 9, 5, 7, 5, 3, 0, 4, 8, 1]
>>> r = random.Random('string'); [r.randrange(10) for _ in range(20)]
[0, 6, 3, 6, 4, 4, 6, 9, 9, 9, 9, 9, 5, 7, 5, 3, 0, 4, 8, 1]
>>> r = random.Random('another_string'); [r.randrange(10) for _ in range(20)]
[8, 7, 1, 8, 3, 8, 6, 1, 6, 5, 5, 3, 3, 6, 6, 3, 8, 5, 8, 4]
>>> r = random.Random('another_string'); [r.randrange(10) for _ in range(20)]
[8, 7, 1, 8, 3, 8, 6, 1, 6, 5, 5, 3, 3, 6, 6, 3, 8, 5, 8, 4]
>>> r = random.Random('another_string'); [r.randrange(10) for _ in range(20)]
[8, 7, 1, 8, 3, 8, 6, 1, 6, 5, 5, 3, 3, 6, 6, 3, 8, 5, 8, 4]
This can be convenient, e.g. for using the basename of an input file as the seed: for the same input file, the generated numbers will always be the same.
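The basename idea could be sketched as follows. The helper rng_for_file and the file paths are hypothetical. The sketch relies on Python 3 seeding random.Random from the contents of the string rather than from hash(), so PYTHONHASHSEED does not affect it:

```python
import os
import random

def rng_for_file(path: str) -> random.Random:
    """Return a generator seeded with the file's basename (hypothetical helper)."""
    # Only the final path component is used, so the same file name
    # yields the same sequence no matter which directory it lives in.
    return random.Random(os.path.basename(path))

first = rng_for_file("/data/run1/input.csv")
second = rng_for_file("/backup/input.csv")

# Both generators were seeded with "input.csv", so their outputs match.
same = [first.randrange(10) for _ in range(5)] == [second.randrange(10) for _ in range(5)]
print(same)
```

Note that this seeds an independent random.Random instance, not NumPy's global state; for np.random.seed an integer seed is still needed.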