When is hash(n) == n in Python?
Question:
I’ve been playing with Python’s hash function. For small integers, it appears hash(n) == n
always. However this does not extend to large numbers:
>>> hash(2**100) == 2**100
False
I’m not surprised, I understand hash takes a finite range of values. What is that range?
I tried using binary search to find the smallest number n for which hash(n) != n:
>>> import codejamhelpers # pip install codejamhelpers
>>> help(codejamhelpers.binary_search)
Help on function binary_search in module codejamhelpers.binary_search:
binary_search(f, t)
Given an increasing function :math:`f`, find the greatest non-negative integer :math:`n` such that :math:`f(n) \le t`. If :math:`f(n) > t` for all :math:`n \ge 0`, return None.
>>> f = lambda n: int(hash(n) != n)
>>> n = codejamhelpers.binary_search(f, 0)
>>> hash(n)
2305843009213693950
>>> hash(n+1)
0
What’s special about 2305843009213693951? I note it’s less than sys.maxsize == 9223372036854775807
Edit: I’m using Python 3. I ran the same binary search on Python 2 and got a different result, 2147483648, which I note is sys.maxint+1.
I also played with [hash(random.random()) for i in range(10**6)] to estimate the range of the hash function. The max is consistently below n above. Comparing the min, it seems Python 3’s hash is always positively valued, whereas Python 2’s hash can take negative values.
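The search above can be reproduced with the standard library only. This is a minimal sketch, assuming CPython 3; the function name smallest_mismatch is made up for illustration. It relies on hash(n) != n being false for every non-negative int below some threshold and true from that threshold on, which is what makes bisection valid here.

```python
import sys


def smallest_mismatch():
    """Return the smallest non-negative int n with hash(n) != n."""
    # Exponential search: double hi until the hash stops matching.
    hi = 1
    while hash(hi) == hi:
        hi *= 2
    lo = hi // 2  # last power of two that still hashed to itself

    # Bisection: invariant is hash(lo) == lo and hash(hi) != hi.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if hash(mid) == mid:
            lo = mid
        else:
            hi = mid
    return hi


n = smallest_mismatch()
print(n)
print(n == sys.hash_info.modulus)
```

On a 64-bit build this should print the same 2305843009213693951 that the question asks about.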
Answers:
The implementation for the int type in CPython can be found here.
It just returns the value, except for -1, in which case it returns -2:
static long
int_hash(PyIntObject *v)
{
    /* XXX If this is changed, you also need to change the way
       Python's long, float and complex types are hashed. */
    long x = v -> ob_ival;
    if (x == -1)
        x = -2;
    return x;
}
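In Python terms, the C function above behaves roughly like the sketch below. The Python-level int_hash here is only an illustration, not part of any library. The special case exists because -1 is the value CPython's C API uses to signal an error from a hash function, so it can never be a valid hash. The same special case is still visible in CPython 3.

```python
def int_hash(value):
    """Mimic the C int_hash shown above: identity, except -1 maps to -2."""
    # -1 is reserved as the C-level error return, so it is remapped.
    return -2 if value == -1 else value


print(int_hash(5), int_hash(-1))

# The built-in hash shows the same collision between -1 and -2.
print(hash(-1), hash(-2))
```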
The hash function returns a plain int, which means the returned value lies between -sys.maxint - 1 and sys.maxint. So if you pass sys.maxint + x to it, the result would be -sys.maxint + (x - 2).
hash(sys.maxint + 1) == sys.maxint + 1 # False
hash(sys.maxint + 1) == - sys.maxint -1 # True
hash(sys.maxint + sys.maxint) == -sys.maxint + sys.maxint - 2 # True
Meanwhile, 2**200 is n times greater than sys.maxint. My guess is that the hash wraps around the range -sys.maxint..+sys.maxint n times until it lands on a plain integer in that range, as in the code snippets above.
So generally, for any n <= sys.maxint:
hash(sys.maxint*n) == -sys.maxint*(n%2) + 2*(n%2)*sys.maxint - n/2 - (n + 1)%2 ## True
Note: this is true for Python 2.
2305843009213693951 is 2^61 - 1. It’s the largest Mersenne prime that fits into 64 bits.
If you have to make a hash just by taking the value mod some number, then a large Mersenne prime is a good choice: it’s easy to compute and ensures an even distribution of possibilities. (Although I personally would never make a hash this way.)
It’s especially convenient to compute the modulus for floating point numbers. They have an exponential component that multiplies the whole number by 2^x. Since 2^61 ≡ 1 (mod 2^61 - 1), you only need to consider the exponent mod 61.
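The exponent trick can be checked directly. This is a small sketch, assuming CPython 3; it reads the modulus from sys.hash_info instead of hard-coding 2^61 - 1, so the names M and BITS are just local helpers for illustration.

```python
import sys

M = sys.hash_info.modulus  # 2**61 - 1 on 64-bit CPython builds
BITS = M.bit_length()      # 61 on those builds

# A non-negative int hashes to its residue modulo M.
assert hash(10 ** 30) == 10 ** 30 % M

for k in (0, 1, 60, 61, 100, 200):
    # 2**BITS is congruent to 1 mod M, so only k % BITS matters.
    assert hash(2 ** k) == 2 ** (k % BITS)
    # The float with the same value gets the same hash.
    assert hash(2.0 ** k) == hash(2 ** k)

print("all checks passed")
```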
Based on the Python documentation in the pyhash.c file:
For numeric types, the hash of a number x is based on the reduction of x modulo the prime P = 2**_PyHASH_BITS - 1. It’s designed so that hash(x) == hash(y) whenever x and y are numerically equal, even if x and y have different types.
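The cross-type guarantee quoted above is easy to observe. A minimal sketch, assuming CPython 3; the particular values are arbitrary. For a fraction such as 1/2, the reduction uses the modular inverse of the denominator, computed here with Fermat's little theorem since the modulus is prime.

```python
import sys
from decimal import Decimal
from fractions import Fraction

M = sys.hash_info.modulus

# The same number in five different types: one hash value.
same_value = [3, 3.0, Fraction(3, 1), Decimal(3), complex(3, 0)]
print(len({hash(v) for v in same_value}))  # 1

# 1/2 reduces to the inverse of 2 modulo M, whatever the type.
inverse_of_two = pow(2, M - 2, M)
halves = [0.5, Fraction(1, 2), Decimal("0.5")]
print(all(hash(h) == inverse_of_two for h in halves))  # True
```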
So for a 64/32 bit machine, the reduction would be modulo 2**_PyHASH_BITS - 1, but what is _PyHASH_BITS?
You can find it in the pyhash.h header file, which defines it as 61 for a 64-bit machine (you can read more explanation in the pyconfig.h file).
#if SIZEOF_VOID_P >= 8
# define _PyHASH_BITS 61
#else
# define _PyHASH_BITS 31
#endif
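The same platform check can be mirrored from Python. This is a sketch, assuming CPython 3; struct.calcsize("P") gives the size of a C pointer, which stands in for SIZEOF_VOID_P, and the variable names are illustrative only.

```python
import struct
import sys

pointer_bytes = struct.calcsize("P")          # size of void* in bytes
hash_bits = 61 if pointer_bytes >= 8 else 31  # same rule as the #if above

print(pointer_bytes, hash_bits)
print(sys.hash_info.modulus == 2 ** hash_bits - 1)
```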
So first of all, it’s based on your platform. For example, on my 64-bit Linux platform the reduction is modulo 2^61 - 1, which is 2305843009213693951:
>>> 2**61 - 1
2305843009213693951
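You can also use math.frexp to get the mantissa and exponent of sys.maxint (Python 2), which on a 64-bit machine shows that the max int is about 2^63. The exact value is 2^63 - 1, which rounds up to 2^63 when converted to a float:
>>> import math
>>> import sys
>>> math.frexp(sys.maxint)
(0.5, 64)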
And you can see the difference with a simple test (on Python 2, where sys.maxint exists):
>>> hash(2**62) == 2**62
True
>>> hash(2**63) == 2**63
False
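On Python 3 the boundary sits at the modulus instead of at sys.maxint. A short sketch, assuming CPython 3, that probes the values around it; M is just a local name for sys.hash_info.modulus.

```python
import sys

M = sys.hash_info.modulus

print(hash(M - 1) == M - 1)  # True: largest int that hashes to itself
print(hash(M))               # 0: wraps around at the modulus
print(hash(M + 1))           # 1
print(hash(-M - 5))          # -5: negative ints keep their sign
```

The last line also shows that hashes of ints can be negative on Python 3; only the absolute value is reduced modulo M.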
Read the complete documentation about the Python hashing algorithm: https://github.com/python/cpython/blob/master/Python/pyhash.c#L34
As mentioned in a comment, you can use sys.hash_info (in Python 3.x), which gives you a struct sequence of parameters used for computing hashes.
>>> sys.hash_info
sys.hash_info(width=64, modulus=2305843009213693951, inf=314159, nan=0, imag=1000003, algorithm='siphash24', hash_bits=64, seed_bits=128, cutoff=0)
Alongside the modulus described above, you can also get the inf value as follows:
>>> hash(float('inf'))
314159
>>> sys.hash_info.inf
314159
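A few of the other fields can be checked the same way. This is a small sketch, assuming CPython 3. It deliberately leaves out the nan field, since how NaN is hashed has changed between Python versions.

```python
import sys

info = sys.hash_info

print(hash(float("inf")) == info.inf)    # True
print(hash(float("-inf")) == -info.inf)  # True
# A pure imaginary unit hashes to the imag multiplier.
print(hash(complex(0, 1)) == info.imag)  # True
# The hash is width bits wide, but the modulus uses fewer bits.
print(info.width, info.modulus.bit_length())
```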