Persistent Hashing of Strings in Python

Question:

How would you convert an arbitrary string into a unique integer that is the same across Python sessions and platforms? For example, hash('my string') wouldn’t work because it returns a different value for each Python session and platform.

Asked By: Cerin

Answers:

Use a hash algorithm such as MD5 or SHA1, then convert the hexdigest via int():

>>> import hashlib
>>> int(hashlib.md5('Hello, world!').hexdigest(), 16)
144653930895353261282233826065192032313L
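
On Python 3, hashlib accepts only bytes, so the string has to be encoded first. A minimal sketch of the same idea, assuming Python 3 and UTF-8 encoding (the helper name stable_int is made up for illustration):

```python
import hashlib

def stable_int(s, encoding="utf-8"):
    """Map a string to an integer that is identical across sessions and platforms."""
    # MD5 yields 128 bits, i.e. 32 hex digits; int(..., 16) turns them into one integer.
    digest = hashlib.md5(s.encode(encoding)).hexdigest()
    return int(digest, 16)

print(stable_int("Hello, world!"))
```

For an ASCII string like the one above, the UTF-8 bytes match the Python 2 byte string, so the result should agree with the session shown. As with any hash, the value is repeatable but not guaranteed to be unique.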

First off, you probably don’t want the integers to be truly unique: if you do, your numbers might be unlimited in size. If that really is what you want, you could use a bignum library and interpret the bits of the string as the representation of a (potentially very large) integer. If your strings can include the \0 character, you should prepend a 1 so that you can distinguish e.g. “\0” from “\0\0”.
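
The "interpret the bits as an integer" idea can be sketched with Python's built-in arbitrary-precision integers, so no separate bignum library is needed. This assumes Python 3 and UTF-8 encoding, and the function names are illustrative only. The leading 0x01 byte plays the role of the prepended 1:

```python
def str_to_unique_int(s):
    """Encode a string as one integer; distinct strings give distinct integers."""
    # The sentinel byte keeps leading NUL bytes significant.
    data = b"\x01" + s.encode("utf-8")
    return int.from_bytes(data, "big")

def unique_int_to_str(n):
    """Invert str_to_unique_int."""
    data = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return data[1:].decode("utf-8")  # drop the sentinel byte

n = str_to_unique_int("my string")
assert unique_int_to_str(n) == "my string"
assert str_to_unique_int("\0") != str_to_unique_int("\0\0")
```

The integer grows by eight bits per encoded byte, which is the unbounded size mentioned above.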

Now, if you prefer bounded-size numbers, you’ll be using some form of hashing. MD5 will work, but it’s overkill for the stated purpose. I recommend using sdbm instead; it works very well. In C it looks like this:

static unsigned long sdbm(unsigned char *str)
{
    unsigned long hash = 0;
    int c;

    while (c = *str++)
        hash = c + (hash << 6) + (hash << 16) - hash;

    return hash;
}

The source, http://www.cse.yorku.ca/~oz/hash.html, also presents a few other hash functions.

Answered By: redtuna

If a hash function really won’t work for you, you can turn the string into a number.

my_string = 'my string'
def string_to_int(s):
    ord3 = lambda x : '%.3d' % ord(x)
    return int(''.join(map(ord3, s)))

In[10]: string_to_int(my_string)
Out[11]: 109121032115116114105110103L

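This is invertible for characters with codes below 1000 (apart from leading \0 characters), by mapping each triplet of digits through chr. Leading zeros dropped by int() have to be restored first.

def int_to_string(n):
    s = str(n)
    s = s.zfill(len(s) + (-len(s)) % 3)  # re-pad to a multiple of 3 digits
    return ''.join([chr(int(s[i:i+3])) for i in range(0, len(s), 3)])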

In[12]: int_to_string(109121032115116114105110103L)
Out[13]: 'my string'
Answered By: Jason Sundram

Here are my Python 2.7 implementations of the algorithms listed here: http://www.cse.yorku.ca/~oz/hash.html.
No idea if they are efficient or not.

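from ctypes import c_ulong
from functools import reduce  # a builtin on Python 2; this import is required on Python 3

# Wrap to the width of a C unsigned long (platform-dependent: 32 or 64 bits)
def ulong(i): return c_ulong(i).value  # numpy would be better if available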

def djb2(L):
  """
  h = 5381
  for c in L:
    h = ((h << 5) + h) + ord(c) # h * 33 + c
  return h
  """
  return reduce(lambda h,c: ord(c) + ((h << 5) + h), L, 5381)

def djb2_l(L):
  return reduce(lambda h,c: ulong(ord(c) + ((h << 5) + h)), L, 5381)

def sdbm(L):
  """
  h = 0
  for c in L:
    h = ord(c) + (h << 6) + (h << 16) - h
  return h
  """
  return reduce(lambda h,c: ord(c) + (h << 6) + (h << 16) - h, L, 0)

def sdbm_l(L):
  return reduce(lambda h,c: ulong(ord(c) + (h << 6) + (h << 16) - h), L, 0)

def loselose(L):
  """
  h = 0
  for c in L:
    h += ord(c)
  return h
  """
  return sum(ord(c) for c in L)

def loselose_l(L):
  return reduce(lambda h,c: ulong(ord(c) + h), L, 0)
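
Because the width of a C unsigned long differs between platforms, the *_l variants above may not give the same value everywhere. One possible way around that is to mask to a fixed width instead. The sketch below assumes Python 3, UTF-8 input and a 32-bit width; the names MASK32, djb2_32 and sdbm_32 are made up for illustration and are not part of the listing above:

```python
MASK32 = 0xFFFFFFFF  # assumed fixed width; any fixed width would do

def djb2_32(s):
    """djb2 (h * 33 + c), truncated to 32 bits after every step."""
    h = 5381
    for byte in s.encode("utf-8"):
        h = (h * 33 + byte) & MASK32
    return h

def sdbm_32(s):
    """sdbm (c + (h << 6) + (h << 16) - h), truncated to 32 bits after every step."""
    h = 0
    for byte in s.encode("utf-8"):
        h = (byte + (h << 6) + (h << 16) - h) & MASK32
    return h

print(djb2_32("my string"), sdbm_32("my string"))
```

Masking inside the loop keeps intermediate values small and mimics unsigned overflow in C, so the result does not depend on the platform's integer sizes.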
Answered By: jichi

Here’s another option, quite crude (probably has many collisions) and not very legible.

It worked for the purpose of generating an int (and later on, a random color) for different strings:

aString = "don't panic"
reduce( lambda x,y:x+y, map( lambda x:ord(x[0])*x[1],zip( aString, range( 1, len( aString ) ) ) ) )
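
The one-liner may be easier to follow as a generator expression. This is a sketch assuming Python 3 (where reduce is no longer a builtin), and the function name is made up. Note that range(1, len(aString)) is one element short, so the last character never contributes; the version below keeps that behaviour. Unlike the reduce form, it returns 0 for strings shorter than two characters instead of raising an error.

```python
def weighted_ord_sum(s):
    """Sum of ord(char) * position, positions starting at 1; the last character is skipped."""
    # zip stops at the shorter input, and range(1, len(s)) has only len(s) - 1 items.
    return sum(ord(ch) * i for ch, i in zip(s, range(1, len(s))))

print(weighted_ord_sum("don't panic"))
```

Two strings that differ only in their final character collide, which fits the "many collisions" caveat.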
Answered By: Dan Wills