Generate ID from string in Python

Question:

I’m struggling a bit to generate an integer ID for a given string in Python.

I thought the built-in hash function would be perfect, but the IDs are sometimes too long. That is a problem because I’m limited to 64 bits at most.

My code so far: hash(s) % 10000000000.
The input strings I can expect will be 12-512 characters long.

Requirements are:

  • integers only
  • generated from provided string
  • ideally up to 10-12 chars long (I’ll have ~5 million items only)
  • low probability of collision..?

I would be glad if someone can provide any tips / solutions.

Asked By: mlen108


Answers:

I would do something like this:

>>> import hashlib
>>> m = hashlib.md5()
>>> m.update("some string".encode("utf-8"))  # hashlib needs bytes on Python 3
>>> str(int(m.hexdigest(), 16))[0:12]
'120665287271'

The idea:

  1. Calculate the hash of the string with MD5 (or SHA-1 or …) in hexadecimal form (see the hashlib module).
  2. Convert the hexadecimal string to an integer, then convert that integer back to a base-10 string (so the result contains only digits).
  3. Use the first 12 characters of that string.

If characters a-f are also okay, I would do m.hexdigest()[0:12].
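Whether 12 digits is enough for ~5 million items can be estimated with the birthday bound. The sketch below is only an approximation and assumes the IDs are uniformly distributed over the ID space, which truncated decimal digits only roughly satisfy; the function name collision_probability is made up for illustration. With 5 million items, 12 decimal digits gives about 12.5 expected colliding pairs, so at least one collision is nearly certain, whereas a full 64-bit space brings the probability down to below one in a million.

```python
import math

def collision_probability(n_items: int, id_space: int) -> float:
    """Birthday-bound approximation of P(at least one collision).

    Assumes IDs are drawn uniformly from `id_space` possible values.
    """
    expected_pairs = n_items * (n_items - 1) / (2 * id_space)
    # 1 - exp(-x), computed in a numerically stable way for small x
    return -math.expm1(-expected_pairs)

n = 5_000_000
for label, space in [
    ("10 decimal digits", 10**10),
    ("12 decimal digits", 10**12),
    ("64 bits", 2**64),
]:
    print(f"{label}: {collision_probability(n, space):.3g}")
```

So if collisions matter, it is probably safer to keep as many bits as the 64-bit limit allows rather than cutting down to 10-12 digits.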

Answered By: Stephan Kulla

If you’re not allowed to add an extra dependency, you can continue using the hash function in the following way:

>>> my_string = "whatever"
>>> str(hash(my_string))[1:13]
'460440266319'

NB:

  • I am ignoring the 1st character because it may be a negative sign.
  • hash may return different values for the same string from one run to the next, because string hashing is randomized per process unless PYTHONHASHSEED is set. You may want to set it to a fixed value. See the Python documentation for PYTHONHASHSEED.
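
The effect of PYTHONHASHSEED can be checked by running hash in fresh interpreter processes. This is just a sketch: the helper name hash_in_fresh_interpreter is hypothetical, and it assumes the current interpreter can be launched as a subprocess.

```python
import os
import subprocess
import sys

def hash_in_fresh_interpreter(text: str, seed: str) -> int:
    """Return hash(text) as computed by a new Python process.

    `seed` is passed through the PYTHONHASHSEED environment variable
    ("random" or a decimal integer such as "0").
    """
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)

# With the same fixed seed, two separate processes agree.
a = hash_in_fresh_interpreter("whatever", "0")
b = hash_in_fresh_interpreter("whatever", "0")
print(a == b)  # True
```

Note that the variable has to be set before the interpreter starts; changing os.environ inside an already running program does not affect its hash values.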
Answered By: Aditya Shaw

Encoding the string as UTF-8 was needed for mine to work:

def unique_name_from_str(string: str, last_idx: int = 12) -> str:
    """
    Generates a unique id name
    refs:
    - md5: https://stackoverflow.com/questions/22974499/generate-id-from-string-in-python
    - sha3: https://stackoverflow.com/questions/47601592/safest-way-to-generate-a-unique-hash
    (- guid/uuid: https://stackoverflow.com/questions/534839/how-to-create-a-guid-uuid-in-python?noredirect=1&lq=1)
    """
    import hashlib
    m = hashlib.md5()
    # hashlib works on bytes, so encode the text first
    m.update(string.encode('utf-8'))
    # decimal form of the 128-bit digest, cut to the first last_idx digits
    unique_name: str = str(int(m.hexdigest(), 16))[0:last_idx]
    return unique_name

See my ultimate-utils Python library.
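
Since the question asks for an integer within 64 bits rather than a digit string, a variation is to take the leading bytes of the digest directly. This is only a sketch under my own assumptions: the name int_id_from_str and the n_bytes parameter are made up for illustration, and I swapped in SHA-256 for MD5 (any hashlib digest would work the same way).

```python
import hashlib

def int_id_from_str(text: str, n_bytes: int = 8) -> int:
    """Return a stable non-negative integer ID below 2**(8 * n_bytes).

    The same input gives the same ID in every run, because hashlib
    digests do not depend on PYTHONHASHSEED.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Interpret the first n_bytes of the digest as a big-endian unsigned integer.
    return int.from_bytes(digest[:n_bytes], "big")

print(int_id_from_str("some string"))
```

With n_bytes=8 the result fits in an unsigned 64-bit integer. If the ID has to fit a signed 64-bit field, masking the result with (1 << 63) - 1 is one option.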

Answered By: Charlie Parker

The hash function seems to generate a different output from the same input string each time the kernel is restarted. Is there any way to avoid that? I need the function to generate the same output for a given input every time. Perhaps hash is not the answer, but then what would be?

Br,
Antti

Answered By: Antti Siirilä