Generate ID from string in Python
Question:
I'm struggling a bit to generate an integer ID for a given string in Python.
I thought the built-in hash function would be perfect, but the IDs it produces are sometimes too long. That is a problem, since I'm limited to 64 bits as the maximum length.
My code so far: hash(s) % 10000000000.
The input strings I can expect will be 12 to 512 characters long.
Requirements are:
- integers only
- generated from provided string
- ideally up to 10-12 digits long (I'll have only ~5 million items)
- low probability of collision
I would be glad if someone can provide any tips / solutions.
Answers:
I would do something like this:
>>> import hashlib
>>> m = hashlib.md5()
>>> m.update("some string".encode("utf-8"))  # update() requires bytes in Python 3
>>> str(int(m.hexdigest(), 16))[0:12]
'120665287271'
The idea:
- Calculate the hash of the string with MD5 (or SHA-1 or …) in hexadecimal form (see the hashlib module).
- Convert the hexadecimal string to an integer, then convert that integer back to a base-10 string (so the result contains only digits).
- Use the first 12 characters of that string.
If the characters a-f are also okay, I would use m.hexdigest()[0:12].
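As a variation on the steps above, here is a minimal sketch that reduces the digest with a modulo instead of slicing the decimal string, which keeps the result an int and guarantees the digit limit. The helper names stable_int_id and expected_collisions are made up for illustration, and the collision figure is only the usual birthday-style approximation under the assumption that the hash output is uniformly distributed.

```python
import hashlib


def stable_int_id(text: str, digits: int = 12) -> int:
    """Hypothetical helper: integer ID with at most `digits` decimal digits."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    # Read the 16 digest bytes as one big integer, then reduce it to the range.
    return int.from_bytes(digest, "big") % 10**digits


def expected_collisions(items: int, digits: int = 12) -> float:
    """Approximate number of colliding pairs: n * (n - 1) / (2 * N)."""
    return items * (items - 1) / (2 * 10**digits)


if __name__ == "__main__":
    print(stable_int_id("some string"))
    print(expected_collisions(5_000_000, 12))
```

With 5 million items and 12 digits the approximation gives roughly 12.5 colliding pairs, so a few collisions should be expected at that length. Up to 18 digits still fits in a signed 64-bit integer and brings the estimate far below one.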
If you're not allowed to add an extra dependency, you can keep using the built-in hash function in the following way:
>>> my_string = "whatever"
>>> str(hash(my_string))[1:13]
'460440266319'
NB:
- I am ignoring the first character, as it may be the negative sign.
- hash may return different values for the same string, because the hash seed (PYTHONHASHSEED) is randomized by default every time you run your program. You may want to set it to a fixed value.
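To see the effect of PYTHONHASHSEED described in the note above, one option is to compute hash() in a fresh interpreter with the variable set explicitly. This is only a sketch: the helper name hash_in_fresh_interpreter is hypothetical, and it assumes sys.executable points at a usable Python interpreter.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(text: str, seed: str) -> int:
    """Hypothetical helper: hash(text) as seen by a new process with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    # The same seed gives the same value in every new process.
    print(hash_in_fresh_interpreter("whatever", "0"))
    print(hash_in_fresh_interpreter("whatever", "0"))
    # "random" is the default behaviour: the value usually differs per run.
    print(hash_in_fresh_interpreter("whatever", "random"))
```

Fixing the seed makes hash() repeatable across runs of the same Python build, but the value can still differ between Python versions or platforms, so it is a weak basis for IDs that are stored long term.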
Encoding to UTF-8 was needed for mine to work:
def unique_name_from_str(string: str, last_idx: int = 12) -> str:
    """
    Generates a unique id name
    refs:
    - md5: https://stackoverflow.com/questions/22974499/generate-id-from-string-in-python
    - sha3: https://stackoverflow.com/questions/47601592/safest-way-to-generate-a-unique-hash
    (- guid/uuid: https://stackoverflow.com/questions/534839/how-to-create-a-guid-uuid-in-python?noredirect=1&lq=1)
    """
    import hashlib
    m = hashlib.md5()
    # hashlib works on bytes, so encode the string first
    m.update(string.encode('utf-8'))
    unique_name: str = str(int(m.hexdigest(), 16))[0:last_idx]
    return unique_name
See my ultimate-utils Python library.
The hash function seems to generate a different output for the same input string each time the kernel is restarted. Is there any way to avoid that? I need the function to generate the same output for a given input every time. Perhaps hash is not the answer, but what would be?
Br,
Antti
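One possible way to get the same output after every restart is to use a digest from hashlib instead of the built-in hash, since hashlib digests depend only on the input bytes. The sketch below assumes BLAKE2b with an 8-byte digest is acceptable (a swapped-in choice, not something used elsewhere in this thread), and id64 is a hypothetical name.

```python
import hashlib


def id64(text: str) -> int:
    """Hypothetical helper: non-negative integer below 2**63 derived from text."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    # Clear the top bit so the value also fits a signed 64-bit field.
    return int.from_bytes(digest, "big") & (2**63 - 1)


if __name__ == "__main__":
    # The printed value does not change between interpreter or kernel restarts.
    print(id64("whatever"))
```

Because no random seed is involved, nothing like PYTHONHASHSEED has to be configured for this to be repeatable.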