safe enough 8-character short unique random string
Question:
I am trying to compute 8-character short unique random filenames for, let’s say, thousands of files without probable name collision. Is this method safe enough?
base64.urlsafe_b64encode(hashlib.md5(os.urandom(128)).digest())[:8]
Edit
To be clearer, I am trying to achieve the simplest possible obfuscation of filenames being uploaded to storage.
I figured that a sufficiently random 8-character string would be a very efficient and simple way to store tens of thousands of files without probable collision, when implemented right. I don’t need guaranteed uniqueness, only a high enough improbability of name collision (talking about only thousands of names).
Files are being stored in a concurrent environment, so incrementing a shared counter is achievable, but complicated. Storing the counter in a database would be inefficient.
I am also facing the fact that random() under some circumstances returns the same pseudorandom sequences in different processes.
Answers:
Your current method should be safe enough, but you could also take a look at the uuid module, e.g.:
import uuid
print(str(uuid.uuid4())[:8])
Output:
ef21b9ad
Is there a reason you can’t use tempfile to generate the names?
Functions like mkstemp and NamedTemporaryFile are absolutely guaranteed to give you unique names; nothing based on random bytes is going to give you that.
If for some reason you don’t actually want the file created yet (e.g., you’re generating filenames to be used on some remote server or something), you can’t be perfectly safe, but mktemp is still safer than random names.
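A minimal sketch of the tempfile approach, assuming the files land in a local directory the process can write to. The function name save_upload and the upload_ prefix are hypothetical. mkstemp picks the name and creates the file in one step, so two concurrent uploads cannot end up with the same name.

```python
import os
import tempfile


def save_upload(data: bytes, directory: str) -> str:
    """Write data to a new, uniquely named file and return its name."""
    # mkstemp creates the file exclusively; if a candidate name is
    # already taken it tries another one internally.
    fd, path = tempfile.mkstemp(prefix="upload_", dir=directory)
    with os.fdopen(fd, "wb") as fh:
        fh.write(data)
    return os.path.basename(path)
```

The returned name is the prefix followed by a short random part chosen by the tempfile module, so its length is not under your control.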
Or just keep a 48-bit counter stored in some “global enough” location, so you guarantee going through the full cycle of names before a collision, and you also guarantee knowing when a collision is going to happen.
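The counter idea can be sketched as follows, assuming something else (a lock, a database sequence, a file) hands out the integers. Only the encoding step is shown, and counter_to_name is a hypothetical helper. Six bytes are exactly 48 bits, which base64 turns into 8 characters with no padding.

```python
import base64


def counter_to_name(n: int) -> str:
    """Encode a counter value in [0, 2**48) as an 8-character URL-safe name."""
    if not 0 <= n < 2**48:
        raise ValueError("counter must fit in 48 bits")
    # 6 bytes -> 8 base64 characters, so no '=' padding appears.
    return base64.urlsafe_b64encode(n.to_bytes(6, "big")).decode("ascii")
```

Names produced this way are sequential and therefore predictable, so this gives uniqueness but no obfuscation on its own.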
They’re all safer, and simpler, and much more efficient than reading urandom and doing an md5.
If you really do want to generate random names, ''.join(random.choice(my_charset) for _ in range(8)) is also going to be simpler than what you’re doing, and more efficient. Even urlsafe_b64encode(os.urandom(6)) is just as random as the MD5 hash, and simpler and more efficient.
The only benefit of the cryptographic randomness and/or cryptographic hash function is in avoiding predictability. If that’s not an issue for you, why pay for it? And if you do need to avoid predictability, you almost certainly need to avoid races and other much simpler attacks, so avoiding mkstemp or NamedTemporaryFile is a very bad idea.
Not to mention that, as Root points out in a comment, if you need security, MD5 doesn’t actually provide it.
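To put a number on "safe enough", here is a rough sketch of the birthday approximation, assuming names are drawn uniformly and independently. collision_probability is a hypothetical helper, not part of any library.

```python
import math


def collision_probability(n_names: int, n_possible: int) -> float:
    """Approximate chance that at least two of n_names random names coincide."""
    pairs = n_names * (n_names - 1) / 2
    # Birthday approximation: p = 1 - exp(-pairs / n_possible).
    return -math.expm1(-pairs / n_possible)


# 8 base64 characters carry 48 bits, i.e. 2**48 possible names.
p = collision_probability(10_000, 2**48)
```

For 10,000 names out of 2**48 the result is on the order of 1e-7, which is why 6 random bytes are plenty for "thousands of names".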
You can try this:
import random
import string
# Lowercase letters plus digits: 36 symbols.
uid_chars = string.ascii_lowercase + string.digits
uid_length = 8
def short_uid():
    # Join uid_length independently chosen characters.
    return ''.join(random.choice(uid_chars) for _ in range(uid_length))
e.g.:
print(short_uid())
nogbomcv
I am using hashids to convert a timestamp into a unique id. (You can even convert it back to a timestamp if you want).
The drawback with this is if you create ids too fast, you will get a duplicate. But, if you are generating them with time in-between, then this is an option.
Here is an example:
from hashids import Hashids
from datetime import datetime
hashids = Hashids(salt = "lorem ipsum dolor sit amet", alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890")
print(hashids.encode(int(datetime.today().timestamp()))) #'QJW60PJ1' when I ran it
You can try the shortuuid library.
Install with : pip install shortuuid
Then it is as simple as :
> import shortuuid
> shortuuid.uuid()
'vytxeTZskVKR7C7WgdSP3d'
Which method has fewer collisions, is faster and easier to read?
TLDR
random_choice is the fastest and has few collisions, but is IMO slightly harder to read.
The most readable is shortuuid_random, but it is an external dependency, is slightly slower, and has about 6x the collisions.
The methods
alphabet = string.ascii_lowercase + string.digits
su = shortuuid.ShortUUID(alphabet=alphabet)
def random_choice():
    return ''.join(random.choices(alphabet, k=8))
def truncated_uuid4():
    return str(uuid.uuid4())[:8]
def shortuuid_random():
    return su.random(length=8)
def secrets_random_choice():
    return ''.join(secrets.choice(alphabet) for _ in range(8))
Results
All methods generate 8-character IDs. All but truncated_uuid4 draw from the abcdefghijklmnopqrstuvwxyz0123456789 alphabet; truncated_uuid4 yields only the hexadecimal digits 0-9 and a-f, so it has far fewer possible values and collides much more often. Collisions are counted in a single run of 10 million draws. Time is the time in seconds to execute 1,000 calls, reported as mean ± standard deviation over 100 runs. Total time is the total execution time of the collision test.
random_choice: collisions 22 - time (s) 0.00229 ± 0.00016 - total (s) 29.70518
truncated_uuid4: collisions 11711 - time (s) 0.00439 ± 0.00021 - total (s) 54.03649
shortuuid_random: collisions 124 - time (s) 0.00482 ± 0.00029 - total (s) 51.19624
secrets_random_choice: collisions 15 - time (s) 0.02113 ± 0.00072 - total (s) 228.23106
Notes
- the default shortuuid alphabet has uppercase characters, hence creating fewer collisions. To make it a fair comparison we need to select the same alphabet as the other methods.
- the secrets methods token_hex and token_urlsafe, while possibly faster, have different alphabets, hence are not eligible for the comparison.
- the alphabet and the class-based shortuuid instance are factored out as module variables, hence speeding up the method execution. This should not affect the TLDR.
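As a sanity check on the collision counts, here is a rough estimate of how many repeats to expect from uniform draws, assuming each 8-character value is equally likely. expected_collisions is a hypothetical helper that uses the pair-counting approximation draws² / (2 × space).

```python
def expected_collisions(draws: int, space: int) -> float:
    """Approximate number of repeated values among `draws` uniform samples."""
    return draws * (draws - 1) / (2 * space)


draws = 10_000_000
base36 = expected_collisions(draws, 36**8)  # 36-character alphabet
hex8 = expected_collisions(draws, 16**8)    # first 8 hex digits of a UUID
```

The estimates come out near 18 for the 36-character alphabet and near 11,600 for 8 hex digits, in line with the 15 to 22 and 11711 measured above. The 124 reported for shortuuid_random is higher than this estimate predicts for a uniform 36-character draw.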
Full testing details
import random
import secrets
from statistics import mean
from statistics import stdev
import string
import time
import timeit
import uuid
import shortuuid
alphabet = string.ascii_lowercase + string.digits
su = shortuuid.ShortUUID(alphabet=alphabet)
def random_choice():
    return ''.join(random.choices(alphabet, k=8))
def truncated_uuid4():
    return str(uuid.uuid4())[:8]
def shortuuid_random():
    return su.random(length=8)
def secrets_random_choice():
    return ''.join(secrets.choice(alphabet) for _ in range(8))
def test_collisions(fun):
    out = set()
    count = 0
    for _ in range(10_000_000):
        new = fun()
        if new in out:
            count += 1
        else:
            out.add(new)
    return count
def run_and_print_results(fun):
    round_digits = 5
    now = time.time()
    collisions = test_collisions(fun)
    total_time = round(time.time() - now, round_digits)
    trials = 1_000
    runs = 100
    func_time = timeit.repeat(fun, repeat=runs, number=trials)
    avg = round(mean(func_time), round_digits)
    std = round(stdev(func_time), round_digits)
    print(f'{fun.__name__}: collisions {collisions} - '
          f'time (s) {avg} ± {std} - '
          f'total (s) {total_time}')
if __name__ == '__main__':
    run_and_print_results(random_choice)
    run_and_print_results(truncated_uuid4)
    run_and_print_results(shortuuid_random)
    run_and_print_results(secrets_random_choice)
From Python 3.6 you should probably use the secrets module. secrets.token_urlsafe() seems to work for your case just fine, and it is guaranteed to use cryptographically safe random sources.
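A short sketch of how this maps onto the 8-character requirement; random_name is a hypothetical wrapper. Note that the argument to token_urlsafe is a number of bytes, not a number of characters.

```python
import secrets


def random_name() -> str:
    """Return an 8-character URL-safe name built from 6 random bytes."""
    # 6 bytes -> 8 base64 characters (every 3 bytes become 4 characters).
    return secrets.token_urlsafe(6)
```

The result may contain - and _, and can begin with -, which some command-line tools interpret as an option when the name is passed as an argument.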
Fastest Deterministic Method
import random
import binascii
e = random.Random(seed)  # seed is a value you choose; the same seed reproduces the same names
binascii.b2a_base64(e.getrandbits(48).to_bytes(6, 'little'), newline=False)
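To illustrate what "deterministic" means here, a sketch that generates names from an explicitly seeded generator; seeded_names is a hypothetical helper, and the seed value is arbitrary.

```python
import binascii
import random


def seeded_names(seed: int, count: int) -> list:
    """Generate `count` 8-character names from a generator seeded with `seed`."""
    rng = random.Random(seed)
    names = []
    for _ in range(count):
        raw = rng.getrandbits(48).to_bytes(6, "little")
        names.append(binascii.b2a_base64(raw, newline=False).decode("ascii"))
    return names
```

Two calls with the same seed return identical lists, which is the behaviour the question observed across processes. The standard base64 alphabet also includes + and /, so these names are not safe to use as filenames without translation.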
Fastest System Random Method
import os
import binascii
binascii.b2a_base64(os.urandom(6), newline=False)
Url Safe Methods
Use os.urandom
import os
import base64
base64.urlsafe_b64encode(os.urandom(6)).decode()
Use random.Random.choices (slow, but flexible)
import random
import string
alphabet = string.ascii_letters + string.digits + '-_'
''.join(random.choices(alphabet, k=8))
Use random.Random.getrandbits (faster than random.Random.randbytes)
import random
import base64
base64.urlsafe_b64encode(random.getrandbits(48).to_bytes(6, 'little')).decode()
Use random.Random.randbytes (python >= 3.9)
import random
import base64
base64.urlsafe_b64encode(random.randbytes(6)).decode()
Use random.SystemRandom.randbytes (python >= 3.9)
import random
import base64
e = random.SystemRandom()
base64.urlsafe_b64encode(e.randbytes(6)).decode()
random.SystemRandom.getrandbits is not recommended on python >= 3.9, since it takes 2.5x as long as random.SystemRandom.randbytes and is more complicated.
Use secrets.token_bytes (python >= 3.6)
import secrets
import base64
base64.urlsafe_b64encode(secrets.token_bytes(6)).decode()
Use secrets.token_urlsafe (python >= 3.6)
import secrets
secrets.token_urlsafe(6)  # 6 random bytes encode to 8 base64 characters
Further Discussion
secrets.token_urlsafe implementation in python3.9:
tok = token_bytes(nbytes)
base64.urlsafe_b64encode(tok).rstrip(b'=').decode('ascii')
Since .decode() on ASCII bytes is faster than .decode('ascii'), and .rstrip(b'=') is unnecessary when nbytes % 3 == 0 (no padding is produced in that case), base64.urlsafe_b64encode(secrets.token_bytes(nbytes)).decode() is faster (~20%).
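Base64 maps every 3 input bytes to 4 output characters, so = padding appears only when the byte count is not a multiple of 3. A quick check of that claim; has_padding is a hypothetical helper.

```python
import base64
import os


def has_padding(nbytes: int) -> bool:
    """Report whether base64 output for nbytes input bytes ends with '='."""
    return base64.urlsafe_b64encode(os.urandom(nbytes)).endswith(b"=")


# Byte counts from 1 to 12 whose encoding needs padding.
padded = [n for n in range(1, 13) if has_padding(n)]
```

With 6 bytes (8 characters) there is nothing for rstrip to remove, so skipping it is safe for that size.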
On Windows 10, the bytes-based method is 2x faster when nbytes=6 (8 chars), and 5x faster when nbytes=24 (32 chars).
On Windows 10 (my laptop), secrets.token_bytes takes about as long as random.Random.randbytes, and base64.urlsafe_b64encode takes more time than the random byte generation.
On Ubuntu 20.04 (my cloud server, which may lack entropy), secrets.token_bytes takes 15x as long as random.Random.randbytes, but about as long as random.SystemRandom.randbytes.
Since secrets.token_bytes uses random.SystemRandom.randbytes, which uses os.urandom (so they are exactly the same), you may replace secrets.token_bytes with os.urandom if performance is crucial.
In Python 3.9, base64.urlsafe_b64encode is a combination of base64.b64encode and bytes.translate, and thus takes ~30% more time.
random.Random.randbytes(n) is implemented as random.Random.getrandbits(n * 8).to_bytes(n, 'little'), and is thus 3x slower. (However, random.SystemRandom.getrandbits is implemented with random.SystemRandom.randbytes.)
base64.b32encode is dramatically slower (5x for 6 bytes, 17x for 480 bytes) than base64.b64encode, because there is a lot of Python code in base64.b32encode, whereas base64.b64encode just calls binascii.b2a_base64 (implemented in C).
However, there is a Python branch statement, if altchars is not None:, in base64.b64encode, which introduces non-negligible overhead when processing small data; binascii.b2a_base64(data, newline=False) may be better.
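The timings quoted here depend on the machine and Python version, so it may be worth measuring on your own system. A minimal timeit sketch comparing the two encoders on 6 bytes; the function names and the default repetition counts are arbitrary choices, not taken from the answer.

```python
import base64
import binascii
import os
import timeit

data = os.urandom(6)


def via_base64() -> bytes:
    return base64.b64encode(data)


def via_binascii() -> bytes:
    return binascii.b2a_base64(data, newline=False)


def compare(number: int = 100_000) -> dict:
    """Time both encoders; values are the best of 5 runs, in seconds per `number` calls."""
    return {
        f.__name__: min(timeit.repeat(f, number=number, repeat=5))
        for f in (via_base64, via_binascii)
    }
```

Calling compare() returns a dictionary keyed by function name. Both functions produce the same bytes, so any difference is pure call overhead.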