# Fastest way to generate a random-like unique string with random length in Python 3

## Question:

I know how to create a random string, like:

``````
''.join(secrets.choice(string.ascii_uppercase + string.digits) for _ in range(N))
``````

However, there should be no duplicates, so I am currently just checking whether the key already exists in a list, as shown in the following code:

``````
import secrets
import string
import numpy as np

amount_of_keys = 40000

keys = []

for i in range(0, amount_of_keys):
    N = np.random.randint(12, 20)
    n_key = ''.join(secrets.choice(string.ascii_uppercase + string.digits) for _ in range(N))
    if n_key not in keys:
        keys.append(n_key)
``````

This is okay for a small number of keys like `40000`, but it does not scale well as the number of keys grows. So I am wondering if there is a faster way to get the result for even more keys, like `999999`.

## Basic improvements, sets and local names

Use a set, not a list, and testing for uniqueness is much faster; set membership testing takes constant time independent of the set size, while lists take O(N) linear time. Use a set comprehension to produce a series of keys at a time to avoid having to look up and call the `set.add()` method in a loop; properly random, larger keys have a very small chance of producing duplicates anyway.

Because this is done in a tight loop, it is worth your while optimising away all name lookups as much as possible:

``````
import secrets
import string
import numpy as np
from functools import partial

def produce_amount_keys(amount_of_keys, _randint=np.random.randint):
    keys = set()
    pickchar = partial(secrets.choice, string.ascii_uppercase + string.digits)
    while len(keys) < amount_of_keys:
        keys |= {''.join([pickchar() for _ in range(_randint(12, 20))]) for _ in range(amount_of_keys - len(keys))}
    return keys
``````

The `_randint` keyword argument binds the `np.random.randint` name to a local in the function; locals are faster to reference than globals, especially when attribute lookups are involved.

The `pickchar()` partial avoids looking up attributes on modules or more locals; it is a single callable that has all the references in place, so it is faster to execute, especially when done in a loop.

The `while` loop keeps iterating only if there were duplicates produced. We produce enough keys in a single set comprehension to fill the remainder if there are no duplicates.
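To see why the container type matters, here is a minimal sketch using only the standard library. It times a membership test for a value that is absent from a list and from a set holding the same 50,000 strings. The names (`membership_cost`, `items`, `probe`) and the sizes are illustrative assumptions, not part of the benchmark in this answer.

```python
import timeit

def membership_cost(container, probe, repeats=200):
    """Time `repeats` membership tests of `probe` against `container`."""
    return timeit.timeit(lambda: probe in container, number=repeats)

# Hypothetical data: 50,000 distinct keys, stored once as a list and once as a set.
items = [f"KEY{i:08d}" for i in range(50_000)]
as_set = set(items)

# A value that is not present forces the list to be scanned to the end,
# while the set only needs a hash lookup.
probe = "NOT-PRESENT"
list_time = membership_cost(items, probe)
set_time = membership_cost(as_set, probe)

print(f"list: {list_time:.6f}s  set: {set_time:.6f}s")
```

The exact numbers depend on the machine, but the list time grows with the number of stored keys while the set time stays roughly flat.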

## Timings for that first improvement

For 100 items, the difference is not that big:

``````>>> timeit('p(100)', 'from __main__ import produce_amount_keys_list as p', number=1000)
8.720592894009314
>>> timeit('p(100)', 'from __main__ import produce_amount_keys_set as p', number=1000)
7.680242831003852
``````

but when you start scaling this up, you’ll notice that the O(N) membership test cost against a list really drags your version down:

``````>>> timeit('p(10000)', 'from __main__ import produce_amount_keys_list as p', number=10)
15.46253142200294
>>> timeit('p(10000)', 'from __main__ import produce_amount_keys_set as p', number=10)
8.047800761007238
``````

My version is already almost twice as fast at 10k items; 40k items can be run 10 times in about 32 seconds:

``````>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_list as p', number=10)
138.84072386901244
>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_set as p', number=10)
32.40720253501786
``````

The list version took over 2 minutes, more than four times as long.

## Numpy’s random.choice function, not cryptographically strong

You can make this faster still by forgoing the `secrets` module and using `np.random.choice()` instead. This won’t produce cryptographic-level randomness, but picking a random character is twice as fast:

``````
def produce_amount_keys(amount_of_keys, _randint=np.random.randint):
    keys = set()
    pickchar = partial(
        np.random.choice,
        np.array(list(string.ascii_uppercase + string.digits)))
    while len(keys) < amount_of_keys:
        keys |= {''.join([pickchar() for _ in range(_randint(12, 20))]) for _ in range(amount_of_keys - len(keys))}
    return keys
``````

This makes a huge difference, now 10 times 40k keys can be produced in just 16 seconds:

``````>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_npchoice as p', number=10)
15.632006907981122
``````

## Further tweaks with the itertools module and a generator

We can also take the `unique_everseen()` function from the `itertools` module Recipes section to have it take care of the uniqueness, then use an infinite generator and the `itertools.islice()` function to limit the results to just the number we want:

``````
# additional imports
from itertools import islice

# assumption: unique_everseen defined or imported

def produce_amount_keys(amount_of_keys):
    pickchar = partial(
        np.random.choice,
        np.array(list(string.ascii_uppercase + string.digits)))
    def gen_keys(_range=range, _randint=np.random.randint):
        while True:
            yield ''.join([pickchar() for _ in _range(_randint(12, 20))])
    return list(islice(unique_everseen(gen_keys()), amount_of_keys))
``````

This is slightly faster still, but only marginally so:

``````>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_itertools as p', number=10)
14.698191125993617
``````
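The combination of a de-duplicating generator and `islice()` can be seen in isolation in the following sketch. It uses a simplified `unique_everseen()` (without the `key` argument of the itertools recipe) and a hypothetical `cycling_source()` generator that repeats a fixed sequence, so the behaviour is easy to follow.

```python
from itertools import islice

def unique_everseen(iterable):
    """Yield each element the first time it appears (simplified recipe)."""
    seen = set()
    for element in iterable:
        if element not in seen:
            seen.add(element)
            yield element

def cycling_source():
    """Hypothetical infinite source that keeps producing duplicates."""
    while True:
        yield from ["A", "B", "A", "C", "B", "D"]

# islice() stops pulling from the generator as soon as it has four items,
# so the infinite loop in cycling_source() is never a problem here.
first_four = list(islice(unique_everseen(cycling_source()), 4))
print(first_four)
```

Note that asking for more unique items than the source can ever produce would never finish; with random keys of 12 or more characters that is not a practical concern.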

## os.urandom() bytes and a different method of producing strings

Next, we can follow on from Adam Barnes’s ideas for using UUID4 (which is basically just a wrapper around `os.urandom()`) and Base64. But by case-folding Base64 and replacing 2 characters with randomly picked ones, his method severely limits the entropy in those strings (you won’t produce the full range of unique values possible, a 20-character string only using `(256 ** 15) / (36 ** 20)` == 1 in every 99437 bits of entropy!).

The Base64 encoding uses both upper and lower case characters and digits but also adds the `+` and `/` characters (or `-` and `_` for the URL-safe variant). For only uppercase letters and digits, you’d have to uppercase the output and map those extra two characters to other random characters, a process that throws away a large amount of entropy from the random data provided by `os.urandom()`. Instead of using Base64, you could also use the Base32 encoding, which uses uppercase letters and the digits 2 through 7, so it produces strings with 32 ** n possibilities versus 36 ** n. However, this can speed things up further from the above attempts:

``````
import os
import base64
import math

def produce_amount_keys(amount_of_keys):
    def gen_keys(_urandom=os.urandom, _encode=base64.b32encode, _randint=np.random.randint):
        # (count / math.log(256, 32)), rounded up, gives us the number of bytes
        # needed to produce *at least* count encoded characters
        factor = math.log(256, 32)
        input_length = [None] * 12 + [math.ceil(l / factor) for l in range(12, 20)]
        while True:
            count = _randint(12, 20)
            yield _encode(_urandom(input_length[count]))[:count].decode('ascii')
    return list(islice(unique_everseen(gen_keys()), amount_of_keys))
``````

This is really fast:

``````>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_b32 as p', number=10)
4.572628145979252
``````

40k keys, 10 times, in just over 4 seconds. So about 30 times as fast as the original list-based version (138.8 seconds); the speed of using `os.urandom()` as a source is undeniable.

This is cryptographically strong again; `os.urandom()` produces bytes for cryptographic use. On the other hand, we reduced the number of possible strings produced by more than 90% (`((36 ** 20) - (32 ** 20)) / (36 ** 20) * 100` is 90.5), because we are no longer using the `0`, `1`, `8` and `9` digits in the outputs.
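The size of that reduction can be checked with exact integer arithmetic. This is a small sketch; the helper name `lost_fraction` is made up for illustration.

```python
def lost_fraction(full_alphabet, reduced_alphabet, length):
    """Fraction of possible strings lost when shrinking the alphabet."""
    full = full_alphabet ** length
    reduced = reduced_alphabet ** length
    return (full - reduced) / full

# Share of 20-character strings that a 32-character alphabet cannot produce,
# compared with the 36-character alphabet of uppercase letters and digits.
loss_percent = lost_fraction(36, 32, 20) * 100
print(f"{loss_percent:.1f}%")
```

Shorter keys lose a smaller share, because the ratio `(32 / 36) ** length` shrinks as the length grows.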

So perhaps we should use the `urandom()` trick to produce a proper Base36 encoding; we’ll have to produce our own `b36encode()` function:

``````
import string
import math

def b36encode(b,
        _range=range, _ceil=math.ceil, _log=math.log, _fb=int.from_bytes, _len=len, _b=bytes,
        _c=(string.ascii_uppercase + string.digits).encode()):
    """Encode a bytes value to Base36 (uppercase ASCII and digits)

    This isn't too friendly on memory because we convert the whole bytes
    object to an int, but for smaller inputs this should be fine.
    """
    b_int = _fb(b, 'big')
    length = _len(b) and _ceil(_log((256 ** _len(b)) - 1, 36))
    return _b(_c[(b_int // 36 ** i) % 36] for i in _range(length - 1, -1, -1))
``````

and use that:

``````
def produce_amount_keys(amount_of_keys):
    def gen_keys(_urandom=os.urandom, _encode=b36encode, _randint=np.random.randint):
        # (count / math.log(256, 36)), rounded up, gives us the number of bytes
        # needed to produce *at least* count encoded characters
        factor = math.log(256, 36)
        input_length = [None] * 12 + [math.ceil(l / factor) for l in range(12, 20)]
        while True:
            count = _randint(12, 20)
            yield _encode(_urandom(input_length[count]))[-count:].decode('ascii')
    return list(islice(unique_everseen(gen_keys()), amount_of_keys))
``````

This is reasonably fast, and above all produces the full range of 36 uppercase letters and digits:

``````>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_b36 as p', number=10)
8.099918447987875
``````

Granted, the base32 version is almost twice as fast as this one (thanks to an efficient Python implementation using a table) but using a custom Base36 encoder is still twice the speed of the non-cryptographically secure `numpy.random.choice()` version.

However, using `os.urandom()` produces bias again; we have to produce more bits of entropy than is required for between 12 and 19 base36 ‘digits’. For 17 digits, for example, we can’t produce exactly 36 ** 17 different values using bytes, only the nearest equivalent of 256 ** 11 values (11 bytes), which is about 1.08 times too high, and so we’ll end up with a bias towards `A`, `B`, and to a lesser extent, `C` (thanks Stefan Pochmann for pointing this out).
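The bias can be worked out exactly rather than measured. The sketch below counts, for the 17-character case, how many of the `256 ** 11` equally likely byte values end up with each leading character once only the last 17 base36 digits are kept. The function name `first_char_counts` is an assumption made for this illustration.

```python
import string

ALPHABET = string.ascii_uppercase + string.digits

def first_char_counts(num_bytes, length):
    """For each leading character, count how many of the 256 ** num_bytes
    inputs produce it when only the last `length` base36 digits are kept."""
    total = 256 ** num_bytes
    block = 36 ** (length - 1)   # number of values sharing one leading digit
    space = 36 ** length         # number of distinct keys of this length
    counts = []
    for digit in range(36):
        count = 0
        start = digit * block
        # Each time the input range wraps around the key space, another
        # interval of inputs maps to the same leading digit.
        while start < total:
            count += min(start + block, total) - start
            start += space
        counts.append(count)
    return counts

counts = first_char_counts(11, 17)
baseline = 36 ** 16
for char, count in zip(ALPHABET[:4], counts[:4]):
    print(char, round(count / baseline, 2))
```

Under these assumptions `A` and `B` are produced twice as often as `D`, and `C` almost twice as often, which matches the bias described above.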

## Picking an integer below `(36 ** length)` and mapping integers to base36

So we need to reach out to a secure random method that can give us values evenly distributed between `0` (inclusive) and `36 ** (desired length)` (exclusive). We can then map the number directly to the desired string.

First, mapping the integer to a string; the following has been tweaked to produce the output string the fastest:

``````
def b36number(n, length, _range=range, _c=string.ascii_uppercase + string.digits):
    """Convert an integer to Base36 (uppercase ASCII and digits)"""
    chars = [_c[0]] * length
    while n:
        length -= 1
        chars[length] = _c[n % 36]
        n //= 36
    return ''.join(chars)
``````

Next, we need a fast and cryptographically secure method of picking our number in a range. You can still use `os.urandom()` for this, but then you’d have to mask the bytes down to a maximum number of bits, and then loop until your actual value is below the limit. This is actually already implemented, by the `secrets.randbelow()` function. In Python versions < 3.6 you can use `random.SystemRandom().randrange()`, which uses the exact same method with some extra wrapping to support a lower bound greater than 0 and a step size.
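To make the mask-and-retry idea concrete, here is a minimal sketch built directly on `os.urandom()`. It is an illustration of the technique only, not the code used by the `secrets` module, and the name `randbelow_sketch` is made up.

```python
import os

def randbelow_sketch(limit):
    """Return a uniformly distributed integer in [0, limit) from os.urandom()."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    bits = limit.bit_length()      # smallest bit count that can hold limit
    num_bytes = (bits + 7) // 8    # whole bytes needed for that many bits
    mask = (1 << bits) - 1         # keeps only the lowest `bits` bits
    while True:
        value = int.from_bytes(os.urandom(num_bytes), 'big') & mask
        if value < limit:          # reject and retry instead of wrapping around
            return value

key_number = randbelow_sketch(36 ** 12)
print(key_number)
```

Because the mask never allows more than twice the limit, each attempt succeeds with a probability of at least one half, so the loop rarely runs more than a few times.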

Using `secrets.randbelow()` the function becomes:

``````
import secrets

def produce_amount_keys(amount_of_keys):
    def gen_keys(_below=secrets.randbelow, _encode=b36number, _randint=np.random.randint):
        limit = [None] * 12 + [36 ** l for l in range(12, 20)]
        while True:
            count = _randint(12, 20)
            yield _encode(_below(limit[count]), count)
    return list(islice(unique_everseen(gen_keys()), amount_of_keys))
``````

and this then is quite close to the (probably biased) base64 solution:

``````>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_below as p', number=10)
5.135716405988205
``````

This is almost as fast as the Base32 approach, but produces the full range of keys!

So it’s a speed race, is it?

Building on the work of Martijn Pieters, I’ve got a solution which cleverly leverages another library for generating random strings: `uuid`.

My solution is to generate a `uuid4`, base64 encode it and uppercase it, to get only the characters we’re after, then slice it to a random length.

This works for this case because the length of outputs we’re after, (12-20), is shorter than the shortest base64 encoding of a uuid4. It’s also really fast, because `uuid` is very fast.

I also made it a generator instead of a regular function, because they can be more efficient.

Interestingly, using the standard library’s `randint` function was faster than `numpy`’s.

Here is the test output:

``````Timing 40k keys 10 times with produce_amount_keys
20.899942063027993
Timing 40k keys 10 times with produce_amount_keys, stdlib randint
20.85920040300698
Timing 40k keys 10 times with uuidgen
3.852462349983398
Timing 40k keys 10 times with uuidgen, stdlib randint
3.136272903997451
``````

Here is the code for `uuidgen()`:

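``````
from base64 import b64encode
from uuid import uuid4

def uuidgen(count, _randint=np.random.randint):
    generated = set()

    while True:
        if len(generated) == count:
            return

        candidate = b64encode(uuid4().hex.encode()).upper()[:_randint(12, 20)]
        if candidate not in generated:
            # Record the key so the duplicate check and the stop condition work
            generated.add(candidate)
            yield candidate
``````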

And here is the entire project. (At commit d9925d at the time of writing).

Thanks to feedback from Martijn Pieters, I’ve improved the method somewhat, increasing the entropy, and speeding it up by a factor of about 1/6th.

There is still a lot of entropy lost in casting all lowercase letters to uppercase. If that’s important, then it’s possibly advisable to use `b32encode()` instead, which has the characters we want, minus `0`, `1`, `8`, and `9`.

The new solution reads as follows:

``````
# Names assumed to be in scope (not shown in this excerpt): b64encode from
# base64, urandom from os, randint and choice from random, and ALLOWED_CHARS,
# a sequence of single-byte bytes values to substitute for b'/'.
def urandomgen(count):
    generated = set()

    while True:
        if len(generated) == count:
            return

        desired_length = randint(12, 20)

        # # Faster than math.ceil
        # urandom_bytes = urandom(((desired_length + 1) * 3) // 4)
        #
        # candidate = b64encode(urandom_bytes, b'//').upper()
        #
        # The above is rolled into one line to cut down on execution
        # time stemming from locals() dictionary access.

        candidate = b64encode(
            urandom(((desired_length + 1) * 3) // 4),
            b'//',
        ).upper()[:desired_length]

        while b'/' in candidate:
            candidate = candidate.replace(b'/', choice(ALLOWED_CHARS), 1)

        if candidate not in generated:
            # Record the key so the duplicate check and the stop condition work
            generated.add(candidate)
            yield candidate.decode()
``````

And the test output:

``````Timing 40k keys 10 times with produce_amount_keys, stdlib randint
19.64966493297834
Timing 40k keys 10 times with uuidgen, stdlib randint
4.063803717988776
Timing 40k keys 10 times with urandomgen, stdlib randint
2.4056471119984053
``````

The new commit in my repository is 5625fd.

Martijn’s comments on entropy got me thinking. The method I used with `base64` and `.upper()` makes letters SO much more common than numbers. I revisited the problem with a more binary mind on.

The idea was to take output from `os.urandom()`, interpret it as a long string of 6-bit unsigned numbers, and use those numbers as an index to a rolling array of the allowed characters. The first 6-bit number would select a character from the range `A..Z0..9A..Z01`, the second 6-bit number would select a character from the range `2..9A..Z0..9A..T`, and so on.

This has a slight crushing of entropy in that the first character will be slightly less likely to contain `2..9`, the second character less likely to contain `U..Z0`, and so on, but it’s so much better than before.

It’s slightly faster than `uuidgen()`, and slightly slower than `urandomgen()`, as shown below:

``````Timing 40k keys 10 times with produce_amount_keys, stdlib randint
20.440480664998177
Timing 40k keys 10 times with uuidgen, stdlib randint
3.430628580001212
Timing 40k keys 10 times with urandomgen, stdlib randint
2.0875444510020316
Timing 40k keys 10 times with bytegen, stdlib randint
2.8740892770001665
``````

I’m not entirely sure how to eliminate the last bit of entropy crushing; offsetting the start point for the characters will just move the pattern along a little, randomising the offset will be slow, shuffling the map will still have a period… I’m open to ideas.

The new code is as follows:

``````
from os import urandom
from random import randint
from string import ascii_uppercase, digits

# Masks for extracting the numbers we want from the maximum possible
# length of `urandom_bytes`.
bitmasks = [(0b111111 << (i * 6), i) for i in range(20)]
allowed_chars = (ascii_uppercase + digits) * 16  # 576 chars long

def bytegen(count):
    generated = set()

    while True:
        if len(generated) == count:
            return

        # Each character consumes 6 bits of random data
        desired_length = randint(12, 20)
        bytes_needed = (((desired_length * 6) - 1) // 8) + 1

        # Endianness doesn't matter.
        urandom_bytes = int.from_bytes(urandom(bytes_needed), 'big')

        chars = [
            allowed_chars[
                (((urandom_bytes & bitmask) >> (i * 6)) + (0b111111 * i)) % 576
            ]
            for bitmask, i in bitmasks
        ][:desired_length]

        candidate = ''.join(chars)

        if candidate not in generated:
            # Record the key so the duplicate check and the stop condition work
            generated.add(candidate)
            yield candidate
``````

And the full code, along with a more in-depth README on the implementation, is over at de0db8.

I tried several things to speed the implementation up, as visible in the repo. Something that would definitely help is a character encoding where the numbers and ASCII uppercase letters are sequential.
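The entropy crushing mentioned earlier comes from mapping 64 possible 6-bit values onto 36 characters. The following sketch counts how often each character is selected for a given starting offset into the alphabet; `char_counts` and its `offset` parameter are hypothetical names used only for this illustration.

```python
from collections import Counter
from string import ascii_uppercase, digits

ALPHABET = ascii_uppercase + digits  # 36 characters, as in the question

def char_counts(offset=0):
    """How many of the 64 possible 6-bit values select each character."""
    return Counter(ALPHABET[(value + offset) % 36] for value in range(64))

# First character position: no offset into the rolling alphabet.
counts = char_counts()
once = ''.join(sorted(c for c, n in counts.items() if n == 1))
twice = ''.join(sorted(c for c, n in counts.items() if n == 2))
print("selected once: ", once)
print("selected twice:", twice)
```

For the first position the digits `2` through `9` are reachable from only one 6-bit value each, while every other character is reachable from two, so those eight digits appear half as often. Changing the offset only moves that under-represented window to different characters.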

# Alternate approach: Uniqueness in creation rather than by test

The obvious approach to your question would be to generate random output, and then check whether it is unique. Though I do not offer an implementation, here is an alternate approach:

1. Generate output that looks as random as possible
2. Generate output that is guaranteed to be unique, and looks somewhat random
3. Combine them

Now you have output that is guaranteed to be unique, and appears to be random.

## Example

Suppose you want to generate 999999 strings with lengths from 12 to 20. The approach will of course work for all character sets, but let’s keep it simple and assume you want to use only 0-9.

1. Generate random output with lengths from 6 to 14
2. Randomly permute the numbers 000000 to 999999 (yes, 6 digits is quite a lot to ‘sacrifice’ in apparent randomness, but with a larger character set you won’t need this many characters)
3. Now combine them in a way that the uniqueness must be preserved. The most trivial way would be simple concatenation of the entities, but you can of course think of less obvious solutions.

## Small scale example

1. Generate randomness:

       sdfdsf
       xxer
       ver

2. Generate uniqueness:

       xd
       ae
       bd

3. Combine:

       xdsdfdsf
       aexxer
       bdver

Note that this method assumes that you have a minimum number of characters per entry, which seems to be the case in your question.
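The answer does not offer an implementation, so the following is only one possible sketch of the idea, using the standard library. It numbers the keys, shuffles those numbers, encodes each as a fixed-width base36 prefix (the unique part) and appends random characters (the random part). All names here (`encode`, `unique_keys`, the length parameters) are assumptions for illustration.

```python
import secrets
import string

ALPHABET = string.ascii_uppercase + string.digits
_rng = secrets.SystemRandom()  # OS-backed randomness from the standard library

def encode(n, width):
    """Fixed-width base36 representation of n, most significant digit first."""
    chars = []
    for _ in range(width):
        n, rem = divmod(n, 36)
        chars.append(ALPHABET[rem])
    return ''.join(reversed(chars))

def unique_keys(count, min_len=12, max_len=19):
    # Width of the unique part: enough base36 digits to number every key.
    width = 1
    while 36 ** width < count:
        width += 1
    if width > min_len:
        raise ValueError("too many keys for the minimum key length")

    # Unique part: every serial number appears exactly once, in random order.
    serials = list(range(count))
    _rng.shuffle(serials)

    keys = []
    for serial in serials:
        total_len = _rng.randint(min_len, max_len)
        # Random part: fills the key up to its randomly chosen total length.
        random_part = ''.join(_rng.choice(ALPHABET) for _ in range(total_len - width))
        keys.append(encode(serial, width) + random_part)
    return keys

keys = unique_keys(2000)
print(keys[:3])
```

Because the prefix has a fixed width and no two prefixes are equal, no uniqueness test is needed. The trade-off is the one described above: the prefix only looks somewhat random, since it covers just the first `count` numbers.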

Caveat: This is not cryptographically secure. I want to give an alternative `numpy` approach to the one in Martijn’s great answer.

`numpy` functions aren’t really optimised to be called repeatedly in a loop for small tasks; rather, it’s better to perform each operation in bulk. This approach gives more keys than you need (massively so in this case because I over-exaggerated the need to overestimate) and so is less memory efficient but is still super fast.

1. We know that all your string lengths are between 12 and 20. Just generate all the string lengths in one go. We know that the final `set` has the possibility of trimming down the final list of strings, so we should anticipate that and make more “string lengths” than we need. 20,000 extra is excessive, but it’s to make a point:

`string_lengths = np.random.randint(12, 20, 60000)`

2. Rather than create all our sequences in a `for` loop, create a 1D list of characters that is long enough to be cut into 40,000 lists. In the absolute worst case scenario, all the random string lengths in (1) were the max length of 20. That means we need 800,000 characters.

`pool = list(string.ascii_letters + string.digits)`

`random_letters = np.random.choice(pool, size=800000)`

3. Now we just need to chop that list of random characters up. Using `np.cumsum()` we can get sequential starting indices for the sublists, and `np.roll()` will offset that array of indices by 1, to give a corresponding array of end indices.

`starts = string_lengths.cumsum()`

`ends = np.roll(string_lengths.cumsum(), -1)`

4. Chop up the list of random characters by the indices.

`final = [''.join(random_letters[starts[x]:ends[x]]) for x, _ in enumerate(starts)]`

Putting it all together:

``````
def numpy_approach():
    pool = list(string.ascii_letters + string.digits)
    string_lengths = np.random.randint(12, 20, 60000)
    ends = np.roll(string_lengths.cumsum(), -1)
    starts = string_lengths.cumsum()
    random_letters = np.random.choice(pool, size=800000)
    final = [''.join(random_letters[starts[x]:ends[x]]) for x, _ in enumerate(starts)]
    return final
``````

And `timeit` results:

``````322 ms ± 7.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
``````
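The same bulk idea (draw all lengths at once, draw all characters at once, then slice) can be sketched with the standard library alone. This is a stand-in for the `numpy` calls, not a translation of them: it uses `random.choices()` and `itertools.accumulate()`, restricts the pool to uppercase letters and digits as in the question, derives each start index as the end index minus the length so that no slice is empty, sizes the character pool from the drawn lengths, and adds the de-duplication step. Like the `numpy` version it is not cryptographically secure, and the name `bulk_keys` is made up.

```python
import random
import string
from itertools import accumulate

def bulk_keys(amount, min_len=12, max_len=19):
    """Produce `amount` unique keys by generating characters in bulk."""
    pool = string.ascii_uppercase + string.digits
    keys = set()
    while len(keys) < amount:
        missing = amount - len(keys)
        # All string lengths for this round in one call.
        lengths = random.choices(range(min_len, max_len + 1), k=missing)
        # Running totals give the end index of every key in the character pool.
        ends = list(accumulate(lengths))
        # Exactly as many random characters as the lengths require.
        letters = ''.join(random.choices(pool, k=ends[-1]))
        # Slice the pool; each key starts where the previous one ended.
        keys.update(letters[end - length:end] for length, end in zip(lengths, ends))
    return keys

keys = bulk_keys(5000)
print(len(keys))
```

The loop only repeats if a round happened to contain duplicates, which is very unlikely for keys of this length.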

A simple and fast one:

``````
def b36(n, N, chars=string.ascii_uppercase + string.digits):
    s = ''
    for _ in range(N):
        s += chars[n % 36]
        n //= 36
    return s

def produce_amount_keys(amount_of_keys):
    keys = set()
    while len(keys) < amount_of_keys:
        N = np.random.randint(12, 20)
        keys.add(b36(secrets.randbelow(36**N), N))
    return keys
``````

Edit: The below refers to a previous revision of Martijn’s answer. After our discussion he added another solution to it, which is essentially the same as mine but with some optimizations. They don’t help much, though; it’s only about 3.4% faster than mine in my testing, so in my opinion they mostly just complicate things.

Compared with Martijn’s final solution in his accepted answer, mine is a lot simpler, about 1.7 times faster, and not biased:

``````Stefan
8.246490597876106 seconds.
8 different lengths from 12 to 19
Least common length 19 appeared 124357 times.
Most common length 16 appeared 125424 times.
36 different characters from 0 to Z
Least common character Q appeared 429324 times.
Most common character Y appeared 431433 times.
36 different first characters from 0 to Z
Least common first character C appeared 27381 times.
Most common first character Q appeared 28139 times.
36 different last characters from 0 to Z
Least common last character Q appeared 27301 times.
Most common last character E appeared 28109 times.

Martijn
14.253227412021943 seconds.
8 different lengths from 12 to 19
Least common length 13 appeared 124753 times.
Most common length 15 appeared 125339 times.
36 different characters from 0 to Z
Least common character 9 appeared 428176 times.
Most common character C appeared 434029 times.
36 different first characters from 0 to Z
Least common first character 8 appeared 25774 times.
Most common first character A appeared 31620 times.
36 different last characters from 0 to Z
Least common last character Y appeared 27440 times.
Most common last character X appeared 28168 times.
``````

Martijn’s has a bias in the first character: `A` appears far too often and `8` far too seldom. I ran my test ten times; his most common first character was always `A` or `B` (five times each), and his least common character was always `7`, `8` or `9` (two, three and five times, respectively). I also checked the lengths separately; length 17 was particularly bad, where his most common first character always appeared about 51500 times while his least common first character appeared about 25400 times.

Fun side note: I’m using the `secrets` module that Martijn dismissed 🙂

My whole script:

``````
import string
import secrets
import numpy as np
import os
from itertools import islice, filterfalse
import math

#------------------------------------------------------------------------------------
#   Stefan
#------------------------------------------------------------------------------------

def b36(n, N, chars=string.ascii_uppercase + string.digits):
    s = ''
    for _ in range(N):
        s += chars[n % 36]
        n //= 36
    return s

def produce_amount_keys_stefan(amount_of_keys):
    keys = set()
    while len(keys) < amount_of_keys:
        N = np.random.randint(12, 20)
        keys.add(b36(secrets.randbelow(36**N), N))
    return keys

#------------------------------------------------------------------------------------
#   Martijn
#------------------------------------------------------------------------------------

def b36encode(b,
        _range=range, _ceil=math.ceil, _log=math.log, _fb=int.from_bytes, _len=len, _b=bytes,
        _c=(string.ascii_uppercase + string.digits).encode()):
    b_int = _fb(b, 'big')
    length = _len(b) and _ceil(_log((256 ** _len(b)) - 1, 36))
    return _b(_c[(b_int // 36 ** i) % 36] for i in _range(length - 1, -1, -1))

def produce_amount_keys_martijn(amount_of_keys):
    def gen_keys(_urandom=os.urandom, _encode=b36encode, _randint=np.random.randint, _factor=math.log(256, 36)):
        while True:
            count = _randint(12, 20)
            yield _encode(_urandom(math.ceil(count / _factor)))[-count:].decode('ascii')
    return list(islice(unique_everseen(gen_keys()), amount_of_keys))

#------------------------------------------------------------------------------------
#   Needed for Martijn
#------------------------------------------------------------------------------------

def unique_everseen(iterable, key=None):
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

#------------------------------------------------------------------------------------
#   Benchmark and quality check
#------------------------------------------------------------------------------------

from timeit import timeit
from collections import Counter

def check(name, func):
    print()
    print(name)

    # Get 999999 keys and report the time.
    keys = None
    def getkeys():
        nonlocal keys
        keys = func(999999)
    t = timeit(getkeys, number=1)
    print(t, 'seconds.')

    # Report statistics about lengths and characters
    def statistics(label, values):
        ctr = Counter(values)
        least = min(ctr, key=ctr.get)
        most = max(ctr, key=ctr.get)
        print(len(ctr), f'different {label}s from', min(ctr), 'to', max(ctr))
        print(f'  Least common {label}', least, 'appeared', ctr[least], 'times.')
        print(f'  Most common {label}', most, 'appeared', ctr[most], 'times.')
    statistics('length', map(len, keys))
    statistics('character', ''.join(keys))
    statistics('first character', (k[0] for k in keys))
    statistics('last character', (k[-1] for k in keys))

for _ in range(2):
    check('Stefan', produce_amount_keys_stefan)
    check('Martijn', produce_amount_keys_martijn)
``````

Simply, if you use Python >= 3.6, set the length you want:

`d = 100`

or, if you want a random length within a range:

`d = random.randrange(start_range, stop_range)`

and then:

``````
import string
import random

random_str = ''.join(random.choices(string.ascii_uppercase + string.digits, k=d))
``````

And if you need a more secure way:

``````
import secrets

random_str = ''.join(secrets.choice(string.ascii_uppercase + string.digits)
                     for i in range(d))
``````