Maximal Length of List to Shuffle with Python random.shuffle?

Question:

I have a list which I shuffle with the Python built in shuffle function (random.shuffle)

However, the Python reference states:

Note that for even rather small len(x), the total number of permutations of x is larger than the period of most random number generators; this implies that most permutations of a long sequence can never be generated.

Now, I wonder what this “rather small len(x)” means. 100, 1000, 10000,…

Asked By: Henrik

||

Answers:

What they mean is that permutations on n objects (noted n!) grows absurdly high very fast.

Basically n! = n x n-1 x … x 1; for example, 5! = 5 x 4 x 3 x 2 x 1 = 120 which means there are 120 possible ways of shuffling a 5-items list.

On the same Python page documentation they give 2^19937-1 as the period, which is 4.something × 10^6001 or something. Based on the Wikipedia page on factorials, I guess 2000! should be around that. (Sorry, I didn’t find the exact figure.)

So basically there are so many possible permutations the shuffle will take from that there’s probably no real reason to worry about those it won’t.

But if it really is an issue (pesky customer asking for a guarantee of randomness perhaps?), you could also offload the task to some third-party; see http://www.random.org/ for example.

Answered By: Joubarc

TL;DR: It “breaks” on lists with over 2080 elements, but don’t worry too much 🙂

Complete answer:

First of all, notice that “shuffling” a list can be understood (conceptually) as generating all possible permutations of the elements of the lists, and picking one of these permutations at random.

Then, you must remember that all self-contained computerised random number generators are actually “pseudo” random. That is, they are not actually random, but rely on a series of factors to try and generate a number that is hard to be guessed in advanced, or purposefully reproduced. Among these factors is usually the previous generated number. So, in practice, if you use a random generator continuously a certain number of times, you’ll eventually start getting the same sequence all over again (this is the “period” that the documentation refers to).

Finally, the docstring on Lib/random.py (the random module) says that “The period [of the random number generator] is 2**19937-1.”

So, given all that, if your list is such that there are 2**19937 or more permutations, some of these will never be obtained by shuffling the list. You’d (again, conceptually) generate all permutations of the list, then generate a random number x, and pick the xth permutation. Next time, you generate another random number y, and pick the yth permutation. And so on. But, since there are more permutations than you’ll get random numbers (because, at most after 2**19937-1 generated numbers, you’ll start getting the same ones again), you’ll start picking the same permutations again.

So, you see, it’s not exactly a matter of how long your list is (though that does enter into the equation). Also, 2**19937-1 is quite a long number. But, still, depending on your shuffling needs, you should bear all that in mind. On a simplistic case (and with a quick calculation), for a list without repeated elements, 2081 elements would yield 2081! permutations, which is more than 2**19937.

Answered By: rbp

I wrote that comment in the Python source originally, so maybe I can clarify 😉

When the comment was introduced, Python’s Wichmann-Hill generator had a much shorter period, and we couldn’t even generate all the permutations of a deck of cards.

The period is astronomically larger now, and 2080 is correct for the current upper bound. The docs could be beefed up to say more about that – but they’d get awfully tedious.

There’s a very simple explanation: A PRNG of period P has P possible starting states. The starting state wholly determines the permutation produced. Therefore a PRNG of period P cannot generate more than P distinct permutations (and that’s an absolute upper bound – it may not be achieved). That’s why comparing N! to P is the correct computation here. And, indeed:

>>> math.factorial(2080) > 2**19937 - 1
False
>>> math.factorial(2081) > 2**19937 - 1
True
Answered By: Tim Peters

I want to answer this question because I faced the Mersenne Twister limitation.
Let me explain. I had a list of 7200 items with a lot of repetitions that looks like this :

array = [0 for i in range(5920)] + [1 for i in range(1200)] + [2 for i in range(80)]

If you do the maths, it should be 7200! / (5920! * 1200! * 80!). However, it is pretty hard to compute and compare to 2080! because both numerator and denominator are huge.
Anyway, with a bit of simplifications, one can come to the conclusion that IT IS bigger than 2080!. In addition, both random.shuffle() and random.sample() suffer the same limitation.

Here is a solution I came up with. I’d appreciate any criticism about this method because I’m not really good with statistics :

# Python 3.11.4
from random import Random
class Shuffle :
    def __init__(self, array : list[int]) -> None :
        self._array : list[int] = array
        self._seed : Random = Random()
        self.length : int = len(array)

    def choice(self) -> float :
        if self.length > 0 :
            pick : int = self._seed.choice(self._array)
            self._array.remove(pick)
            self.length -= 1
            return pick
        raise IndexError("No more item in array !")

This way, the array is shuffled with only length random generated number vs the crazy amount of permutations I was talking about.

To make things unambiguous, this class is meant to be used the following way :

array = [0 for i in range(5920)] + [1 for i in range(1200)] + [2 for i in range(80)]
shuffler = Shuffle(array)
shuffled_array = [shuffler.choice() for i in range(shuffler.length)]
del shuffler

So the IndexError would never be raised even if you call it several times, unless you iterate more than length.

I need such algorithm because random.choice() alone doesn’t guarantee I still have the right amount of 0, 1 and 2 in the end.

I hope this could help people who try to bypass the Mersenne Twister RNG implementation for shuffling even though the initial question doesn’t go that far.

Answered By: Pehnny
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.