Pick N distinct items at random from sequence of unknown length, in only one iteration

Question:

I am trying to write an algorithm that would pick N distinct items from an sequence at random, without knowing the size of the sequence in advance, and where it is expensive to iterate over the sequence more than once. For example, the elements of the sequence might be the lines of a huge file.

I have found a solution when N=1 (that is, “pick exactly one element at random from a huge sequence”):

import random
items = range(1, 10) # Imagine this is a huge sequence of unknown length
count = 1
selected = None
for item in items:
    if random.random() * count < 1:
        selected = item
    count += 1

But how can I achieve the same thing for other values of N (say, N=3)?

Asked By: akonsu

||

Answers:

Use reservoir sampling. It’s a very simple algorithm that works for any N.

Here is one Python implementation, and here is another.

Answered By: NPE

As aix mentioned reservoir sampling works. Another option is generate a random number for every number you see and select the top k numbers.

To do it iteratively, maintain a heap of k (random number, number) pairs and whenever you see a new number insert to the heap if it is greater than smallest value in the heap.

Answered By: ElKamina

It should be enough to accept or reject each new item just once, and, if you accept it, throw out a randomly chosen old item.

Suppose you have selected N items of K at random and you see a (K+1)th item. Accept it with probability N/(K+1) and its probabilities are OK. The current items got in with probability N/K, and get thrown out with probability (N/(K+1))(1/N) = 1/(K+1) so survive through with probability (N/K)(K/(K+1)) = N/(K+1) so their probabilities are OK too.

And yes I see somebody has pointed you to reservoir sampling – this is one explanation of how that works.

Answered By: mcdowella

@NPE is correct, but the implementations that are being linked to are sub-optimal and not very “pythonic”. Here’s a better implementation:

def sample(iterator, k):
    """
    Samples k elements from an iterable object.

    :param iterator: an object that is iterable
    :param k: the number of items to sample
    """
    # fill the reservoir to start
    result = [next(iterator) for _ in range(k)]

    n = k - 1
    for item in iterator:
        n += 1
        s = random.randint(0, n)
        if s < k:
            result[s] = item

    return result

Edit As @panda-34 pointed out the original version was flawed, but not because I was using randint vs randrange. The issue is that my initial value for n didn’t account for the fact that randint is inclusive on both ends of the range. Taking this into account fixes the issue. (Note: you could also use randrange since it’s inclusive on the minimum value and exclusive on the maximum value.)

Answered By: JesseBuesking

If your sequence is short enough that reading it into memory and randomly sorting it is acceptable, then a straightforward approach would be to just use random.shuffle:

import random
arr=[1,2,3,4]

# In-place shuffle
random.shuffle(arr)

# Take the first 2 elements of the now randomized array
print arr[0:2]
[1, 3]

Depending upon the type of your sequence, you may need to convert it to a list by calling list(your_sequence) on it, but this will work regardless of the types of the objects in your sequence.

Naturally, if you can’t fit your sequence into memory or the memory or CPU requirements of this approach are too high for you, you will need to use a different solution.

Answered By: Carl Bellingan

Following will give you N random items from an array X

import random
list(map(lambda _: random.choice(X), range(N)))
Answered By: Shubham Chaudhary

This was my answer to a duplicate question (closed before I could post) that was somewhat related (“generating random numbers without any duplicates”). Since, it is a different approach than the other answers, I’ll leave it here in case it provides additional insight.

from random import randint

random_nums = []
N = # whatever number of random numbers you want
r = # lower bound of number range
R = # upper bound of number range

x = 0

while x < N:
    random_num = randint(r, R) # inclusive range
    if random_num in random_nums:
        continue
    else:
        random_nums.append(random_num)
        x += 1

The reason for the while loop over the for loop is that it allows for easier implementation of non-skipping in random generation (i.e. if you get 3 duplicates, you won’t get N-3 numbers).

Answered By: tooty44

If you have python version of 3.6+ you can use choices

from random import choices

items = range(1, 10)
new_items = choices(items, k = 3)

print(new_items) 
[6, 3, 1]
Answered By: Christof Henkel

Simplest I’ve found is this answer in SO, improved a bit below:

import random

my_list = [1, 2, 3, 4, 5]
how_big = 2

new_list = random.sample(my_list, how_big)

# To preserve the order of the list, you could do:
randIndex = random.sample(range(len(my_list)), how_big)
randIndex.sort()
new_list = [my_list[i] for i in randIndex]
Answered By: Solomon Vimal

There’s one implementation from the numpy library.

Assuming that N is smaller than the length of the array, you’d have to do the following:

# my_array is the array to be sampled from
assert N <= len(my_array)
indices = np.random.permutation(N)  # Generates shuffled indices from 0 to N-1
sampled_array = my_array[indices]

If you need to sample the whole array and not just the first N positions, then you can use:

import random
sampled_array = my_array[random.sample(len(my_array), N)]
Answered By: learner
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.