Python very slow random sampling over big list

Question:

I’m experiencing very slow performance with the algorithm below.
I have a very large (1,000,000+) list containing long strings.

e.g.: id_list = ['MYSUPERLARGEID:1123:123123', 'MYSUPERLARGEID:1123:134534389', 'MYSUPERLARGEID:1123:12763']...

num_reads is the maximum number of elements to choose at random from this list.
The idea is to keep randomly choosing string ids from id_list until num_reads is reached, adding each one (I say add, not append, because I don’t care about the order of random_id_list) to random_id_list, which is empty at the beginning.

I can’t repeat the same id, so I remove it from the original list after it has been randomly chosen. I suspect this is what makes the script run so slowly... but maybe I’m wrong and some other part of this loop is responsible for the slow behavior.

import random

for x in xrange(0, num_reads):
    # list(enumerate(...)) rebuilds a whole (index, string) list on every pass
    id_index, id_string = random.choice(list(enumerate(id_list)))
    random_id_list.append(id_string)
    del id_list[id_index]  # O(n): every element after id_index is shifted
Asked By: gmarco


Answers:

Use random.sample() to produce a sample of num_reads elements with no repeats:

random_id_list = random.sample(id_list, num_reads)

Removing elements from the middle of a large list is indeed slow, as everything beyond that index has to be moved up a step.
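If you want to see the difference for yourself, a rough benchmark along these lines (the sizes here are made-up illustration values, not from the question, and kept deliberately modest so the slow version finishes quickly) contrasts the two approaches:

import random
import timeit

# Toy data; small enough that the remove-from-list loop still terminates fast.
id_list = ['MYSUPERLARGEID:1123:%d' % i for i in range(20000)]
num_reads = 2000

def remove_loop():
    pool = list(id_list)  # fresh copy so each run starts from the same state
    picked = []
    for _ in range(num_reads):
        i, s = random.choice(list(enumerate(pool)))  # rebuilds the list each pass
        picked.append(s)
        del pool[i]  # O(n) shift of everything after index i
    return picked

def sample_once():
    return random.sample(id_list, num_reads)  # single pass, no deletions

print(timeit.timeit(remove_loop, number=1))  # orders of magnitude slower
print(timeit.timeit(sample_once, number=1))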

This does not, of course, remove elements from the original list anymore, so repeated random.sample() calls can still give you samples with elements that have been picked before. If you need to produce samples repeatedly until your list is exhausted, then shuffle once and from there on out take consecutive slices of k elements from the shuffled list:

def random_samples(k):
    random.shuffle(id_list)  # one O(n) shuffle up front
    for i in range(0, len(id_list), k):
        yield id_list[i : i + k]  # consecutive, non-overlapping slices of k ids

then use this to produce your samples; either in a loop or with next():

sample_gen = random_samples(num_reads)
random_id_list = next(sample_gen)
# some point later
another_random_id_list = next(sample_gen)

Because the list is shuffled entirely randomly, the slices produced this way are also all valid random samples.
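For completeness, the loop form mentioned above could look like this, where process() is a hypothetical placeholder for whatever you do with each batch; note that the final slice may be shorter than num_reads when len(id_list) is not an exact multiple of it:

for sample in random_samples(num_reads):
    process(sample)  # hypothetical handler for one batch of unique ids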

Answered By: Martijn Pieters

The “hard” way, instead of just shuffling the list, is to evaluate each element of your list in order and select each item with a probability that depends on both the number of items you still need to choose and the number of items left to choose from. This is useful if you aren’t presented with the entire list at once (a so-called on-line algorithm).

Let’s say you need to select k of N items. That means each item has a k/N probability of being chosen, if you can consider all items at once. However, if you accept the first item, then you only need to select k-1 items from N-1 remaining items. If you reject it, you still need k items from N-1 remaining items. So the algorithm would look like

import random

N = len(id_list)
k = 10  # number of items still needed; 10, for example
choices = []
for item in id_list:
    # accept this item with probability (items still needed) / (items left)
    if random.randint(1, N) <= k:
        choices.append(item)
        k -= 1
    N -= 1

Initially, the first item is chosen with the expected probability of k/N. As you go through the list, N steadily decreases, while k decreases only when you actually accept items. Note that each item, overall, still has a p = k/N chance of being chosen, where k and N denote their initial values. To see this, let p_i be the probability that you choose the i-th element in the list. p_1 is obviously k/N, given the starting values of k and N. Now consider p_2:

p_2 = p_1 * (k-1)/(N-1) + (1 - p_1) * k/(N-1)
    = (p_1*k - p_1 + k - k*p_1) / (N-1)
    = (k - p_1) / (N-1)
    = (k - k/N) / (N-1)
    = k/(N-1) - k/(N*(N-1))
    = (k*N - k) / (N*(N-1))
    = k/N

A similar (but longer) analysis holds for p_3, p_4, and so on.
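As a quick empirical sanity check (not part of the original answer), you can wrap the loop above in a function and confirm that every position in the list ends up chosen roughly k/N of the time:

import random
from collections import Counter

def select_k(items, k):
    # selection sampling: accept each item with probability (still needed)/(left)
    N = len(items)
    chosen = []
    for item in items:
        if random.randint(1, N) <= k:
            chosen.append(item)
            k -= 1
        N -= 1
    return chosen

counts = Counter()
trials = 20000
for _ in range(trials):
    counts.update(select_k(range(10), 3))
for pos in range(10):
    print(pos, counts[pos] / float(trials))  # each should hover around 3/10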

Answered By: chepner