Variable length element sampling from list based on weights

Question:

I want to select some items from a list depending on a given weight for each element. The length of the output is not known. This needs to be done a lot of times.

So, say I have a list of [id, probability] like [[1, 0.2], [2, 0.3], [3, 0.45], [4, 0.05], [5,1.]], I want to get something like [[1,3,5], [5], [3,5], [5], [1,2,4,5], ...]

Here’s my code to do so. It works, but it’s pretty slow (the list is long, more than 10,000 elements of [id, probability], and my result is thousands of selection long). Do you know of any way to make this (much) faster?

import numpy as np

items  = [[1, 0.2], [2, 0.3], [3, 0.45], [4, 0.05], [5,1.]]

combinations = []
for n in range(1000):
    selection = []
    for i in items:
        chosen = np.random.choice([True, False], p=[i[1], 1.-i[1]])
        if chosen:
            selection.append(i[0])
    combinations.append(selection)
Asked By: William Abma

||

Answers:

You can vectorize the sampling step as follows:

import numpy as np

# items  = [[1, 0.2], [2, 0.3], [3, 0.45], [4, 0.05], [5,1.]]
items = [(i, np.random.rand()) for i in range(1000)]

def sample_original(itms, n=1000):
  combinations = []
  for n in range(n):
      selection = []
      for i in items:
          chosen = np.random.choice([True, False], p=[i[1], 1.-i[1]])
          if chosen:
              selection.append(i[0])
      combinations.append(selection)


def sample_numpy(itms, n=1000):
  elts, probs = np.array(itms).T
  m = len(elts)
  return [elts[np.random.rand(m) < probs] for _ in range(n)]

The main observation is np.random.rand(m) < probs gives a random vector of True/False that selects elements of the original list with correct probabilities. This seems to be 1000x faster when there are 1000 items:

%timeit sample_numpy(items)
%timeit sample_original(items)
10 loops, best of 3: 22.6 ms per loop
1 loop, best of 3: 21.9 s per loop

If the probabilities are relatively high (resulting selections are not sparse), you might want to store the selection indicators in a 2D array for further performance gains.

Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.