How can I partition `itertools.combinations` such that I can process the results in parallel?

Question:

I have a massive number of combinations (86 choose 10, which yields roughly 3.5 trillion results) and an algorithm that can process about 500,000 combinations per second. I do not want to wait 81 days for the final results, so naturally I want to split the work into many processes to be handled by my many cores.

Consider this naive approach:

import itertools
from concurrent.futures import ProcessPoolExecutor

def algorithm(combination):
  # returns a boolean in roughly 1/500000th of a second on average
  ...

def process(combinations):
  for combination in combinations:
    if algorithm(combination):
      # will be very rare (a few hundred times out of trillions) if that matters
      print("Found matching combination!", combination) 

combination_generator = itertools.combinations(eighty_six_elements, 10)

# My system will have 64 cores and 128 GiB of memory
with ProcessPoolExecutor(max_workers=63) as executor:
  # assign 1,000,000 combinations to each process
  # it may be more performant to use larger batches (to avoid process startup overhead)
  # but eventually I need to start worrying about running out of memory
  group = []
  for combination in combination_generator:
    group.append(combination)
    if len(group) >= 1_000_000:
      executor.submit(process, group)
      group = []
  if group:
    # submit the final, partial batch as well
    executor.submit(process, group)

This code "works", but it has virtually no performance gain over a single-threaded approach, since it is bottlenecked by the generation of the combinations for combination in combination_generator.

How can I pass this computation off to the child processes so that it can be parallelized? How can each process generate a specific subset of `itertools.combinations`?

P.S. I found this answer, but it only deals with generating the single combination at a specified index, whereas I need to efficiently generate millions of consecutive combinations.

Asked By: Gaberocksall


Answers:

I’m the author of one answer to the question you already found for generating the combination at a given index. I’d start with that: compute the total number of combinations, divide that by the number of equally sized subsets you want, then compute the cut-over combination for each of them. Then run your subprocess tasks with these combinations as bounds. Within each subprocess you’d do the iteration yourself, not using itertools. It’s not hard:

def next_combination(n: int, c: list[int]):
    """Compute next combination, in lexicographical order.

    Args:
      n: the number of items to choose from.
      c: a list of integers in strictly ascending order,
         each of them between 0 (inclusive) and n (exclusive).
         It will get modified by the call.
    Returns: the list c after modification,
         or None if this was the last combination.
    """
    i = len(c)
    while i > 0:
        i -= 1
        n -= 1
        if c[i] == n: continue
        c[i] += 1
        for j in range(i + 1, len(c)):
            c[j] = c[j - 1] + 1
        return c
    return None
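
For example, this walks through all combinations of 2 elements chosen from range(4) in lexicographical order (a small sanity check of the function above):

c = [0, 1]
while c is not None:
    print(c)  # [0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]
    c = next_combination(4, c)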

Note that both the code above and the one from my other answer assume that you are looking for combinations of elements from range(n). If you want to combine other elements, do combinations for this range then use the elements of the found combinations as indices into your sequence of actual things you want to combine.
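
A minimal sketch of how this might be wired together (it assumes the question’s `algorithm`; `combination_at_index` is an unranking helper in the spirit of my other answer, and the remaining names are just for illustration):

import math
from concurrent.futures import ProcessPoolExecutor

def combination_at_index(n: int, r: int, index: int) -> list[int]:
    """Return combination number `index` (0-based, lexicographical order)
    of r elements chosen from range(n)."""
    c = []
    x = 0  # smallest element still available for the next position
    for k in range(r, 0, -1):
        # Skip whole blocks of combinations starting with x
        # until `index` falls inside one of them.
        while math.comb(n - x - 1, k - 1) <= index:
            index -= math.comb(n - x - 1, k - 1)
            x += 1
        c.append(x)
        x += 1
    return c

def process_range(n: int, r: int, start: int, stop: int):
    """Run the question's `algorithm` on combinations start..stop-1."""
    c = combination_at_index(n, r, start)
    for _ in range(start, stop):
        if algorithm(tuple(c)):
            print("Found matching combination!", c)
        if next_combination(n, c) is None:
            break

total = math.comb(86, 10)
batches = 63 * 4  # a few batches per worker helps even out the load
bounds = [total * i // batches for i in range(batches + 1)]
with ProcessPoolExecutor(max_workers=63) as executor:
    for start, stop in zip(bounds, bounds[1:]):
        executor.submit(process_range, 86, 10, start, stop)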

The main advantage of the approach above is that it ensures equal batch size, which might be useful if processing time is expected to be mostly determined by batch size. If processing time still varies greatly even for batches of the same size, that might be too much effort. I’ll post an alternative answer addressing that.

Answered By: MvG

You can use a recursive divide-and-conquer approach, deciding based on the expected number of combinations: if it is small, use itertools; if it is large, handle the case where the first element is included and the case where it is excluded in two recursive calls.

The result does not ensure batches of equal size, but it does give you an upper bound on the size of each batch. If processing time of each batch is somewhat varied anyway, that might be good enough.

import collections.abc
import itertools
import math
import typing

T = typing.TypeVar('T')

def combination_batches(
        seq: collections.abc.Sequence[T],
        r: int,
        max_batch_size: int,
        prefix: tuple[T, ...] = ()
    ) -> collections.abc.Iterator[collections.abc.Iterator[tuple[T, ...]]]:
    """Compute batches of combinations.

    Each yielded value is itself a generator over some of the combinations.
    Taken together they produce all the combinations.

    Args:
      seq: The sequence of elements to choose from.
      r: The number of elements to include in each combination.
      max_batch_size: How many elements each returned iterator
        is allowed to iterate over.
      prefix: Used during recursive calls, prepended to each returned tuple.
    Yields: generators which each generate a subset of all the combinations,
      in a way that generators together yield every combination exactly once.
    """
    if math.comb(len(seq), r) > max_batch_size:
        # One option: first element taken.
        yield from combination_batches(
            seq[1:], r - 1, max_batch_size, prefix + (seq[0],))
        # Other option: first element not taken.
        yield from combination_batches(
            seq[1:], r, max_batch_size, prefix)
        return
    yield (prefix + i for i in itertools.combinations(seq, r))

See https://ideone.com/GD6WYl for a more complete demonstration.

Note that the process pool executor can’t ship a generator to a subprocess as-is: generator objects are not picklable, so submitting one will fail outright. So instead of yielding the generator expression the way I did, you might want to yield some object which pickles more nicely but still offers iteration over the same values.
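
For example, a small picklable wrapper that stores just the inputs and regenerates the combinations inside the worker might look like this (a sketch; `CombinationBatch` is not part of the code above):

import dataclasses

@dataclasses.dataclass(frozen=True)
class CombinationBatch:
    """Picklable replacement for the yielded generator expression."""
    seq: tuple
    r: int
    prefix: tuple = ()

    def __iter__(self):
        # Regenerated lazily in whichever process iterates the batch.
        for c in itertools.combinations(self.seq, self.r):
            yield self.prefix + c

With combination_batches changed to yield CombinationBatch(tuple(seq), r, prefix) instead of the generator expression, each batch can be submitted directly to the question’s process function:

from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(max_workers=63) as executor:
    for batch in combination_batches(tuple(range(86)), 10, 1_000_000):
        executor.submit(process, batch)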

Answered By: MvG