Two functions, one generator

Question:

I have two functions that both take an iterator as input. Is there a way to write one generator that I can supply to both functions, without resetting it or making a second pass through the data? I want to do a single pass over the data but feed the output to both functions. For example:

def my_generator(data):
    for row in data:
        yield row

gen = my_generator(data)
func1(gen)
func2(gen)

I know I could create two different generator instances, or reset the generator between the two calls, but I was wondering if there is a way to avoid two passes over the data. Note that func1/func2 are themselves NOT generators; if they were, I could build a pipeline.

The point here is to try and avoid a second pass over the data.

Asked By: bcollins


Answers:

Python has an amazing catalog of handy functions. The ones related to iterators live in the itertools module:

import itertools

def my_generator(data):
    for row in data:
        yield row

gen = my_generator(data)
gen1, gen2 = itertools.tee(gen)
func1(gen1)
func2(gen2)

However, this only makes sense if func1 and func2 don’t consume all the elements, because if they do, itertools.tee() has to remember all the elements produced by gen until gen2 is used.
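A small sketch of that buffering behavior: if one branch is fully consumed before the other is touched, tee must hold every yielded value in memory so the second branch can still replay them (the numbers below are illustrative):

```python
import itertools

source = iter(range(100_000))
g1, g2 = itertools.tee(source)

first_total = sum(g1)   # g1 fully consumed; tee buffers every value for g2
second_total = sum(g2)  # g2 replays the buffered values, not the source

print(first_total == second_total)  # both branches see the full sequence
```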

To get around this, consume only a few elements at a time. Alternatively, change func1 to call func2, or even turn func1 into a lazy generator that yields its input, and pipe that into func2.
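The lazy-generator idea can be sketched like this; the bodies of func1_gen and func2 below are illustrative assumptions, not taken from the question:

```python
def func1_gen(iterable):
    """Hypothetical lazy version of func1: it does its own work on each
    item, then yields the item on so a downstream consumer sees the
    same single pass."""
    total = 0
    for x in iterable:
        total += x   # func1's work (illustrative)
        yield x      # pass the item along unchanged

def func2(iterable):
    # func2's work (illustrative): sum of squares
    return sum(x * x for x in iterable)

data = range(5)
result = func2(func1_gen(iter(data)))  # one pass over the data
print(result)  # 0 + 1 + 4 + 9 + 16 = 30
```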

Answered By: Georg Schölly

You can either cache the generator's results in a list, or reset the generator to pass the data into func2. The problem is that with two loops you need to iterate over the data twice, so you either load the data again and create a new generator, or you cache the entire result.

Solutions like itertools.tee will also just create two iterators, which is basically the same as resetting the generator after the first iteration. It is syntactic sugar, but it won't change the situation in the background.

If you are working with big data here, you have to merge func1 and func2:

for a in gen:
    f1(a)
    f2(a)

In practice it can be a good idea to design code this way, so you have full control over the iteration process and can associate/compose maps and filters over a single iteration.
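A minimal sketch of that merged-loop design, with the two consumers inlined into one pass (the particular operations below are illustrative assumptions):

```python
def gen(rows):
    # Simple pass-through generator, as in the question.
    for row in rows:
        yield row

data = [1, 2, 3, 4, 5]

running_sum = 0   # work that "func1" would have done
count_even = 0    # work that "func2" would have done

# One loop, one pass over the data, both consumers fed per item.
for a in gen(data):
    running_sum += a
    if a % 2 == 0:   # an inline filter composed into the same pass
        count_even += 1

print(running_sum, count_even)  # 15 2
```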

Answered By: Nicolas Heimann

If using threads is an option, the generator may be consumed just once without having to store a possibly unpredictable number of yielded values between calls to the consumers. The following example runs the consumers in lock-step; Python 3.2 or later is needed for this implementation:

import threading


def generator():
    for x in range(10):
        print('generating {}'.format(x))
        yield x


def tee(iterable, n=2):
    barrier = threading.Barrier(n)
    state = dict(value=None, stop_iteration=False)

    def repeat():
        while True:
            if barrier.wait() == 0:
                try:
                    state.update(value=next(iterable))
                except StopIteration:
                    state.update(stop_iteration=True)
            barrier.wait()
            if state['stop_iteration']:
                break
            yield state['value']

    return tuple(repeat() for i in range(n))


def func1(iterable):
    for x in iterable:
        print('func1 consuming {}'.format(x))


def func2(iterable):
    for x in iterable:
        print('func2 consuming {}'.format(x))


gen1, gen2 = tee(generator(), 2)

thread1 = threading.Thread(target=func1, args=(gen1,))
thread1.start()

thread2 = threading.Thread(target=func2, args=(gen2,))
thread2.start()

thread1.join()
thread2.join()

Answered By: Thomas Lotze

It's a little late, but maybe this will be helpful for someone. For simplicity I have added only one ChildClass, but the idea is to have multiple of them:

class BaseClass:
    def on_yield(self, value: int):
        raise NotImplementedError()
    def summary(self):
        raise NotImplementedError()

class ChildClass(BaseClass):
    def __init__(self):
        self._aggregated_value = 0
    def on_yield(self, value: int):
        self._aggregated_value += value
    def summary(self):
        print(f"Aggregated value={self._aggregated_value}")

class Generator():
    def my_generator(self, data):
        for row in data:
            yield row
    def calculate(self, generator, classes):
        for index, value in enumerate(generator):
            print(f"index={index}")
            for _class in classes:
                _class.on_yield(value)
        for _class in classes:
            _class.summary()

if __name__ == '__main__':
    child_classes = [ ChildClass(), ChildClass() ]
    generator = Generator()
    my_generator = generator.my_generator([1, 2, 3])
    generator.calculate(my_generator, child_classes)

The output of this is:

index=0
index=1
index=2
Aggregated value=6
Aggregated value=6
Answered By: Artur Tomczak