When using Apache Beam Python with GCP Dataflow, does it matter if you materialize the results of GroupByKey?

Question:

When using Apache Beam Python with GCP Dataflow, is there a downside to materializing the results of GroupByKey, say, to count the number of elements? For example:

import apache_beam as beam


def consume_group_by_key(element):
    season, fruits = element

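    # Iterate the grouped values one at a time; no full list is built.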
    for fruit in fruits:
        yield f"{fruit} grows in {season}"


def consume_group_by_key_materialize(element):
    season, fruits = element

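    # list(fruits) pulls every grouped value into memory before len() runs.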
    num_fruits = len(list(fruits))
    print(f"There are {num_fruits} fruits grown in {season}")

    for fruit in fruits:
        yield f"{fruit} grows in {season}"


with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'Create produce counts' >> beam.Create([
            ('spring', 'strawberry'),
            ('spring', 'carrot'),
            ('spring', 'eggplant'),
            ('spring', 'tomato'),
            ('summer', 'carrot'),
            ('summer', 'tomato'),
            ('summer', 'corn'),
            ('fall', 'carrot'),
            ('fall', 'tomato'),
            ('winter', 'eggplant'),
        ])
        | 'Group produce per season' >> beam.GroupByKey()
        | beam.ParDo(consume_group_by_key_materialize)
    )

Are the grouped values, fruits, passed to my DoFn as a generator? Is there a performance penalty for using consume_group_by_key_materialize instead of consume_group_by_key, i.e. for materializing fruits via something like len(list(fruits))? If there are billions of fruits, will this use up all my memory?

Asked By: cozos


Answers:

You are correct: len(list(fruits)) will materialize the entire list in memory before taking its size, whereas your consume_group_by_key function iterates over the iterable once and (on distributed runners like Dataflow) does not require bringing the entire set of values into memory at once.
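
If all you need is the count, you can avoid materializing the list by counting lazily instead. Here is a minimal sketch (the function name is illustrative), relying on the fact that on Dataflow the iterable produced by GroupByKey can be iterated more than once:

def consume_group_by_key_lazy_count(element):
    season, fruits = element

    # Count by iterating once, without building a list in memory.
    num_fruits = sum(1 for _ in fruits)
    print(f"There are {num_fruits} fruits grown in {season}")

    # On Dataflow the grouped values are re-iterable, so a second pass
    # over fruits is allowed (large groups may be re-read from shuffle).
    for fruit in fruits:
        yield f"{fruit} grows in {season}"

And if the count is the only thing you need, beam.combiners.Count.PerKey() will compute per-key counts without bringing the grouped values into your DoFn at all.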

Answered By: robertwb