When using Apache Beam Python with GCP Dataflow, does it matter if you materialize the results of GroupByKey?
Question:
When using Apache Beam Python with GCP Dataflow, is there a downside to materializing the results of GroupByKey, say, to count the number of elements? For example:
def consume_group_by_key(element):
    season, fruits = element
    for fruit in fruits:
        yield f"{fruit} grows in {season}"

def consume_group_by_key_materialize(element):
    season, fruits = element
    num_fruits = len(list(fruits))
    print(f"There are {num_fruits} fruits grown in {season}")
    for fruit in fruits:
        yield f"{fruit} grows in {season}"
(
    pipeline
    | 'Create produce counts' >> beam.Create([
        ('spring', 'strawberry'),
        ('spring', 'carrot'),
        ('spring', 'eggplant'),
        ('spring', 'tomato'),
        ('summer', 'carrot'),
        ('summer', 'tomato'),
        ('summer', 'corn'),
        ('fall', 'carrot'),
        ('fall', 'tomato'),
        ('winter', 'eggplant'),
    ])
    | 'Group counts per produce' >> beam.GroupByKey()
    | beam.ParDo(consume_group_by_key)
)
Are the grouped values, fruits, passed to my DoFn as a generator? Is there a performance penalty for using consume_group_by_key_materialize instead of consume_group_by_key, or in other words, for materializing fruits via something like len(list(fruits))? If there are billions of fruits, will this use up all my memory?
Answers:
You are correct: len(list(fruits)) will materialize the entire list in memory before taking its size, whereas your consume_group_by_key function iterates over the iterable once and (on distributed runners like Dataflow) does not need to bring the entire set of values into memory at once.
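If you need the count but want to keep memory bounded, you can count by iterating instead of building a list. Here is a minimal sketch, assuming you run on Dataflow, where the grouped values are re-iterable (the function name is illustrative):

def consume_group_by_key_lazy_count(element):
    season, fruits = element
    # Stream the values once to count them; unlike list(fruits),
    # this holds only one element in memory at a time.
    num_fruits = sum(1 for _ in fruits)
    print(f"There are {num_fruits} fruits grown in {season}")
    # Second pass: on Dataflow the grouped values are re-iterable,
    # so this re-reads them from shuffle rather than from memory.
    for fruit in fruits:
        yield f"{fruit} grows in {season}"

The trade-off is two passes over the values (the second pass re-fetches them from the shuffle service), but memory use stays constant instead of growing with the size of the group.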
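If the per-key counts are all you actually need, another option is to let Beam compute them with a combiner, which can pre-aggregate partial counts on each worker before the shuffle, so no single worker ever has to hold all values for a key at once. A sketch, with illustrative step names and data:

import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'Create produce pairs' >> beam.Create([
            ('spring', 'strawberry'),
            ('spring', 'carrot'),
            ('summer', 'corn'),
        ])
        # Count.PerKey combines partial counts before the shuffle,
        # so the full list of values per key is never materialized.
        | 'Count per season' >> beam.combiners.Count.PerKey()
        | 'Format counts' >> beam.MapTuple(
            lambda season, n: f"There are {n} fruits grown in {season}")
        | 'Print counts' >> beam.Map(print)
    )

This sidesteps the question entirely for the counting step, though you would still keep the single-pass generator approach (or the GroupByKey) for the part that emits one output per fruit.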