Correct way to define an Apache Beam pipeline

Question:

I am new to Beam and struggling to find many good guides and resources to learn best practices.

One thing I have noticed is there are two ways pipelines are defined:

with beam.Pipeline() as p:
    # pipeline code in here

Or

p = beam.Pipeline()
# pipeline code in here
result = p.run()
result.wait_until_finish()

Are there specific situations in which each method is preferred?

Asked By: dendog

||

Answers:

Judging from the code snippets, the main difference is whether you care about the pipeline result. If you want to use the PipelineResult to monitor the pipeline's status or to cancel the pipeline from your code, use the second style.

Answered By: Rui Wang

I think they are functionally equivalent, since the __exit__ method of the Pipeline context manager executes the same code:
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/pipeline.py#L426

Answered By: Yichi Zhang

As pointed out by Yichi Zhang, Pipeline.__exit__ sets .result, so you can do:

with beam.Pipeline() as p:
  ...

result = p.result

The context manager version is cleaner, as it can clean up correctly when errors are raised inside the with block.
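To make the behavior concrete, here is a toy model of it. ToyPipeline and ToyResult are hypothetical stand-ins written for this sketch, not Beam classes, and the sketch assumes (as the linked source suggests) that the pipeline is only run when the with block exits without an exception.

```python
class ToyResult:
    """Hypothetical stand-in for a pipeline result; not a Beam class."""

    def __init__(self):
        self.state = "RUNNING"

    def wait_until_finish(self):
        self.state = "DONE"
        return self.state


class ToyPipeline:
    """Hypothetical stand-in that mimics the context-manager behavior."""

    def __init__(self):
        self.result = None
        self.run_calls = 0

    def run(self):
        self.run_calls += 1
        return ToyResult()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Run only when the with block finished without an exception.
        if exc_type is None:
            self.result = self.run()
            self.result.wait_until_finish()
        return False  # never swallow the exception


# Style 1: context manager; the result is still reachable afterwards.
with ToyPipeline() as p1:
    pass
state_with = p1.result.state

# Style 2: explicit run and wait.
p2 = ToyPipeline()
result2 = p2.run()
state_explicit = result2.wait_until_finish()

# An error inside the with block skips run() entirely.
p3 = ToyPipeline()
try:
    with p3:
        raise ValueError("construction failed")
except ValueError:
    pass
```

In the toy model both styles end in the same state, and a half-built pipeline is never launched when construction fails.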

Answered By: Conchylicultor

Are there specific situations in which each method is preferred?

To answer this question, you can take a look at the Pipeline context manager implementation source code here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/pipeline.py#L587

As you can see, the context manager runs the pipeline and blocks until it finishes:

self.result = self.run()
self.result.wait_until_finish()

This is exactly what you can do explicitly with the second approach, and a lot more.

So the general rule is: if you need to control whether your pipeline blocks, go with the second approach; otherwise, use the context manager.
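One way to keep that decision in a single place is a small wrapper around run(). The launch helper and the Stub classes below are hypothetical names invented for this sketch; the stubs only exist so the example runs without a runner, and any object with a run() method that returns something with wait_until_finish() could be passed instead.

```python
class StubResult:
    """Hypothetical stand-in for a pipeline result."""

    def __init__(self):
        self.waited = False

    def wait_until_finish(self):
        self.waited = True


class StubPipeline:
    """Hypothetical stand-in exposing only run()."""

    def run(self):
        return StubResult()


def launch(pipeline, block=True):
    """Hypothetical helper: start the pipeline and optionally wait for it."""
    result = pipeline.run()
    if block:
        result.wait_until_finish()
    return result


# Blocking launch, which is what the context manager does on exit.
batch_result = launch(StubPipeline(), block=True)

# Fire-and-forget launch: submit the job and return immediately.
streaming_result = launch(StubPipeline(), block=False)
```

The caller gets the result object back in both cases, so it can still be monitored later even when the launch did not block.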

Answered By: fuyi

If you are deploying streaming pipelines, I suggest using the second option and not calling wait_until_finish, so your pipeline gets deployed but your code won’t wait until the end of time.

Answered By: mani