Access Apache Beam metric values during pipeline run in Python?
Question:
I’m using the direct runner of Apache Beam Python SDK to execute a simple pipeline similar to the word count example. Since I’m processing a large file, I want to display metrics during the execution. I know how to report the metrics, but I can’t find any way to access the metrics during the run.
I found the `metrics()` function on `PipelineResult`, but it seems I only get a `PipelineResult` object from the `Pipeline.run()` function, which is a blocking call. In the Java SDK I found a `MetricsSink`, which can be configured on `PipelineOptions`, but I did not find an equivalent in the Python SDK.
How can I access live metrics during pipeline execution?
Answers:
The direct runner is generally used for testing, development, and small jobs, and `Pipeline.run()` was made blocking for simplicity. On other runners `Pipeline.run()` is asynchronous and the result can be used to monitor the pipeline's progress during execution.
You could try running a local version of an OSS runner like Flink to get this behavior.
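On a runner where `run()` returns without blocking, polling the result periodically could look like the sketch below. The `watch_metrics` helper and the terminal-state set are my own illustration, not part of the Beam API (the state strings mirror the constants in `apache_beam.runners.runner.PipelineState`):

```python
import time

# Terminal pipeline states (string constants mirroring
# apache_beam.runners.runner.PipelineState).
TERMINAL_STATES = {'DONE', 'FAILED', 'CANCELLED', 'UPDATED', 'DRAINED'}

def watch_metrics(pipeline_result, interval=5.0, metrics_filter=None):
    """Print counter values every `interval` seconds until the pipeline ends.

    Assumes `pipeline_result` came from a runner whose run() is non-blocking.
    """
    while pipeline_result.state not in TERMINAL_STATES:
        for counter in pipeline_result.metrics().query(metrics_filter)['counters']:
            print(counter)
        time.sleep(interval)
    return pipeline_result.state
```

With the DirectRunner this loop never gets a chance to observe a running pipeline, since `run()` only returns once the pipeline is finished; on an asynchronous runner you would call `result = p.run()` and then `watch_metrics(result)`.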
This seems to work with the DirectRunner:

```python
counters = result.metrics().query(beam.metrics.MetricsFilter())['counters']
for metric in counters:
    print(metric)
```
As @robertwb mentioned, `DirectRunner` does not support this; but I think even if you are running the pipeline locally with the `FlinkRunner`, this is not supported either. I was expecting `Pipeline.run()` to be asynchronous, but it is not. My pipeline is batch, and when I debugged it, Flink sets `DeploymentOptions.ATTACHED` to `true` here, which blocks `Pipeline.run()` until the pipeline is done. I am guessing streaming mode does something similar here, but I have not checked it. There is also this bug.