Access Apache Beam metrics values during pipeline run in Python?

Question:

I’m using the direct runner of Apache Beam Python SDK to execute a simple pipeline similar to the word count example. Since I’m processing a large file, I want to display metrics during the execution. I know how to report the metrics, but I can’t find any way to access the metrics during the run.

I found the metrics() function on PipelineResult, but it seems I only get a PipelineResult object from Pipeline.run(), which is a blocking call. The Java SDK has a MetricsSink that can be configured on PipelineOptions, but I did not find an equivalent in the Python SDK.

How can I access live metrics during pipeline execution?

Asked By: aKzenT


Answers:

The direct runner is generally used for testing, development, and small jobs, and Pipeline.run() was made blocking for simplicity. On other runners Pipeline.run() is asynchronous and the result can be used to monitor the pipeline progress during execution.

You could try running a local version of an OSS runner like Flink to get this behavior.

Answered By: robertwb

This seems to work with the DirectRunner:

# 'result' is the PipelineResult returned by Pipeline.run()
counters = result.metrics().query(beam.metrics.MetricsFilter())['counters']
for metric in counters:
    print(metric)
Answered By: Reinaldo Aguiar

As @robertwb mentioned, DirectRunner does not support this; but I think even if you run the pipeline locally with FlinkRunner, this is still not supported. I expected Pipeline.run() to be asynchronous, but it is not. My pipeline is a batch pipeline, and when I debugged it, I found that DeploymentOptions.ATTACHED is set to true here, which blocks Pipeline.run() until the pipeline is done. I am guessing streaming mode does something similar here, but I have not checked. There is also this bug.

Answered By: bashir