google-cloud-dataflow

Apache Beam Pipeline runs with DirectRunner, but fails with DataflowRunner (SDK harness sdk-0-0 disconnected) during initial read step

Question: TL;DR We have a default VPC. Tried to run a Dataflow job. The initial step (Read file) manages to process 1/2 steps. We get a JOB_MESSAGE_ERROR: SDK harness sdk-0-0 disconnected error message, but nothing else in the logs. Have tried …
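
A harness disconnect during the first read step is often a worker-resource or networking problem rather than a pipeline bug. As a minimal sketch of one common mitigation (all project, bucket, and machine names below are placeholders, not values from the question), give the workers more memory and pin the job to the intended VPC network:

from apache_beam.options.pipeline_options import PipelineOptions, WorkerOptions

# Placeholders throughout; adjust to the actual project, region and bucket.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",
    "--region=us-central1",
    "--temp_location=gs://my-bucket/tmp",
])
workers = options.view_as(WorkerOptions)
workers.machine_type = "n1-highmem-4"  # more memory per worker than the default
workers.network = "default"            # the default VPC mentioned in the question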

Total answers: 2

DataflowRunner "Cannot convert GlobalWindow to apache_beam.utils.windowed_value._IntervalWindowBase" using SlidingWindows yet DirectRunner works?

DataflowRunner "Cannot convert GlobalWindow to apache_beam.utils.windowed_value._IntervalWindowBase" using SlidingWindows yet DirectRunner works? Question: Why does Dataflow generate the following error when joining two streams where one has been windowed into sliding windows? TypeError: Cannot convert GlobalWindow to apache_beam.utils.windowed_value._IntervalWindowBase [while running ‘B/Map(_from_proto_str)-ptransform-24’] I have created a reproducible example below that works on DirectRunner, but produces the error …

Total answers: 1

asynchronous API calls in apache beam

Question: As the title says, I want to make asynchronous API calls in Apache Beam using Python. Currently, I am calling the API inside a DoFn for each element in the PCollection. DoFn code: class textapi_call(beam.DoFn): def __init__(self, api_key): self.api_key = api_key def setup(self): self.session = requests.session() def process(self, …
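
Beam's Python SDK has no built-in async DoFn, so a common workaround is to batch elements and fan the HTTP calls out over a thread pool inside the DoFn. The sketch below only illustrates that pattern; the endpoint URL and request payload are placeholders, not the question's real API:

import concurrent.futures
import apache_beam as beam
import requests

class ConcurrentApiCall(beam.DoFn):
    """Issue the API calls for one batch of elements concurrently."""

    def __init__(self, api_key):
        self.api_key = api_key

    def setup(self):
        self.session = requests.Session()
        self.pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)

    def process(self, batch):
        def call(element):
            # Placeholder request; swap in the real endpoint and payload.
            resp = self.session.post(
                "https://example.com/api",
                json={"text": element},
                headers={"x-api-key": self.api_key})
            return resp.json()

        futures = [self.pool.submit(call, e) for e in batch]
        for future in concurrent.futures.as_completed(futures):
            yield future.result()

# Usage sketch: batch first, then fan out.
# results = (elements
#            | beam.BatchElements(min_batch_size=8, max_batch_size=64)
#            | beam.ParDo(ConcurrentApiCall(api_key)))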

Total answers: 2

wait_until_finished() returns UNKNOWN does not wait for pipeline to complete

Question: We have a Dataflow pipeline which begins with extracting data from BigQuery; the data are then written to CSV in a Google Cloud Storage bucket using apache_beam.io's WriteToText function. Because the files are sharded, we need to run a piece of code to merge the …
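
For what it's worth, the method on PipelineResult is wait_until_finish(); it blocks until the job reaches a terminal state and returns that state, so the merge step can be gated on it. A minimal sketch (options and merge_shards are placeholders for the question's own code):

import apache_beam as beam
from apache_beam.runners.runner import PipelineState

p = beam.Pipeline(options=options)
# ... BigQuery read and WriteToText steps go here ...

result = p.run()
state = result.wait_until_finish()  # blocks until DONE/FAILED/CANCELLED

if state == PipelineState.DONE:
    merge_shards()  # placeholder for the code that merges the sharded CSVs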

Total answers: 2

Apache Beam Cloud Dataflow Streaming Stuck Side Input

Question: I'm currently building a PoC Apache Beam pipeline in GCP Dataflow. In this case, I want to create a streaming pipeline with the main input from PubSub and a side input from BigQuery, and store the processed data back to BigQuery. Side pipeline code: side_pipeline = ( p | "periodic" >> …
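
The usual shape of a slowly-refreshing side input in Beam Python is a PeriodicImpulse that re-reads the table on an interval, consumed by a windowed main stream. This is only a sketch of that pattern; the topic, interval, and helper functions are placeholders:

import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.periodicsequence import PeriodicImpulse

side = (
    p
    | "Periodic" >> PeriodicImpulse(fire_interval=300, apply_windowing=True)
    | "ReadConfig" >> beam.Map(lambda _: read_config_from_bigquery())  # placeholder
)

main = (
    p
    | "ReadPubSub" >> beam.io.ReadFromPubSub(topic="projects/p/topics/t")  # placeholder
    | "Window" >> beam.WindowInto(window.FixedWindows(300))
    | "Enrich" >> beam.Map(enrich, config=beam.pvalue.AsSingleton(side))  # placeholder
)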

Total answers: 1

Job graph too large to submit to Google Cloud Dataflow

Question: I am trying to run a job on Dataflow, and whenever I try to submit it to run with DataflowRunner, I receive the following errors from the service: { "code" : 400, "errors" : [ { "domain" : "global", "message" : "Request payload size …
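
This 400 usually means the serialized job graph itself is too big, typically because large literal data ends up in the graph (for example through beam.Create) or the pipeline has a very large number of steps. A hedged sketch of two common mitigations, with placeholder paths and project names:

from apache_beam.options.pipeline_options import PipelineOptions

# Let Dataflow fetch the job graph from GCS instead of embedding it in the request.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",               # placeholder
    "--temp_location=gs://my-bucket/tmp", # placeholder
    "--experiments=upload_graph",
])

# Prefer reading large inputs at runtime over embedding them with beam.Create:
# lines = p | beam.io.ReadFromText("gs://my-bucket/big-input/*")  # placeholder path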

Total answers: 2

Side Input data doesn't get updated – Python Apache Beam

Question: I'm building a pipeline with dynamic configuration data that gets updated whenever it is triggered. There are 2 PubSub topics: topic A for the IoT data and topic B for the configuration that will be used to transform the IoT data. The configuration is kept …
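
A frequent cause is that the configuration stream stays in the global window and is captured once, so later messages on topic B never reach the transform. A sketch of the slowly-updating-side-input pattern (topic names and apply_config are placeholders): window the main stream, and re-trigger the configuration stream on every new element:

import apache_beam as beam
from apache_beam.transforms import trigger, window

iot = (
    p
    | "ReadIoT" >> beam.io.ReadFromPubSub(topic="projects/p/topics/topic-a")
    | "WindowIoT" >> beam.WindowInto(window.FixedWindows(60))
)

config = (
    p
    | "ReadConfig" >> beam.io.ReadFromPubSub(topic="projects/p/topics/topic-b")
    | "RefireConfig" >> beam.WindowInto(
        window.GlobalWindows(),
        trigger=trigger.Repeatedly(trigger.AfterCount(1)),
        accumulation_mode=trigger.AccumulationMode.DISCARDING)
)

enriched = iot | "Apply" >> beam.Map(apply_config, cfg=beam.pvalue.AsList(config))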

Total answers: 2

Import Error: No module named 'google.cloud' on ApacheBeam

Question: I get an import error when importing Apache Beam's Google Datastore API. I have one version of Python 3 installed on my Windows 10 64-bit system. Can somebody help me? I have tried to solve it but I can't. # -*- coding: utf-8 -*- import apache_beam as …
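
This import error typically means apache-beam was installed without its GCP extras, which is what pulls in the google.cloud packages. A minimal sketch of the import once the extra is installed; the exact module path below matches recent Beam releases and may differ in older ones:

# Requires: pip install "apache-beam[gcp]"
import apache_beam as beam
from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore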

Total answers: 4

Why did I encounter an "Error syncing pod" with Dataflow pipeline?

Question: I am experiencing a weird error with my Dataflow pipeline when I want to use a specific library from PyPI. I need jsonschema in a ParDo, so, in my requirements.txt file, I added jsonschema==3.2.0. I launch my pipeline with the command line below: python -m gcs_to_all …
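
"Error syncing pod" on Dataflow is frequently the worker container failing to install the staged Python dependencies, including transitive ones that requirements.txt does not list. One common mitigation, sketched below with a placeholder package name, is to ship the dependency through a setup.py passed via --setup_file so pip resolves the full dependency tree on the workers:

# setup.py (placeholder package name); launch the job with --setup_file ./setup.py
import setuptools

setuptools.setup(
    name="gcs_to_all",
    version="0.1.0",
    install_requires=["jsonschema==3.2.0"],
    packages=setuptools.find_packages(),
)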

Total answers: 2

Ways of using value provider parameter in Python Apache Beam

Question: Right now I'm only able to grab the runtime value inside a class using a ParDo; is there another way to use the runtime parameter, for example in my functions? This is the code I have right now: class UserOptions(PipelineOptions): @classmethod def _add_argparse_args(cls, …
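
Runtime (value provider) parameters are only resolvable once the job is running, so the usual pattern is to pass the ValueProvider object into the DoFn or function and call .get() inside process(). A minimal sketch, with --input_path as a hypothetical parameter name:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Deferred (runtime) parameter rather than a construction-time one.
        parser.add_value_provider_argument("--input_path", type=str)

class UseRuntimeValue(beam.DoFn):
    def __init__(self, input_path):
        self.input_path = input_path  # a ValueProvider, not a plain string

    def process(self, element):
        path = self.input_path.get()  # only valid while the job is running
        yield (path, element)

# Usage sketch:
# user_options = PipelineOptions().view_as(UserOptions)
# pcoll | beam.ParDo(UseRuntimeValue(user_options.input_path))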

Total answers: 1