apache-beam

Python Apache Beam error "InvalidSchema: No connection adapters were found for" when requesting an API URL with spaces

Python Apache Beam error "InvalidSchema: No connection adapters were found for" when requesting an API URL with spaces Question: Following the example from Apache Beam Pipeline to read from REST API, the pipeline runs locally but not on Dataflow. The pipeline requests data from the API with response = requests.get(url, auth=HTTPDigestAuth(self.USER, self.PASSWORD), headers=headers) where the url string is url = "https://host:port/car('power%203')/speed" The pipeline fails …

Total answers: 1
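The InvalidSchema error from requests is typically raised when the URL string contains characters that requests cannot map to a connection adapter, such as literal spaces or smart quotes. A minimal, Beam-free sketch of percent-encoding the path before the request; the host and path below are placeholders taken from the question, not the real endpoint:

```python
from urllib.parse import quote

BASE = "https://host:port"       # placeholder host from the question
PATH = "/car('power 3')/speed"   # hypothetical unencoded path

# quote() percent-encodes spaces and quote characters; '/' and the
# parentheses are listed as safe so the path structure survives.
url = BASE + quote(PATH, safe="/()")
print(url)  # https://host:port/car(%27power%203%27)/speed
```

Encoding once, just before the request, also avoids double-encoding a path that already contains %20 sequences.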

Couldn't write data to Postgres using Apache Beam

Couldn't write data to Postgres using Apache Beam Question: I am trying to use Beam to read a CSV and send the data to Postgres, but the pipeline is failing due to a conversion mismatch. Note that this pipeline works when the two columns are of type int and fails when the type of the column contains …

Total answers: 1
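A conversion mismatch like this usually means the CSV parser hands every field to the database sink as a string. A small sketch, independent of Beam and with an invented schema, of casting each row before the write; in a pipeline this would run inside a Map or DoFn ahead of the database step:

```python
# Invented example schema; the real one comes from the Postgres table.
SCHEMA = {"id": int, "quantity": int, "price": float, "name": str}

def coerce_row(row):
    """Cast the string fields parsed from a CSV line to the column types."""
    return {col: cast(row[col]) for col, cast in SCHEMA.items()}

row = coerce_row({"id": "1001", "quantity": "3", "price": "9.50", "name": "Calabash"})
print(row)  # {'id': 1001, 'quantity': 3, 'price': 9.5, 'name': 'Calabash'}
```

Making the casts explicit also surfaces the offending column immediately, since the failing value appears in the ValueError instead of a sink-side type error.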

Apache Beam Pipeline runs with DirectRunner, but fails with DataflowRunner (SDK harness sdk-0-0 disconnected) during initial read step

Apache Beam Pipeline runs with DirectRunner, but fails with DataflowRunner (SDK harness sdk-0-0 disconnected) during initial read step Question: TL;DR We have a default VPC. We tried to run a Dataflow job. The initial step (Read file) manages to process 1 of 2 steps. We get a JOB_MESSAGE_ERROR: SDK harness sdk-0-0 disconnected error message, but nothing else appears in the logs. We have tried …

Total answers: 2

DataflowRunner "Cannot convert GlobalWindow to apache_beam.utils.windowed_value._IntervalWindowBase" using SlidingWindows yet DirectRunner works?

DataflowRunner "Cannot convert GlobalWindow to apache_beam.utils.windowed_value._IntervalWindowBase" using SlidingWindows yet DirectRunner works? Question: Why does Dataflow generate the following error when joining two streams where one has been windowed into sliding windows? TypeError: Cannot convert GlobalWindow to apache_beam.utils.windowed_value._IntervalWindowBase [while running 'B/Map(_from_proto_str)-ptransform-24'] I have created a reproducible example below that works on DirectRunner, but produces the error …

Total answers: 1

How to replace commas with semicolons, except commas in quotes, in Apache Beam Python

How to replace commas with semicolons, except commas in quotes, in Apache Beam Python Question: I want to replace the commas in the text with semicolons, except for the commas inside quotation marks. The text lines look like this: '1001,838,"Calabash, Water Spinach",2000-01-01' I tried creating a DoFn class function which I then …

Total answers: 2
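Rather than a hand-written regex, the stdlib csv module already knows which commas are delimiters and which sit inside quotes, so re-parsing and re-writing each line with a ';' delimiter is a safe way to do this inside a Map or DoFn. A sketch using the sample line from the question:

```python
import csv
import io

def commas_to_semicolons(line):
    """Re-delimit one CSV line with ';', leaving quoted commas untouched."""
    fields = next(csv.reader(io.StringIO(line)))
    out = io.StringIO()
    csv.writer(out, delimiter=";", lineterminator="").writerow(fields)
    return out.getvalue()

print(commas_to_semicolons('1001,838,"Calabash, Water Spinach",2000-01-01'))
# 1001;838;Calabash, Water Spinach;2000-01-01
```

With ';' as the new delimiter the writer no longer needs to quote the field containing a comma; pass quoting=csv.QUOTE_ALL to the writer if the original quotation marks must be preserved.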

Apache beam – look back x mins from each element

Apache beam – look back x mins from each element Question: I am trying to calculate the total number of transactions done by each customer in the last x minutes. Let’s say there are a total of 3 elements; I would like to look back over the last 5 minutes and find the sum for each customer. {"event_time": …

Total answers: 2
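In Beam this is typically modeled with sliding windows or per-key state and timers, but the arithmetic being asked for is easier to see in a plain-Python sketch; the field names follow the question's JSON and the values are invented:

```python
from datetime import datetime, timedelta

def trailing_sums(events, minutes=5):
    """For each event, sum that customer's amounts over the preceding
    `minutes`, inclusive of the event itself."""
    parsed = [(e["customer"], datetime.fromisoformat(e["event_time"]), e["amount"])
              for e in events]
    result = []
    for cust, t, _ in parsed:
        lo = t - timedelta(minutes=minutes)
        total = sum(a for c, t2, a in parsed if c == cust and lo < t2 <= t)
        result.append((cust, t.isoformat(), total))
    return result

events = [  # invented sample data
    {"customer": "A", "event_time": "2023-01-01T00:00:00", "amount": 10},
    {"customer": "A", "event_time": "2023-01-01T00:03:00", "amount": 5},
    {"customer": "A", "event_time": "2023-01-01T00:07:00", "amount": 2},
]
print(trailing_sums(events))
```

The quadratic scan is fine for a sketch; at streaming scale the same per-element trailing window is what SlidingWindows (or a sorted state buffer keyed by customer) gives you.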

What do operators '>>' and '|' mean in this case?

What do operators '>>' and '|' mean in this case? Question: import apache_beam as beam with beam.Pipeline() as pipeline: lines = pipeline | 'ReadMyFile' >> beam.io.ReadFromText( 'gs://some/inputData.txt') What I know is that '>>' means shift right and '|' is logical or. However, I do not understand what their purpose is here. Asked By: Ahmed Jabareen …

Total answers: 2
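They are not bit operations here: Beam overloads both operators on its own classes. `|` applies a transform to a pipeline or PCollection (via `__or__`), and `'label' >> transform` attaches a name to the transform (via `__rrshift__`, which Python calls on the right operand because str has no matching `>>`). A toy sketch of the same trick, not Beam's actual classes:

```python
class Transform:
    """Callable wrapper; `'label' >> Transform(fn)` names it via __rrshift__."""
    def __init__(self, fn):
        self.fn = fn
        self.label = None

    def __rrshift__(self, label):
        # Invoked for `"label" >> transform`: str defines no __rshift__
        # for this type, so Python falls back to the reflected method.
        self.label = label
        return self

class Pipeline:
    """Minimal stand-in; `pipeline | transform` chains via __or__."""
    def __init__(self, value=None):
        self.value = value

    def __or__(self, transform):
        return Pipeline(transform.fn(self.value))

p = Pipeline([1, 2, 3])
result = p | "Double" >> Transform(lambda xs: [x * 2 for x in xs])
print(result.value)  # [2, 4, 6]
```

`>>` binds more tightly than `|`, so the label attaches to the transform before the transform is applied, which is exactly the order Beam relies on.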

asynchronous API calls in apache beam

asynchronous API calls in apache beam Question: As the title says, I want to make asynchronous API calls in Apache Beam using Python. Currently, I am calling the API inside a DoFn for each element in the PCollection. DoFn code class textapi_call(beam.DoFn): def __init__(self, api_key): self.api_key = api_key def setup(self): self.session = requests.session() def process(self, …

Total answers: 2
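The usual workaround is not asyncio inside process() but batching elements (for example with beam.BatchElements upstream) and fanning the I/O out over a thread pool inside the DoFn. A Beam-free sketch with a stand-in for the HTTP call; call_api is a placeholder for something like self.session.get(...):

```python
from concurrent.futures import ThreadPoolExecutor

def call_api(text):
    """Stand-in for the real request, e.g. self.session.get(...)."""
    return text.upper()

def process_batch(batch, max_workers=8):
    """Issue the calls for one batch concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call_api, batch))

print(process_batch(["a", "b", "c"]))  # ['A', 'B', 'C']
```

Threads suit this case because the work is network-bound; the pool would normally be created once in setup() rather than per batch.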

wait_until_finished() returns UNKNOWN does not wait for pipeline to complete

wait_until_finished() returns UNKNOWN does not wait for pipeline to complete Question: We have a Dataflow pipeline which begins with extracting data from BigQuery and the data are then written to CSV in a Google Bucket using apache_beam.io's WriteToText function. Because the files are sharded we need to run a piece of code to merge the …

Total answers: 2

Apache Beam Cloud Dataflow Streaming Stuck Side Input

Apache Beam Cloud Dataflow Streaming Stuck Side Input Question: I’m currently building a PoC Apache Beam pipeline in GCP Dataflow. In this case, I want to create a streaming pipeline with its main input from PubSub and a side input from BigQuery, and store the processed data back to BigQuery. Side pipeline code side_pipeline = ( p | "periodic" >> …

Total answers: 1