google-cloud-dataflow

Dataflow Template launch failure on pickling

Question: My Dataflow pipeline is as follows: pipeline_options = PipelineOptions( pipeline_args, streaming=True, save_main_session=True, sdk_location="container" ) with Pipeline(options=pipeline_options) as pipeline: ( pipeline | f"Read event topic" >> io.ReadFromPubSub(topic=input_topic).with_output_types(bytes) | "Convert to string" >> beam.Map(lambda msg: msg.decode("utf-8")) | f"Transform event" >> beam.Map(transform_message, event_name=event_name) | f"Write to output topic" >> beam.Map(publish_to_output_topic) ) …

Total answers: 1
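
Pickling failures at template launch usually mean something captured by the pipeline cannot be serialized for shipment to the workers. A minimal sketch, assuming the culprit is state defined in __main__ (topic paths and the event name below are placeholders): keep the transforms as plain module-level functions and enable save_main_session.

import apache_beam as beam
from apache_beam import Pipeline, io
from apache_beam.options.pipeline_options import PipelineOptions

INPUT_TOPIC = "projects/my-project/topics/input"    # placeholder
OUTPUT_TOPIC = "projects/my-project/topics/output"  # placeholder

def transform_message(msg, event_name):
    # Module-level functions pickle cleanly; closures over non-serializable
    # local state (clients, locks, open connections) often do not.
    return f"{event_name}:{msg}"

def run(pipeline_args=None):
    options = PipelineOptions(
        pipeline_args,
        streaming=True,
        save_main_session=True,  # ship __main__ globals to the workers
    )
    with Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read event topic" >> io.ReadFromPubSub(topic=INPUT_TOPIC).with_output_types(bytes)
            | "Convert to string" >> beam.Map(lambda msg: msg.decode("utf-8"))
            | "Transform event" >> beam.Map(transform_message, event_name="my_event")
            | "Encode" >> beam.Map(lambda s: s.encode("utf-8"))
            | "Write to output topic" >> io.WriteToPubSub(topic=OUTPUT_TOPIC)
        )

if __name__ == "__main__":
    run()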

Apache Beam pass list as argument – Python SDK

Question: I have an Apache Beam pipeline that takes a list as an argument and uses it in the Filter and Map functions. Since the argument arrives as a string, I converted it using ast.literal_eval. Is there a better way to do the …

Total answers: 1
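
ast.literal_eval works, but a common alternative is to declare the argument as a custom pipeline option and let argparse do the parsing. A hedged sketch; the option name --allowed_values, the comma-separated encoding, and the sample data are made up for illustration.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument(
            "--allowed_values",
            type=lambda s: s.split(","),  # "a,b,c" -> ["a", "b", "c"]
            default=[],
            help="Comma-separated list used by Filter/Map.",
        )

def run(argv=None):
    options = MyOptions(argv)
    allowed = options.allowed_values
    with beam.Pipeline(options=options) as p:
        (
            p
            | beam.Create(["a", "b", "x"])                       # sample data
            | beam.Filter(lambda e, allowed: e in allowed, allowed)
            | beam.Map(print)
        )

if __name__ == "__main__":
    run()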

How to deploy BigQuery and access BigQuery to public internet?

Question: I want to use BigQuery as the data warehouse in my company. How can all data team members access BigQuery? Do I need to install BigQuery on a Compute Engine instance via SSH? Or is no installation needed, and I just give users access in IAM, then …

Total answers: 2
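
BigQuery is a managed, serverless service, so there is nothing to install on a Compute Engine VM; access is granted through IAM roles (for example roles/bigquery.dataViewer) or per-dataset ACLs. A hedged sketch of granting a teammate read access to one dataset with the Python client; the project, dataset, and email are placeholders.

from google.cloud import bigquery
from google.cloud.bigquery import AccessEntry

client = bigquery.Client(project="my-project")         # placeholder project
dataset = client.get_dataset("my-project.analytics")   # placeholder dataset

entries = list(dataset.access_entries)
entries.append(
    AccessEntry(role="READER", entity_type="userByEmail",
                entity_id="teammate@example.com")      # placeholder user
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])     # persist the new ACL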

Reading GPKG format file over Apache Beam

Question: I have a requirement to parse a .gpkg file and load it into a BigQuery table through Apache Beam (Dataflow runner). I can see that Beam has a companion library called geobeam, but I couldn't find a reference for loading .gpkg files. Q1: Which Beam library can help me to load …

Total answers: 1
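
geobeam does not, as far as I can tell, document a GeoPackage source, so one workaround is to read the .gpkg with fiona inside a FlatMap and write the rows to BigQuery. A hedged sketch; the path, table, and schema handling are placeholders, fiona must be installed on the workers (for example via a custom container), and GCS paths need GDAL's /vsigs/ support or a prior download.

import json
import apache_beam as beam
import fiona

def read_gpkg(path, layer=None):
    # Assumes fiona 1.x, where each feature behaves like a dict with
    # "properties" and "geometry" keys. Yields one flat row per feature.
    with fiona.open(path, layer=layer) as src:
        for feature in src:
            row = dict(feature["properties"])
            row["geometry"] = json.dumps(dict(feature["geometry"]))
            yield row

with beam.Pipeline() as p:
    (
        p
        | beam.Create(["/path/to/data.gpkg"])       # placeholder path
        | beam.FlatMap(read_gpkg)
        | beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",        # placeholder table
            schema="SCHEMA_AUTODETECT",              # or an explicit schema
        )
    )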

Apache Beam – ReadFromText safely (pass over errors)

Question: I have a simple Apache Beam pipeline which reads compressed bz2 files and writes them out to text files. import apache_beam as beam p1 = beam.Pipeline() (p1 | 'read' >> beam.io.ReadFromText('bad_file.bz2') | 'write' >> beam.io.WriteToText('file_out.txt') ) p1.run() The problem is when the pipeline encounters a bad …

Total answers: 1
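
A hedged sketch of one way to pass over unreadable files: match the files yourself and read each one inside a try/except, instead of letting ReadFromText fail the whole pipeline. The glob and output path are placeholders.

import logging
import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.io.filesystems import FileSystems

def read_lines_safely(readable_file):
    path = readable_file.metadata.path
    try:
        # FileSystems.open picks the decompressor from the extension (AUTO).
        with FileSystems.open(path) as handle:
            text = handle.read().decode("utf-8")
        for line in text.splitlines():
            yield line
    except Exception:
        # A corrupt archive fails only this file, not the whole pipeline.
        logging.exception("Skipping unreadable file: %s", path)

with beam.Pipeline() as p:
    (
        p
        | fileio.MatchFiles("data/*.bz2")   # placeholder glob
        | fileio.ReadMatches()
        | beam.FlatMap(read_lines_safely)
        | beam.io.WriteToText("file_out")
    )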

In GCP Dataflow/Apache Beam Python SDK, is there a time limit for DoFn.process?

Question: In Apache Beam Python SDK running on GCP Dataflow, I have a DoFn.process that takes a long time. My DoFn takes a long time for reasons that are not that important – I have to accept them due to requirements out …

Total answers: 1
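
The Python SDK exposes no per-element timeout knob that I know of, but very long process calls can run into Dataflow's work-item lease and stuck-worker handling. A hedged sketch of the usual mitigation: split one long-running element into many small units of work, with a Reshuffle to redistribute them, so each DoFn.process call finishes quickly. expensive_step and the chunking scheme are hypothetical.

import apache_beam as beam

def split_into_chunks(task, n_chunks=100):
    # Emit (task, chunk_index) pairs instead of doing all the work at once.
    for i in range(n_chunks):
        yield (task, i)

def expensive_step(task_chunk):
    task, i = task_chunk
    # ... do 1/n_chunks of the original work here ...
    return f"{task}-part{i}-done"

with beam.Pipeline() as p:
    (
        p
        | beam.Create(["job-a", "job-b"])   # placeholder work items
        | beam.FlatMap(split_into_chunks)
        | beam.Reshuffle()                  # spread chunks across workers
        | beam.Map(expensive_step)
        | beam.Map(print)
    )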

Dataflow Job to start based on PubSub Notification – Python

Question: I am writing a Dataflow job which reads from BigQuery and does a few transformations. data = ( pipeline | beam.io.ReadFromBigQuery(query=''' SELECT * FROM `bigquery-public-data.chicago_crime.crime` LIMIT 100 ''', use_standard_sql=True) | beam.Map(print) ) But my requirement is to read from BigQuery only after receiving a …

Total answers: 2
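
ReadFromBigQuery itself cannot be gated on an event, so one pattern is to run a streaming pipeline off the Pub/Sub subscription and issue the query from a DoFn with the BigQuery client library. A hedged sketch; the subscription path is a placeholder and the query mirrors the question.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class QueryOnMessage(beam.DoFn):
    def setup(self):
        from google.cloud import bigquery  # imported on the worker
        self._client = bigquery.Client()

    def process(self, message):
        # Runs once per Pub/Sub notification; `message` could also carry
        # parameters for the query.
        query = """
            SELECT * FROM `bigquery-public-data.chicago_crime.crime` LIMIT 100
        """
        for row in self._client.query(query).result():
            yield dict(row)

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (
        p
        | beam.io.ReadFromPubSub(subscription="projects/p/subscriptions/s")  # placeholder
        | beam.ParDo(QueryOnMessage())
        | beam.Map(print)
    )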

Python Apache Beam error "InvalidSchema: No connection adapters were found for" when request api url with spaces

Python Apache Beam error "InvalidSchema: No connection adapters were found for" when request api url with spaces Question: Following example from Apache Beam Pipeline to read from REST API runs locally but not on Dataflow pipeline requests data from api with response = requests.get(url, auth=HTTPDigestAuth(self.USER, self.PASSWORD), headers=headers) where url string url = "https://host:port/car(‘power%203’)/speed" Pipeline fails …

Total answers: 1
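
requests raises "InvalidSchema: No connection adapters were found for ..." when the string it receives does not start with a clean http(s)://, for example because of leading whitespace, a newline, or smart quotes picked up from copy/paste. A hedged sketch of sanitizing the URL before the call; the host, port, and credentials are placeholders.

from urllib.parse import quote
import requests
from requests.auth import HTTPDigestAuth

# The raw value as it might arrive: stray whitespace plus smart quotes.
raw = " https://host:8443/car(\u2018power 3\u2019)/speed\n"  # placeholder

url = raw.strip().replace("\u2018", "'").replace("\u2019", "'")
scheme, _, rest = url.partition("://")
host, slash, path = rest.partition("/")
quoted_path = quote(path, safe="/()'%")  # encode spaces, keep %20 and ()
url = scheme + "://" + host + slash + quoted_path

response = requests.get(url, auth=HTTPDigestAuth("user", "pass"))  # placeholder creds
print(response.status_code)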

GCP Dataflow Kafka and missing SSL certificates

Question: I'm trying to fetch data from Kafka into BigQuery using GCP Dataflow. My Dataflow template is based on Python SDK 2.42 + Container Registry + apache_beam.io.kafka. Here is my pipeline: def run( bq_dataset, bq_table_name, project, pipeline_options ): with Pipeline(options=pipeline_options) as pipeline: kafka = pipeline | ReadFromKafka( …

Total answers: 1
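
A hedged sketch of wiring SSL settings into ReadFromKafka. The consumer_config keys are standard Kafka client properties; the certificate paths must actually exist inside the worker container (for example baked into the custom image the template uses), which is a frequent cause of "missing certificate" errors. All values below are placeholders.

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as pipeline:
    kafka = pipeline | ReadFromKafka(
        consumer_config={
            "bootstrap.servers": "broker:9093",                   # placeholder
            "security.protocol": "SSL",
            "ssl.truststore.location": "/certs/truststore.jks",   # must exist on workers
            "ssl.truststore.password": "changeit",
            "ssl.keystore.location": "/certs/keystore.jks",
            "ssl.keystore.password": "changeit",
        },
        topics=["events"],                                        # placeholder topic
    )
    kafka | beam.Map(print)  # (key, value) byte pairs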