apache-beam

Dataflow Template launch failure on pickling

Dataflow Template launch failure on pickling Question: My Dataflow pipeline is as follows: pipeline_options = PipelineOptions( pipeline_args, streaming=True, save_main_session=True, sdk_location="container" ) with Pipeline(options=pipeline_options) as pipeline: ( pipeline | f"Read event topic" >> io.ReadFromPubSub(topic=input_topic).with_output_types(bytes) | "Convert to string" >> beam.Map(lambda msg: msg.decode("utf-8")) | f"Transform event" >> beam.Map(transform_message, event_name=event_name) | f"Write to output topic" >> beam.Map(publish_to_output_topic) ) …
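Template launch pickling failures are frequently caused by callables defined in `__main__` (lambdas, closures) that cannot be pickled along with the main session. A common workaround, sketched here under that assumption, is moving each callable to module level so it pickles by reference:

```python
# Sketch: module-level functions pickle by reference, unlike lambdas
# defined inline in __main__, which must be serialized by value and are
# a frequent cause of template launch failures.
def decode_message(msg: bytes) -> str:
    """Decode a Pub/Sub payload to text."""
    return msg.decode("utf-8")

# In the pipeline, the lambda step would become:
#   | "Convert to string" >> beam.Map(decode_message)
```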

Total answers: 1

Apache Beam pass list as argument – Python SDK

Apache Beam pass list as argument – Python SDK Question: I have an Apache Beam pipeline which takes a list as an argument and uses it in the Filter and Map functions. Since these are only available as strings, I converted them using ast.literal_eval. Is there any other better way to do the …
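Instead of passing the list as one string and round-tripping through `ast.literal_eval`, `argparse` can collect a list directly with `nargs="+"`. A minimal sketch (the option name `--allowed-values` is hypothetical):

```python
import argparse

def parse_args(argv):
    # nargs="+" collects one or more values into a Python list directly,
    # so no ast.literal_eval round-trip is needed. Unrecognized options
    # fall through to pipeline_args for PipelineOptions.
    parser = argparse.ArgumentParser()
    parser.add_argument("--allowed-values", nargs="+", default=[])
    known, pipeline_args = parser.parse_known_args(argv)
    return known.allowed_values, pipeline_args
```

The returned list can then be handed to `beam.Filter`/`beam.Map` as a side argument.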

Total answers: 1

How to deploy BigQuery and access BigQuery to public internet?

How to deploy BigQuery and access BigQuery to public internet? Question: I want to use BigQuery as the data warehouse in my company. How can all of the data team members access BigQuery? Do I install BigQuery on a Compute Engine instance over SSH? Or is no installation needed, and I just give users access in IAM and then …
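BigQuery is serverless, so nothing is installed on Compute Engine; granting IAM roles is the whole setup. A sketch, assuming the team is addressable as a Google group (project ID and group address are placeholders):

```shell
# Let the data team read datasets and run queries in the project.
gcloud projects add-iam-policy-binding my-project \
  --member="group:data-team@example.com" \
  --role="roles/bigquery.dataViewer"

gcloud projects add-iam-policy-binding my-project \
  --member="group:data-team@example.com" \
  --role="roles/bigquery.jobUser"
```

Members then query through the BigQuery console, `bq` CLI, or client libraries over the public API endpoint.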

Total answers: 2

Reading GPKG format file over Apache Beam

Reading GPKG format file over Apache Beam Question: I have a requirement to parse and load a .gpkg extension file to a BigQuery table through Apache Beam (Dataflow runner). I can see that Beam has a feature called GeoBeam, but I couldn't find a reference for loading .gpkg files. Q1: Which Beam library can help me to load …

Total answers: 1

Apache Beam – ReadFromText safely (pass over errors)

Apache Beam – ReadFromText safely (pass over errors) Question: I have a simple Apache Beam pipeline which reads compressed bz2 files and writes them out to text files. import apache_beam as beam p1 = beam.Pipeline() (p1 | 'read' >> beam.io.ReadFromText('bad_file.bz2') | 'write' >> beam.io.WriteToText('file_out.txt') ) p1.run() The problem is when the pipeline encounters a bad …
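`ReadFromText` offers no skip-on-error switch, so a common workaround is to match the files yourself (e.g. with `fileio.MatchFiles`/`ReadMatches`) and do the decompression in a DoFn that catches failures. A sketch of the error-tolerant decompression step:

```python
import bz2

def safe_decompress_lines(raw: bytes):
    # Try to decompress one whole file; on a corrupt archive, skip it
    # instead of failing the pipeline. Real code might route bad files
    # to a dead-letter output rather than silently dropping them.
    try:
        text = bz2.decompress(raw).decode("utf-8")
    except (OSError, ValueError):
        return []
    return text.splitlines()
```

Inside the pipeline this would run as a `beam.FlatMap`, emitting lines only for files that decompress cleanly.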

Total answers: 1

In GCP Dataflow/Apache Beam Python SDK, is there a time limit for DoFn.process?

In GCP Dataflow/Apache Beam Python SDK, is there a time limit for DoFn.process? Question: In Apache Beam Python SDK running on GCP Dataflow, I have a DoFn.process that takes a long time. My DoFn takes a long time for reasons that are not that important – I have to accept them due to requirements out …

Total answers: 1

Can I sort the items in an Apache beam PCollection using python?

Can I sort the items in an Apache beam PCollection using python? Question: Can I sort the items in an Apache beam PCollection using python? I need to perform an operation (transformation) that relies on the items to be sorted. But so far, I cannot find any trace of a "sorting" mechanism for the Apache …
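PCollections are unordered by design, and the Python SDK has no global sort transform (for top-N there is `beam.combiners.Top`). For data small enough to fit on one worker, a common pattern is keying everything onto a single key, grouping, and sorting inside the group. A Beam-free sketch of that sorting step:

```python
def sort_grouped(kv):
    # kv is (key, iterable_of_values) as produced by GroupByKey. The sort
    # runs on a single worker, so this only scales to modest sizes.
    _key, values = kv
    return sorted(values)

# In a pipeline this would look roughly like:
#   pcoll | beam.Map(lambda x: (None, x))
#         | beam.GroupByKey()
#         | beam.FlatMap(sort_grouped)
```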

Total answers: 1

Apache Beam – combine input with DoFn output

Apache Beam – combine input with DoFn output Question: I have a DoFn class with a process method, which takes a string and enhances it: class LolString(apache_beam.DoFn): def process(self, element: str) -> str: return element + "_lol" I want to have a step in my Beam pipeline that gives me a tuple, for example: "Stack" -> …
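Note that returning a bare string from `process` is itself a bug: Beam iterates the return value, so `"Stack_lol"` would be emitted one character at a time; `process` should yield. Pairing the input with the output is then one small change, sketched here as a plain function:

```python
def with_suffix(element: str):
    # Emit the original element together with its enhanced form. In the
    # DoFn this would be: yield (element, element + "_lol")
    return element, element + "_lol"
```

With `beam.Map(with_suffix)` (or the `yield` version inside the DoFn), the step produces ("Stack", "Stack_lol")-style tuples.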

Total answers: 1

Dataflow Job to start based on PubSub Notification – Python

Dataflow Job to start based on PubSub Notification – Python Question: I am writing a Dataflow job which reads from BigQuery and does a few transformations. data = ( pipeline | beam.io.ReadFromBigQuery(query=''' SELECT * FROM `bigquery-public-data.chicago_crime.crime` LIMIT 100 ''', use_standard_sql=True) | beam.Map(print) ) But my requirement is to read from BigQuery only after receiving a …
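A batch Dataflow job cannot itself wait on Pub/Sub; the usual pattern is a Cloud Function (or Cloud Run service) subscribed to the topic that launches the job as a template via the Dataflow API (e.g. `googleapiclient.discovery.build("dataflow", "v1b3")` and `templates().launch`). A sketch of assembling the launch request; all names here are placeholders:

```python
def build_launch_request(project: str, template_gcs_path: str,
                         job_name: str, params: dict):
    # Arguments for the Dataflow templates.launch call that a Pub/Sub-
    # triggered Cloud Function would make once the message arrives.
    return {
        "projectId": project,
        "gcsPath": template_gcs_path,
        "body": {"jobName": job_name, "parameters": params},
    }
```

The function body then only needs to decode the Pub/Sub message and pass any relevant attributes through `parameters`.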

Total answers: 2