GCP Dataflow – NoneType error during WriteToBigQuery()

Question:

I'm trying to load data from a CSV file in GCS into BigQuery using Apache Beam, but I get a NoneType error when I call WriteToBigQuery. The error message:

AttributeError: 'NoneType' object has no attribute 'items' [while running 'Write to BQ/_StreamToBigQuery/StreamInsertRows/ParDo(BigQueryWriteFn)']

My pipeline code:

import apache_beam as beam
from apache_beam.pipeline import PipelineOptions
from apache_beam.io.textio import ReadFromText


options = {
    'project': project,
    'region': region,
    'temp_location': bucket,
    'staging_location': bucket,
    'setup_file': './setup.py',
}


class Split(beam.DoFn):
    def process(self, element):
        n, cc = element.split(",")
        return [{
            'n': int(n.strip('"')),
            'connection_country': str(cc.strip()),
        }]


pipeline_options = beam.pipeline.PipelineOptions(flags=[], **options)

with beam.Pipeline(options=pipeline_options) as pipeline:
    (pipeline
        | 'Read from GCS' >> ReadFromText('file_path*', skip_header_lines=1)
        | 'parse input' >> beam.ParDo(Split())
        | 'print' >> beam.Map(print)
        | 'Write to BQ' >> beam.io.WriteToBigQuery(
            'from_gcs', 'demo', schema='n:INTEGER, connection_country:STRING',
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
        )

My csv looks like this:

[Screenshot of the CSV file; image not preserved]

And the beam excerpt at the print() stage looks like this:

[Screenshot of the printed output; image not preserved]

Appreciate any help!

Asked By: dj20b22

Answers:

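You are getting that error because beam.Map emits whatever the mapped function returns, and print returns None. Every element that reaches the 'Write to BQ' step is therefore None instead of a row dict, and WriteToBigQuery fails when it calls .items() on it. You can fix it by using a function that prints the element and then returns it: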

def print_fn(element):
    print(element)
    return element

{..}
        | 'print' >> beam.Map(print_fn)  # pass print_fn itself instead of print
        | 'Write to BQ' >> beam.io.WriteToBigQuery(
{..}
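
To make the difference concrete, here is a minimal plain-Python sketch that imitates how a Map step forwards a function's return value. It is only an illustration under stated assumptions: a list comprehension stands in for beam.Map, no Beam runner is involved, and the sample row is hypothetical.

```python
def print_fn(element):
    """Print the element and pass it through unchanged."""
    print(element)
    return element


# Hypothetical sample row, shaped like the output of the Split DoFn.
rows = [{'n': 1, 'connection_country': 'US'}]

# Stand-in for beam.Map(fn): one output per input, namely fn's return value.
after_print = [print(row) for row in rows]        # print returns None
after_print_fn = [print_fn(row) for row in rows]  # print_fn returns the row
```

With print, the downstream step would see only None values; with print_fn, it sees the original dicts.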

Also, if you run this on Dataflow, the print output is not going to appear; use logging.info() instead.
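
A logging-based pass-through could look like the sketch below. The name log_fn is hypothetical, and the function would be plugged into the pipeline the same way as print_fn, for example with beam.Map(log_fn).

```python
import logging


def log_fn(element):
    """Log the element at INFO level and pass it through unchanged."""
    logging.info("element: %s", element)
    return element
```

Python's root logger defaults to the WARNING level, so when testing locally you may need logging.getLogger().setLevel(logging.INFO) before the messages show up.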

Answered By: Iñigo

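You can filter out None elements with:

def filter_none_messages(msg):
    # beam.Filter keeps an element only when this returns a truthy value.
    if msg is None:
        print("Message filtered: None")
    return msg is not None

and add | "FilterNoneMessages" >> beam.Filter(filter_none_messages) to your pipeline before the write step, so that None elements never reach WriteToBigQuery.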
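
One caveat is worth sketching in plain Python. In the question's pipeline every element leaving the 'print' step is None, so a filter placed after it would leave nothing to write. The built-in filter stands in for beam.Filter here, and the sample lists are hypothetical.

```python
def filter_none_messages(msg):
    """Predicate: keep every element except None."""
    return msg is not None


# Hypothetical output of a Map(print) step: one None per input row.
after_print = [None, None, None]
# Hypothetical output of a step that returns its rows, plus one stray None.
after_print_fn = [{'n': 1, 'connection_country': 'US'}, None]

# The built-in filter stands in for beam.Filter in this sketch.
kept_from_print = list(filter(filter_none_messages, after_print))
kept_from_print_fn = list(filter(filter_none_messages, after_print_fn))
```

Filtering is therefore useful as a guard against occasional None values, but it does not replace returning the element from the preceding step.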

Answered By: Kanti Kumari