Error 'Unable to parse file' when I run a custom template on Dataflow

Question:

I'm trying to write a custom template that reads a CSV file and writes it out to another CSV; the goal is to select only the data I need from the input. When I run the template from the web interface I get the error shown below.

I have reduced the code as much as possible to track down the error, but I still don't see it.
I followed the documentation here: https://cloud.google.com/dataflow/docs/guides/templates/creating-templates#creating-and-staging-templates

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class UploadOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--input',
            default='gs://[MYBUCKET]/input.csv',
            help='Path of the file to read from')
        parser.add_value_provider_argument(
            '--output',
            required=True,
            help='Output file to write results to.')

pipeline_options = PipelineOptions(['--output', 'gs://[MYBUCKET]/output'])
p = beam.Pipeline(options=pipeline_options)
upload_options = pipeline_options.view_as(UploadOptions)

(p
    | 'read' >> beam.io.Read(upload_options.input)
    | 'Write' >> beam.io.WriteToText(upload_options.output, file_name_suffix='.csv'))

The current error is as follows:

Unable to parse file 'gs://MYBUCKET/template.py'.

In the terminal I get the following error:

ERROR: (gcloud.dataflow.jobs.run) FAILED_PRECONDITION: Unable to parse template file 'gs://[MYBUCKET]/template.py'.
- '@type': type.googleapis.com/google.rpc.PreconditionFailure
  violations:
  - description: "Unexpected end of stream : expected '{'"
    subject: 0:0
    type: JSON

Thank you in advance

Asked By: Loïc


Answers:

I managed to solve my problem. It came from the variable I was using in the Read step of my pipeline: the custom_options variable must be used in the Read, not the known_args variable.

custom_options = pipeline_options.view_as(CustomPipelineOptions)
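
Concretely, in the context of the full example below, the difference in the Read step is roughly this (a minimal sketch, assuming ReadFromText is suitable for the CSV input):

# Wrong: a value parsed by argparse (known_args) is a plain string fixed when
# the template is built, so Dataflow cannot substitute it at run time.
# lines = p | 'Read Input path' >> beam.io.ReadFromText(known_args.path)

# Right: the PipelineOptions view exposes a ValueProvider, which the Dataflow
# service resolves when the template is actually executed.
lines = p | 'Read Input path' >> beam.io.ReadFromText(custom_options.path)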

I made the code generic and am sharing my solution in case anyone needs it.

from __future__ import absolute_import
import argparse

import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.metrics.metric import MetricsFilter
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, SetupOptions

class CustomPipelineOptions(PipelineOptions):
    """
    Runtime Parameters given during template execution
    path and organization parameters are necessary for execution of pipeline
    campaign is optional for committing to bigquery
    """
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--path',
            type=str,
            help='Path of the file to read from')
        parser.add_value_provider_argument(
            '--output',
            type=str,
            help='Output file if needed')

def run(argv=None):
    parser = argparse.ArgumentParser()
    known_args, pipeline_args = parser.parse_known_args(argv)

    global cloud_options
    global custom_options

    pipeline_options = PipelineOptions(pipeline_args)
    cloud_options = pipeline_options.view_as(GoogleCloudOptions)
    custom_options = pipeline_options.view_as(CustomPipelineOptions)
    pipeline_options.view_as(SetupOptions).save_main_session = True

    p = beam.Pipeline(options=pipeline_options)

    # ReadFromText accepts a ValueProvider, so the path can be supplied at the
    # time the template is run; the read starts from the pipeline itself.
    init_data = (p
                 | 'Read Input path' >> ReadFromText(custom_options.path))

    result = p.run()
    # result.wait_until_finish()

if __name__ == '__main__':
    run()

Then run the following command to generate the template on GCP:

python template.py --runner DataflowRunner --project $PROJECT \
    --staging_location gs://$BUCKET/staging --temp_location gs://$BUCKET/temp \
    --template_location gs://$BUCKET/templates/$TemplateName
Answered By: Loïc

I solved this issue by verifying that my Google Cloud CLI command had the correct syntax.
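
For reference, --gcs-location in gcloud dataflow jobs run must point at the staged template file (the one written by --template_location above), not at the Python source. A rough sketch of a working invocation, with a placeholder job name and the path/output parameters declared in CustomPipelineOptions:

gcloud dataflow jobs run my-template-job \
    --gcs-location gs://$BUCKET/templates/$TemplateName \
    --parameters path=gs://$BUCKET/input.csv,output=gs://$BUCKET/output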

Answered By: Jacob Orellana