Dataflow BigQuery to BigQuery

Question:

I am trying to create a dataflow script that goes from BigQuery back to BigQuery. Our main table is massive and breaks the extraction capabilities. I’d like to create a simple table (as a result of a query) containing all the relevant information.

The SQL query 'Select * from table.orders where paid = false limit 10' is a simple one just to make sure it works; the real query is more complex and joins multiple tables within the same project.

This seems to work, but what can I do to test it out?
Also, how can I get it to run automatically every morning?

import logging
import argparse
import apache_beam as beam

PROJECT='experimental'
BUCKET='temp1/python2'


def run():
    argv = [
        '--project={0}'.format(PROJECT),
        '--job_name=test1',
        '--save_main_session',
        '--staging_location=gs://{0}/staging/'.format(BUCKET),
        '--temp_location=gs://{0}/staging/'.format(BUCKET),
        '--runner=DataflowRunner'
    ]

    with beam.Pipeline(argv=argv) as p:

        # Read the table rows into a PCollection.
        rows = p | 'read' >> beam.io.Read(beam.io.BigQuerySource(
                query =  'Select * from `table.orders` where paid = false limit 10', 
                use_standard_sql=True))

        # Write the output using a "Write" transform that has side effects.
        rows  | 'Write' >> beam.io.WriteToBigQuery(
                table='orders_test',
                dataset='external',
                project='experimental',
                schema='field1:type1,field2:type2,field3:type3',
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
Asked By: SpasticCamel


Answers:

Running daily: https://cloud.google.com/blog/big-data/2016/04/scheduling-dataflow-pipelines-using-app-engine-cron-service-or-cloud-functions
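That post covers both the App Engine cron and Cloud Functions routes. As one hedged illustration (not code from the post), a Python Cloud Function triggered daily by a Cloud Scheduler job could launch the pipeline, provided it has first been staged as a classic Dataflow template; the project id and gs:// template path below are hypothetical placeholders:

# Minimal sketch of a Pub/Sub-triggered Cloud Function that launches a
# pipeline previously staged as a classic Dataflow template.
# PROJECT and TEMPLATE_PATH are placeholders, not values from the question.
from googleapiclient.discovery import build

PROJECT = 'experimental'
TEMPLATE_PATH = 'gs://temp1/templates/orders_extract'  # assumption: template staged here


def launch_dataflow(event, context):
    """Entry point; wire it to the Pub/Sub topic your Cloud Scheduler job publishes to."""
    dataflow = build('dataflow', 'v1b3')
    request = dataflow.projects().templates().launch(
        projectId=PROJECT,
        gcsPath=TEMPLATE_PATH,
        body={'jobName': 'orders-extract-daily'},
    )
    return request.execute()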

Testing – you can try running against a smaller data set to test it. If you were running user code (not just a read/write), you could test it by using data from a file and checking the expected results. But since you are only doing a read and a write, you would need to test against BigQuery itself.
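If you do end up with user code in the middle, a minimal sketch of that style of test, feeding in-memory rows and checking the expected output (FilterUnpaid is a hypothetical stand-in for your own transform), could look like:

import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to


class FilterUnpaid(beam.DoFn):
    # Hypothetical user transform: keep only the unpaid orders.
    def process(self, row):
        if not row['paid']:
            yield row


def test_filter_unpaid():
    rows = [
        {'order_id': 1, 'paid': True},
        {'order_id': 2, 'paid': False},
    ]
    with TestPipeline() as p:
        output = (
            p
            | beam.Create(rows)           # in-memory stand-in for the BigQuery read
            | beam.ParDo(FilterUnpaid())
        )
        assert_that(output, equal_to([{'order_id': 2, 'paid': False}]))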

Answered By: Lara Schmidt

You can schedule a run every morning using Airflow.
All you need is the DataFlowPythonOperator, which executes a Dataflow pipeline stored in a .py file.

For example, given your working Dataflow pipeline in the script my_dataflow_pipe.py:

import argparse
import apache_beam as beam


def run():
    parser = argparse.ArgumentParser(description='Pipeline BGQ2BGQ')
    parser.add_argument('--job_name', required=True, type=str)
    parser.add_argument('--query', required=True, type=str)
    parser.add_argument('--project', required=True, type=str)
    parser.add_argument('--region', required=True, type=str)
    parser.add_argument('--dataset', required=True, type=str)
    parser.add_argument('--table', required=True, type=str)
    parser.add_argument('--network', required=True, type=str)
    parser.add_argument('--subnetwork', required=True, type=str)
    parser.add_argument('--machine_type', required=True, type=str)
    parser.add_argument('--max_num_workers', required=True, type=int)
    parser.add_argument('--num_workers', required=True, type=int)
    parser.add_argument('--temp_location', required=True, type=str)
    parser.add_argument('--runner', required=True, type=str)
    parser.add_argument('--labels', required=True, type=str)

    opts = parser.parse_args()
    query = opts.query.replace("\n", " ")  # collapse newlines so the query arrives as a single line
    argv = [
        f"--job_name={opts.job_name}", 
        f"--project={opts.project}", f"--region={opts.region}", 
        f"--network={opts.network}", f"--subnetwork={opts.subnetwork}",
        f"--num_workers={opts.num_workers}", f"--max_num_workers={opts.max_num_workers}",
        f"--runner={opts.runner}", f"--temp_location={opts.temp_location}", 
        f"--machine_type={opts.machine_type}", f"--labels={opts.labels}", 
    ]
        
    with beam.Pipeline(argv=argv) as p:

        # ReadFromBigQuery is a PTransform, so apply it directly rather than wrapping it in beam.io.Read.
        rows = p | 'read' >> beam.io.ReadFromBigQuery(
                query=query, use_standard_sql=True
        )

        rows  | 'Write' >> beam.io.WriteToBigQuery(
                table=f'{opts.project}:{opts.dataset}.{opts.table}',
                schema='field1:type1,field2:type2,field3:type3',
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE
        )
    
    
if __name__ == '__main__':
    run()

You can then build your Airflow DAG to trigger the execution of the Dataflow pipeline:

import datetime
from airflow import models
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
}
template_searchpath="/home/airflow/gcs/data/"

with models.DAG(
        'MYDAGNAME',
        catchup=False,
        default_args=default_args,
        template_searchpath=template_searchpath,
        start_date=datetime.datetime.now() - datetime.timedelta(days=3),
        schedule_interval='0 4 * * *',  # every day at 04:00 AM UTC
) as dag:
        
    job_name = f"MYJOB-{datetime.datetime.now().strftime('%Y%m%d%H%M')}"
    query = "SELECT field1, field2, field3 FROM MYPROJECT.XXX.xxx"
    dataflow_pyjob = DataFlowPythonOperator(
        task_id="dataflow_pyjob",
        job_name=job_name,
        py_file=template_searchpath+"my_dataflow_pipe.py",
        gcp_conn_id='MY_GCP_CONN_ID',
        options={
            'job_name':job_name, 'query':query, 
            'project':'MYPROJECT', 'region':'MYREGION',
            'dataset':'MYDATASET', 'table':'MYTAB', 
            'network':'MYNET', 'subnetwork':'MYSUBNET',
            'machine_type':'MYMACHTYPE', 
            'max_num_workers':'MYMNW', 'num_workers':'MYNW',
            'runner':'DataflowRunner', 'temp_location':'MYTMPLOC',
        },
        wait_until_finished=True,
    )

The options param contains all of the arguments required by my_dataflow_pipe.py; the operator passes them to the script as command-line flags (labels is populated automatically by Airflow).

Answered By: Marco Cerliani

I don’t have enough reputation to comment, but the link in the accepted answer is broken. The updated URL is:

https://cloud.google.com/blog/products/data-analytics/scheduling-dataflow-pipelines-using-app-engine-cron-service-or-cloud-functions

Answered By: Thomas