How to update rows in a BigQuery table using Airflow
Question:
I am coding a DAG and want to execute an UPDATE statement to selectively set the values of certain fields in certain rows. The SQL statement is easy, but I am not sure how to execute it via Airflow.
The documentation on BigQueryUpdateTableOperator says that the entire dataset will be replaced. I tried searching for a long time and could not find the right operator.
I tried putting an UPDATE statement in BigQueryInsertJobOperator and that threw an error.
How do I execute an UPDATE query on BigQuery via Airflow? My DAG is within a GCP Composer environment.
Answers:
I used BigQueryInsertJobOperator and was able to run an UPDATE statement by storing it in a SQL file and then referencing that file in the query parameter.
Please see below the code I used in my testing:
from airflow import models
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.utils.dates import days_ago

dag_id = "update-dag"
my_final_taskid = 'update-bq'
sql_file = 'my-query.sql'

with models.DAG(
    dag_id,
    schedule_interval=None,  # Override to match your needs
    start_date=days_ago(1),
    tags=["example"],
) as dag:

    update_bq_table = BigQueryInsertJobOperator(
        task_id=my_final_taskid,
        configuration={
            "query": {
                "query": sql_file,
                "useLegacySql": False,
            }
        },
    )
Content of my-query.sql:
UPDATE `your-dataset.your-table` SET your_column = 'string' WHERE another_column = 'string';
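If you would rather keep the statement inline than in a separate file, the same configuration dict can carry the SQL text directly. The sketch below is only an illustration: build_update_job_config is a hypothetical helper (not part of Airflow or the Google provider), and the table and column names are the placeholders from the query above. It is plain Python, so the resulting dict can be inspected without Airflow installed.

```python
def build_update_job_config(sql: str) -> dict:
    """Return a BigQuery query-job configuration for one SQL statement.

    Hypothetical helper for illustration; it only builds the dict that
    the operator's `configuration` argument expects.
    """
    return {
        "query": {
            "query": sql,
            # Same setting as in the answer: run the statement as standard SQL.
            "useLegacySql": False,
        }
    }


# Placeholder names taken from the answer above; replace with your own.
UPDATE_SQL = (
    "UPDATE `your-dataset.your-table` "
    "SET your_column = 'string' "
    "WHERE another_column = 'string'"
)

update_config = build_update_job_config(UPDATE_SQL)
```

Passing update_config as configuration= to BigQueryInsertJobOperator in the DAG above should behave the same as the file-based version, assuming the task has permission to run query jobs on the table.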
You have different options to do this. First of all:
The UpdateTable, CreateTable, and other ...Table operators modify things related to the table itself, such as its schema, structure, metadata, or partitioning; that is why they recreate the table.
As for the UPDATE statement, there are a couple of ways to run it. Since BigQueryExecuteQueryOperator is deprecated, you should use BigQueryInsertJobOperator, as the answer above mentioned.
More information on that operator is available here: https://registry.astronomer.io/providers/google/modules/bigqueryinsertjoboperator
Also one important note: UPDATE statements are costly in BQ, since it is not designed for row-level updates over large volumes of data. I suggest using them only for small amounts of data; otherwise, you can do something like outer joins and create temporary tables to apply the changes, which would be cheaper.
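To make the join-based alternative a bit more concrete, here is one possible sketch. It rests on my own assumptions rather than anything stated above: the new values sit in a staging table with a unique key column, and the target table is rebuilt in a single statement with a LEFT JOIN. The helper build_rebuild_sql, the corrections table, and the id and new_value columns are all hypothetical names.

```python
def build_rebuild_sql(target: str, corrections: str, column: str, key: str = "id") -> str:
    """Render a statement that rebuilds `target` using values from `corrections`.

    Hypothetical helper: a row takes `new_value` from `corrections` when a
    matching key exists and `new_value` is not NULL; otherwise it keeps its
    current value. Assumes `key` is unique in `corrections`, since duplicate
    keys would duplicate rows in the rebuilt table.
    """
    return (
        f"CREATE OR REPLACE TABLE `{target}` AS\n"
        f"SELECT t.* REPLACE (COALESCE(c.new_value, t.{column}) AS {column})\n"
        f"FROM `{target}` AS t\n"
        f"LEFT JOIN `{corrections}` AS c\n"
        f"  ON t.{key} = c.{key}"
    )


# Placeholder table and column names; replace with your own.
rebuild_sql = build_rebuild_sql(
    target="your-dataset.your-table",
    corrections="your-dataset.corrections",
    column="your_column",
)
print(rebuild_sql)
```

The rendered string can be submitted through BigQueryInsertJobOperator like any other query. Keep in mind that CREATE OR REPLACE TABLE defines a new table, so partitioning, clustering, and similar options would have to be restated in the statement, and whether this is really cheaper than an UPDATE depends on your table size and pricing model.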