Airflow: declared timestamp as a variable but got a different timestamp for each task
Question:
I'm exporting data from PostgreSQL to GCS and then from GCS to BigQuery. The DAG looks fine, but in the export path I'm using the current timestamp, like below.
import datetime

# Timestamp is captured when this module-level code runs (at DAG parse time)
date = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M")
year, month, day, hour, minutes = date.split('-')
export_suffix = year + '/' + month + '/' + day + '/' + hour + '/' + minutes
So TASK1 (export to GCS) exported the data to this path:
gs://bucket/jobs/export_tbl/2020/09/27/07/12/file.csv
But the export took 2 minutes, and when the GCS-to-BQ task started it failed. The log shows "GCS URI not found" because it is looking for the path:
gs://bucket/jobs/export_tbl/2020/09/27/07/14/file.csv
I have declared the variable at the top of my DAG.
Answers:
You shouldn't build the path like this, for several reasons:
- Since the snippet sits at the top of the Python file, it is executed every time Airflow parses the DAG folder, reassigning the variable each time, so tasks scheduled at different moments can see different timestamps.
- If you rerun any DAG run, the output path will be different. The DAG isn't idempotent, which makes results harder to debug and reproduce.
What you are probably looking for is execution_date, the logical date and time that the DAG run and its task instances are running for. Airflow exposes it to templated operator fields via macros.
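Here is a minimal sketch of the idea, using the Python stdlib only. The fixed logical_date below is a stand-in for Airflow's execution_date; in a real DAG you would instead put a templated string such as "jobs/export_tbl/{{ execution_date.strftime('%Y/%m/%d/%H/%M') }}/file.csv" into a templated operator field, and Airflow would render the same value for every task in the run:

```python
from datetime import datetime

# Hypothetical logical date, standing in for Airflow's execution_date macro.
# Every task in one DAG run shares this value, regardless of when it starts.
logical_date = datetime(2020, 9, 27, 7, 12)

# Build the suffix from the logical date instead of datetime.now()
export_suffix = logical_date.strftime("%Y/%m/%d/%H/%M")

# Both the export task and the load task derive the path the same way,
# so they agree even if the tasks run minutes apart.
task1_path = f"gs://bucket/jobs/export_tbl/{export_suffix}/file.csv"
task2_path = f"gs://bucket/jobs/export_tbl/{export_suffix}/file.csv"

assert task1_path == task2_path
print(task1_path)  # gs://bucket/jobs/export_tbl/2020/09/27/07/12/file.csv
```

Because the path is derived from the run's logical date rather than the wall clock, rerunning the DAG run also reproduces the same path, which restores idempotency.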