Airflow: declared a timestamp as a variable but got a different timestamp for each task

Question:

I'm exporting data from PostgreSQL to GCS and then from GCS to BigQuery. The DAG looks fine, but in the export path I'm using the current timestamp, as shown below.

import datetime

date=str(datetime.datetime.now().strftime("%Y-%m-%d-%H-%M"))
year=date.split('-')[0]
month=date.split('-')[1]
day=date.split('-')[2]
hour=date.split('-')[3]
minutes=date.split('-')[4]
export_suffix = year + '/' + month + '/' + day + '/' + hour + '/' + minutes

So Task 1 (export to GCS) exported the data to this path:

gs://bucket/jobs/export_tbl/2020/09/27/07/12/file.csv

But the export took about 2 minutes, and when the GCS-to-BigQuery task started it failed. The log shows a "GCS URI not found" error, and the task is looking for the path

gs://bucket/jobs/export_tbl/2020/09/27/07/14/file.csv

I have declared the variable at the top of my DAG file.

Asked By: TheDataGuy


Answers:

You shouldn't set the path like this, for several reasons:

  • It sounds like you put this code snippet at the top of the Python file, which means it is executed every time Airflow parses the DAG folder, and the variable is reassigned each time (see the sketch after this list).
  • If you plan to rerun any DAG run, the output path will be different. The DAG isn't idempotent, which makes it harder to debug and reproduce results.
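
To illustrate the first point, here is a minimal sketch in plain Python, mirroring the snippet from the question: anything at module level is evaluated when the file is imported. The scheduler re-parses DAG files continuously, and each worker imports the file again when it runs a task, so the value drifts between tasks.

import datetime

# Evaluated at import/parse time, NOT when a task instance runs.
# The scheduler re-parses this file repeatedly, and every worker process
# imports it again before executing a task, so each task can see a
# different timestamp here.
export_suffix = datetime.datetime.now().strftime("%Y/%m/%d/%H/%M")
print(export_suffix)  # e.g. 2020/09/27/07/12 on one parse, .../07/14 on a later one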

I guess you are rather looking for execution_date, which is the logical date and time for which the DAG run, and its task instances, are running. Airflow supports accessing it via macros in templated operator fields, so every task in the same DAG run renders the same value.
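
A minimal sketch of the idea: build the path as a Jinja template from execution_date and pass it to templated fields, so both tasks in the same run resolve to the same location. The BashOperator here is only a stand-in for the real Postgres-to-GCS and GCS-to-BigQuery operators, and the DAG id, schedule, and import path (which differs between Airflow 1.10.x and 2.x) are assumptions for illustration.

import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # on Airflow 1.10.x: airflow.operators.bash_operator

# A Jinja template: rendered per task instance from the DAG run's logical
# date (execution_date), so every task in the same run gets the same path.
EXPORT_PATH = (
    "gs://bucket/jobs/export_tbl/"
    "{{ execution_date.strftime('%Y/%m/%d/%H/%M') }}/file.csv"
)

with DAG(
    dag_id="pg_to_gcs_to_bq",
    start_date=datetime.datetime(2020, 9, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # Stand-ins for the real export/load operators; bash_command is a
    # templated field, so the path renders identically in both tasks.
    export_to_gcs = BashOperator(
        task_id="export_to_gcs",
        bash_command="echo exporting to " + EXPORT_PATH,
    )
    load_to_bq = BashOperator(
        task_id="load_to_bq",
        bash_command="echo loading from " + EXPORT_PATH,
    )

    export_to_gcs >> load_to_bq

Because the rendered value comes from the DAG run's logical date rather than the wall clock, it also stays stable across retries and reruns of the same run, which keeps the DAG idempotent.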

Answered By: Philipp Johannis