External files in Airflow DAG
Question:
I’m trying to access external files in an Airflow task to read some SQL, and I’m getting “file not found”. Has anyone come across this?
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta

dag = DAG(
    'my_dat',
    start_date=datetime(2017, 1, 1),
    catchup=False,
    schedule_interval=timedelta(days=1)
)

def run_query():
    # read the query
    query = open('sql/queryfile.sql')
    # run the query
    execute(query)

tas = PythonOperator(
    task_id='run_query', dag=dag, python_callable=run_query)
The log states the following:
IOError: [Errno 2] No such file or directory: 'sql/queryfile.sql'
I understand that I could simply copy and paste the query into the same file, but that’s really not a neat solution. There are multiple queries and the text is really big; embedding it in the Python code would compromise readability.
Answers:
All relative paths are taken in reference to the AIRFLOW_HOME environment variable. Try:
- Giving an absolute path
- Placing the file relative to AIRFLOW_HOME
- Logging the PWD in the Python callable and then deciding what path to give (best option; see the sketch below)
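As a minimal sketch of that last suggestion (assuming the sql folder sits next to the DAG file, and keeping the hypothetical execute helper from the question), the callable can log its working directory and build an absolute path instead of relying on a relative one:

import os

def run_query():
    # log where the task process is actually running from
    print(f"cwd: {os.getcwd()}")
    # build an absolute path anchored at this DAG file's own location
    sql_file = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'sql', 'queryfile.sql')
    with open(sql_file) as f:
        query = f.read()
    execute(query)  # hypothetical execution helper from the question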
Here is an example using a Variable to make it easy.
- First add a Variable in the Airflow UI: Admin -> Variables, e.g. {key: 'sql_path', value: 'your_sql_script_folder'}
- Then add the following code to your DAG, to use the Variable from Airflow you just added.
DAG code:
import airflow
from airflow.models import Variable
from datetime import datetime

tmpl_search_path = Variable.get("sql_path")

# minimal default_args; the original snippet assumes these are defined elsewhere
default_args = {'start_date': datetime(2017, 1, 1)}

dag = airflow.DAG(
    'tutorial',
    schedule_interval="@daily",
    template_searchpath=tmpl_search_path,  # this makes the SQL folder visible to templated fields
    default_args=default_args
)
- Now you can reference the SQL script by name, or by a path under the folder the Variable points to, as sketched below.
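For example, a hedged sketch of a task that relies on template_searchpath (assuming a Postgres connection id of 'my_postgres_conn' and the classic PostgresOperator import; adapt to whatever operator you actually use):

from airflow.operators.postgres_operator import PostgresOperator

run_query = PostgresOperator(
    task_id='run_query',
    postgres_conn_id='my_postgres_conn',  # hypothetical connection id
    sql='queryfile.sql',  # resolved against template_searchpath, i.e. your_sql_script_folder
    dag=dag,
)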
Assuming that the sql directory is relative to the current Python file, you can figure out the absolute path to the SQL file like this:
import os

CUR_DIR = os.path.abspath(os.path.dirname(__file__))

def run_query():
    # read the query from a path anchored at this file's directory
    with open(f"{CUR_DIR}/sql/queryfile.sql") as f:
        query = f.read()
    # run the query
    execute(query)
You can get the DAG directory as shown below.
import os
from airflow.configuration import conf

conf.get('core', 'DAGS_FOLDER')
# open file
open(os.path.join(conf.get('core', 'DAGS_FOLDER'), 'something.json'), 'r')
ref: https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#dags-folder
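Tying this back to the question, a minimal sketch of the callable built on that config lookup (keeping the hypothetical execute helper and assuming the SQL lives under <DAGS_FOLDER>/sql/):

import os
from airflow.configuration import conf

def run_query():
    # resolve the SQL file relative to the configured DAGs folder
    sql_path = os.path.join(conf.get('core', 'DAGS_FOLDER'), 'sql', 'queryfile.sql')
    with open(sql_path) as f:
        query = f.read()
    execute(query)  # hypothetical execution helper from the question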