Airflow ExternalTaskSensor gets stuck

Question:

I’m trying to use ExternalTaskSensor and it gets stuck at poking another DAG’s task, which has already been successfully completed.

Here, the first DAG “a” completes its task, and after that the second DAG “b” is supposed to be triggered via an ExternalTaskSensor. Instead, the sensor gets stuck poking for a.first_task.

First DAG:

import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id='a',
    default_args={'owner': 'airflow', 'start_date': datetime.datetime.now()},
    schedule_interval=None
)

def do_first_task():
    print('First task is done')

PythonOperator(
    task_id='first_task',
    python_callable=do_first_task,
    dag=dag)

Second DAG:

import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.sensors import ExternalTaskSensor

dag = DAG(
    dag_id='b',
    default_args={'owner': 'airflow', 'start_date': datetime.datetime.now()},
    schedule_interval=None
)

def do_second_task():
    print('Second task is done')

ExternalTaskSensor(
    task_id='wait_for_the_first_task_to_be_completed',
    external_dag_id='a',
    external_task_id='first_task',
    dag=dag) >> \
PythonOperator(
    task_id='second_task',
    python_callable=do_second_task,
    dag=dag)

What am I missing here?

Asked By: Aleksei Solovev


Answers:

ExternalTaskSensor assumes that you are dependent on a task in a dag run with the same execution date.

This means that, in your case, DAGs a and b need to run on the same schedule (e.g. every day at 9:00am).

Otherwise you need to pass execution_delta or execution_date_fn when you instantiate the ExternalTaskSensor.

Here is the documentation inside the operator itself to help clarify further:

:param execution_delta: time difference with the previous execution to
    look at, the default is the same execution_date as the current task.
    For yesterday, use [positive!] datetime.timedelta(days=1). Either
    execution_delta or execution_date_fn can be passed to
    ExternalTaskSensor, but not both.

:type execution_delta: datetime.timedelta


:param execution_date_fn: function that receives the current execution date
    and returns the desired execution date to query. Either execution_delta
    or execution_date_fn can be passed to ExternalTaskSensor, but not both.

:type execution_date_fn: callable
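
For example, a minimal sketch of the execution_delta option (the 30-minute offset is a made-up assumption for illustration, and dag refers to DAG b's DAG object from the question):

# Hypothetical schedules: dag "a" runs at the top of every hour and dag "b" at half
# past, so b's execution_date minus 30 minutes equals a's execution_date.
import datetime
from airflow.operators.sensors import ExternalTaskSensor

wait_for_a = ExternalTaskSensor(
    task_id='wait_for_the_first_task_to_be_completed',
    external_dag_id='a',
    external_task_id='first_task',
    execution_delta=datetime.timedelta(minutes=30),
    dag=dag)
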
Answered By: jhnclvr

To clarify something I’ve seen here and on other related questions, the dags don’t necessarily have to run on the same schedule, as stated in the accepted answer. The dags also don’t need to have the same start_date. If you create your ExternalTaskSensor task without the execution_delta or execution_date_fn, then the two dags need to have the same execution date. It so happens that if two dags have the same schedule, the scheduled runs in each interval will have the same execution date. I’m not sure what the execution date would be for manually triggered runs of scheduled dags.

For this example to work, DAG b's ExternalTaskSensor task needs an execution_delta or execution_date_fn parameter. If using execution_delta, it should satisfy b's execution_date - execution_delta = a's execution_date. If using execution_date_fn, that function should return a's execution date.
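
A hedged sketch of the execution_date_fn variant (the one-hour offset between a and b is an assumption made up for illustration; dag is DAG b's DAG object):

from datetime import timedelta
from airflow.operators.sensors import ExternalTaskSensor

def a_execution_date(execution_date):
    # Map b's execution date back to a's, assuming a runs exactly one hour before b.
    return execution_date - timedelta(hours=1)

wait_for_a = ExternalTaskSensor(
    task_id='wait_for_the_first_task_to_be_completed',
    external_dag_id='a',
    external_task_id='first_task',
    execution_date_fn=a_execution_date,
    dag=dag)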

If you were using the TriggerDagRunOperator to kick off the other DAG and then an ExternalTaskSensor to detect when that DAG completed, you could pass the main DAG's execution date to the triggered one with the TriggerDagRunOperator's execution_date parameter, e.g. execution_date='{{ execution_date }}'. Then the execution dates of both DAGs would be the same, and you wouldn't need the schedules to match or the execution_delta/execution_date_fn sensor parameters.
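
A rough sketch of that pattern, following the description above (the task ids are placeholders, and it assumes the trigger and the sensor both live in the main DAG):

from airflow.operators.dagrun_operator import TriggerDagRunOperator
from airflow.operators.sensors import ExternalTaskSensor

# In the main DAG: trigger dag "b" with the same execution date, then wait for it.
trigger_b = TriggerDagRunOperator(
    task_id='trigger_b',
    trigger_dag_id='b',
    execution_date='{{ execution_date }}',
    dag=dag)

wait_for_b = ExternalTaskSensor(
    task_id='wait_for_second_task',
    external_dag_id='b',
    external_task_id='second_task',
    dag=dag)

trigger_b >> wait_for_b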

The above was written and tested on Airflow 1.10.9

Answered By: tomcm

As of Airflow v1.10.7, tomcm’s answer is not true (at least for that version). One should use execution_delta or execution_date_fn to determine the date AND schedule of the external DAG if the two DAGs do not have the same schedule.

Answered By: Von Yu

From my successful case:

from datetime import timedelta

import pendulum
from airflow import DAG
from airflow.operators.sensors import ExternalTaskSensor

default_args = {
    'owner': 'xx',
    'retries': 2,
    'email': ALERT_EMAIL_ADDRESSES,
    'email_on_failure': True,
    'email_on_retry': False,
    'retry_delay': timedelta(seconds=30),
    # avoid stopping tasks after one day
    'depends_on_past': False,
}

dag = DAG(
    dag_id = dag_id,
    # get the datetime type value
    start_date = pendulum.strptime(current_date, "%Y, %m, %d, %H").astimezone('Europe/London').subtract(hours=1),
    description = 'xxx',
    default_args = default_args,
    schedule_interval = timedelta(hours=1),
    )
...
    external_sensor= ExternalTaskSensor(
            task_id='ext_sensor_task_update_model',
            external_dag_id='xxx',
            external_task_id='xxx'.format(log_type),
            # set the task_id to None because of the end_task
            # external_task_id = None,
            dag=dag,
            timeout = 300,
            )
...

Wait for the runs to be triggered automatically on schedule. Don’t trigger them manually, because the execution dates will then be different and the sensor won’t match.

Answered By: Newt

By default, Airflow looks for the same execution date (timestamp). If we use the execution_date_fn parameter, we can return a list of timestamps to look for. Internally, the sensor queries Airflow’s task_instance table for the dag_id, task_id, state and execution date timestamps provided as arguments. So if we use a None schedule, the DAG has to be triggered manually, and in that case the timestamp can be any value.
I have explained it in detail here:
https://link.medium.com/QzXm21asokb
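
As a hedged sketch of that idea (the helper name is made up, and it assumes a sensor version that accepts a list from execution_date_fn), you could hand the sensor the execution date of the external DAG's most recent run:

from airflow.models import DagRun

def latest_a_run_dates(execution_date):
    # With a None schedule, dag "a" is triggered manually, so look up the
    # execution date of its most recent run and return it in a list; fall back
    # to the sensor's own execution date if "a" has never run.
    dag_runs = DagRun.find(dag_id='a')
    dag_runs.sort(key=lambda r: r.execution_date, reverse=True)
    return [dag_runs[0].execution_date] if dag_runs else [execution_date]

Note that the sensor only succeeds once the external task is in an allowed state for every date in the returned list.
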

I have created a new sensor inheriting the ExternalTaskSensor and it can be used to monitor dags with None schedule. You can find the code at the below repo.
https://github.com/Deepaksai1919/AirflowTaskSensor

Answered By: Deepak Sai

I ran into this as well, but in my case both DAGs were using the same schedule_interval, so none of the above suggestions helped.

It turned out to be an Airflow bug: templates in the external_task_id/external_task_ids fields are currently broken in v2.2.4: https://github.com/apache/airflow/issues/22782

Answered By: 0x5453

I had the same problem & used the execution_date_fn parameter:

ExternalTaskSensor(
    task_id="sensor",
    external_dag_id="dag_id",
    execution_date_fn=get_most_recent_dag_run,
    mode="reschedule",
)

where the get_most_recent_dag_run function looks like this:

from airflow.models import DagRun

def get_most_recent_dag_run(dt):
    dag_runs = DagRun.find(dag_id="dag_id")
    dag_runs.sort(key=lambda x: x.execution_date, reverse=True)
    if dag_runs:
        return dag_runs[0].execution_date

The ExternalTaskSensor needs to know both the dag_id and the exact execution date of the last run to resolve the cross-DAG dependency.

Answered By: Nahid O.