Getting error in Airflow DAG: unsupported operand type(s) for >>: 'list' and 'list' (sequential and parallel execution of tasks)
Question:
I am new to Apache Airflow and DAGs. There are six tasks in total in the DAG (task_1, task_2, task_3, task_4, task_5, task_6), but when we run the DAG we get the error below.
unsupported operand type(s) for >>: 'list' and 'list'
Below is my code for the DAG. Please help.
from airflow import DAG
from datetime import datetime
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False
}

dag = DAG('DAG_FOR_TEST', default_args=default_args, schedule_interval=None,
          max_active_runs=3, start_date=datetime(2020, 7, 14))

#################### CREATE TASK #####################################
task_1 = DatabricksSubmitRunOperator(
    task_id='task_1',
    databricks_conn_id='connection_id_details',
    existing_cluster_id='{{ dag_run.conf.clusterId }}',
    libraries=[{'jar': 'dbfs:/task_1/task_1.jar'}],
    spark_jar_task={
        'main_class_name': 'com.task_1.driver.TestClass1',
        'parameters': ['{{ dag_run.conf.json }}']
    }
)
task_2 = DatabricksSubmitRunOperator(
    task_id='task_2',
    databricks_conn_id='connection_id_details',
    existing_cluster_id='{{ dag_run.conf.clusterId }}',
    libraries=[{'jar': 'dbfs:/task_2/task_2.jar'}],
    spark_jar_task={
        'main_class_name': 'com.task_2.driver.TestClass2',
        'parameters': ['{{ dag_run.conf.json }}']
    }
)
task_3 = DatabricksSubmitRunOperator(
    task_id='task_3',
    databricks_conn_id='connection_id_details',
    existing_cluster_id='{{ dag_run.conf.clusterId }}',
    libraries=[{'jar': 'dbfs:/task_3/task_3.jar'}],
    spark_jar_task={
        'main_class_name': 'com.task_3.driver.TestClass3',
        'parameters': ['{{ dag_run.conf.json }}']
    }
)
task_4 = DatabricksSubmitRunOperator(
    task_id='task_4',
    databricks_conn_id='connection_id_details',
    existing_cluster_id='{{ dag_run.conf.clusterId }}',
    libraries=[{'jar': 'dbfs:/task_4/task_4.jar'}],
    spark_jar_task={
        'main_class_name': 'com.task_4.driver.TestClass4',
        'parameters': ['{{ dag_run.conf.json }}']
    }
)
task_5 = DatabricksSubmitRunOperator(
    task_id='task_5',
    databricks_conn_id='connection_id_details',
    existing_cluster_id='{{ dag_run.conf.clusterId }}',
    libraries=[{'jar': 'dbfs:/task_5/task_5.jar'}],
    spark_jar_task={
        'main_class_name': 'com.task_5.driver.TestClass5',
        'parameters': ['json ={{ dag_run.conf.json }}']
    }
)
task_6 = DatabricksSubmitRunOperator(
    task_id='task_6',
    databricks_conn_id='connection_id_details',
    existing_cluster_id='{{ dag_run.conf.clusterId }}',
    libraries=[{'jar': 'dbfs:/task_6/task_6.jar'}],
    spark_jar_task={
        'main_class_name': 'com.task_6.driver.TestClass6',
        'parameters': ['{{ dag_run.conf.json }}']
    }
)

#################### ORDER OF OPERATORS ###########################
task_1.dag = dag
task_2.dag = dag
task_3.dag = dag
task_4.dag = dag
task_5.dag = dag
task_6.dag = dag

task_1 >> [task_2, task_3] >> [task_4, task_5] >> task_6
Answers:
Airflow task dependencies can't handle list >> list. The easiest way around this is to specify your dependencies over multiple lines:
task_1 >> [task_2, task_3]
task_2 >> [task_4, task_5]
task_3 >> [task_4, task_5]
[task_4, task_5] >> task_6
What is your desired task dependency? Do you want to run task_4 after task_2 only, or after both task_2 and task_3? Based on that answer, use one of the following.
Use this if task_4 should run after both task_2 and task_3 are completed:
task_1 >> [task_2, task_3]
task_2 >> [task_4, task_5] >> task_6
task_3 >> [task_4, task_5]
Or use this if task_4 should run after task_2 is completed and task_5 should run after task_3 is completed:
task_1 >> [task_2, task_3]
task_2 >> task_4
task_3 >> task_5
[task_4, task_5] >> task_6
A tip: you don't need to do the following:
task_1.dag = dag
task_2.dag = dag
task_3.dag = dag
task_4.dag = dag
task_5.dag = dag
task_6.dag = dag
You can pass the dag parameter to the task itself, for example:
task_6 = DatabricksSubmitRunOperator(
    task_id='task_6',
    dag=dag,
    databricks_conn_id='connection_id_details',
    existing_cluster_id='{{ dag_run.conf.clusterId }}',
    libraries=[{'jar': 'dbfs:/task_6/task_6.jar'}],
    spark_jar_task={
        'main_class_name': 'com.task_6.driver.TestClass6',
        'parameters': ['{{ dag_run.conf.json }}']
    }
)
Or use the DAG as a context manager, as documented in https://airflow.apache.org/docs/stable/concepts.html#context-manager and in point (1) of https://medium.com/datareply/airflow-lesser-known-tips-tricks-and-best-practises-cf4d4a90f8f
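For instance, a minimal sketch of the context-manager style, using DummyOperator as a stand-in for the Databricks tasks (the import path is the Airflow 2 location; in 1.x it is airflow.operators.dummy_operator):

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    'DAG_FOR_TEST',
    default_args={'owner': 'airflow', 'depends_on_past': False},
    schedule_interval=None,
    max_active_runs=3,
    start_date=datetime(2020, 7, 14),
) as dag:
    # Tasks created inside the block attach to the DAG automatically,
    # so neither dag=dag nor task_n.dag = dag is needed.
    task_1 = DummyOperator(task_id='task_1')
    task_2 = DummyOperator(task_id='task_2')
    task_3 = DummyOperator(task_id='task_3')
    task_1 >> [task_2, task_3]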
Same problem here, running version 1.10.14+composer on GCP (Cloud Composer).
Code:
i = 0
while i < len(LIST_OF_OPS):  # LIST_OF_OPS: operators I have aggregated beforehand
    LIST_OF_OPS[i] >> LIST_OF_OPS[i + 1:i + 8]  # one task queues up the next seven
    # Advance by one less than the number of tasks just started, so the batches
    # overlap and form a chain instead of everything running in parallel.
    i += 7
This solution allows me to dynamically generate the tasks and iterate over them. It is currently in use for a workflow with 280 tasks.
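For reference, LIST_OF_OPS could be built up front like this; a hypothetical sketch with DummyOperator stand-ins for the real operators (assumes the dag object already exists):

from airflow.operators.dummy import DummyOperator  # airflow.operators.dummy_operator in Airflow 1.x

# 280 placeholder tasks; in practice these would be the real operators.
LIST_OF_OPS = [DummyOperator(task_id=f'op_{n}', dag=dag) for n in range(280)]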
A TaskGroup may be useful for grouping some of the tasks and adding dynamic dependencies between task groups, as in the sketch below.
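A minimal sketch of that idea (TaskGroup is available from Airflow 2.0; the task and group names here are illustrative):

from airflow.operators.dummy import DummyOperator
from airflow.utils.task_group import TaskGroup

with dag:  # the DAG object from the question
    task_1 = DummyOperator(task_id='task_1')
    task_6 = DummyOperator(task_id='task_6')
    with TaskGroup(group_id='fan_out') as fan_out:
        task_2 = DummyOperator(task_id='task_2')  # becomes fan_out.task_2
        task_3 = DummyOperator(task_id='task_3')
    with TaskGroup(group_id='fan_in') as fan_in:
        task_4 = DummyOperator(task_id='task_4')
        task_5 = DummyOperator(task_id='task_5')
    # Group-level edges wire the leaves of one group to the roots of the next,
    # so no list >> list expression is needed.
    task_1 >> fan_out >> fan_in >> task_6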
My solution was to first encapsulate all the leading tasks with a DummyOperator task, then use the same DummyOperator task to initiate the next list of tasks.
from airflow.operators.dummy import DummyOperator  # airflow.operators.dummy_operator in Airflow 1.x

with dag:
    task_1 = DummyOperator(task_id='real_operation_1')
    task_2 = DummyOperator(task_id='real_operation_2')
    task_3 = DummyOperator(task_id='dummy_operation')
    task_4 = DummyOperator(task_id='real_operation_3')
    task_5 = DummyOperator(task_id='real_operation_4')

    [task_1, task_2] >> task_3 >> [task_4, task_5]
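This works because every edge is then either a single task to a list or a list to a single task, both of which Airflow's >> operator supports; only list >> list fails, since neither operand is an operator.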