Airflow: run a Python script connected via gcsfuse using the PythonOperator

Question:

I want to run a Python script that is stored in this GCP directory:

 /home/airflow/gcsfuse/dags/external/projectXYZ/test.py

I previously used the BashOperator to execute the script, which works in principle, but I am getting errors from some functions in some Python libraries. Therefore I want to test whether the PythonOperator works.
For the BashOperator I used the following code snippet:

run_python = BashOperator(
        task_id='run_python',
        bash_command='python /home/airflow/gcsfuse/dags/external/projectXYZ/test.py'
    )

For the PythonOperator I saw some posts that import a function from a Python script. However, I don't know how to get Airflow to recognize such an import. The only way I can exchange files between GCP and Airflow is through the gcsfuse/dags/external folder. How can I execute the file from this path instead of calling a function in the PythonOperator?

Asked By: Daniel


Answers:

So after some research and testing I came to the conclusion that it is not possible to execute a Python file located in a GCP storage bucket with the PythonOperator. If there is a Python file in a GCP storage bucket that is connected to Airflow via gcsfuse, you need to use the BashOperator.
If you want to use the PythonOperator, you either have to write your Python code inside your DAG and call a function with the PythonOperator, or you import a function from a Python file that is already stored on the Airflow storage itself and then call that function with the PythonOperator.
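
For illustration, here is a minimal sketch of the first option, with the callable defined in the same file as the DAG (the DAG id, dates, and function below are hypothetical, not taken from the question):

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def my_task():
    # Logic that would otherwise live in test.py goes here.
    print("running inside the PythonOperator")


with DAG('inline_callable_example',
         start_date=datetime(2021, 1, 1),
         schedule_interval=None,
         catchup=False) as dag:

    run_python = PythonOperator(
        task_id='run_python',
        python_callable=my_task
    )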

Feel free to correct me if I am mistaken

Answered By: Daniel

You can actually call a Python script from Composer's standard GCS bucket using the PythonOperator or any of its variants. All you have to do is create a structure in your GCS DAGs folder that fits your requirements and, from there, call the external Python script from the main DAG. The solution was posted before in other threads, but I'm including an example to complement it.

Here is my DAG folder:

(screenshot of the DAG folder structure in the Composer bucket)
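
Judging from the import in externaldag.py below, the layout is roughly the following (the exact tree is inferred, since the screenshot is not reproduced here):

dags/
├── externaldag.py
└── external/
    └── scripts/
        └── externalscript.py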

And here’s how I define both DAGs:

externaldag.py:

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.state import State  # used by dag.clear() below
from datetime import datetime, timedelta
from external.scripts.externalscript import calculateharversine


default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime.now(),
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}


dag = DAG('externaldag_3',
          default_args=default_args,
          schedule_interval=None,
          catchup=False)


with dag:
    print_haversine_task = PythonOperator(
        task_id='print-haversine-distance',
        python_callable=calculateharversine
    )


if __name__ == "__main__":
    dag.clear(dag_run_state=State.NONE)
    dag.run()

externalscript.py:

def calculateharversine():
    from mypythonlib import myfunctions
    x1, y1, x2, y2 = 1, 2, 3, 4
    haversine_distance = myfunctions.haversine(x1, y1, x2, y2)
    print(f'The haversine distance between ({x1}, {y1}) and ({x2}, {y2}) is {haversine_distance}')

I hope it helps.

Answered By: Gilberto Gutiérrez