How to get the Airflow Docker ExternalPythonOperator working with a Python venv?

Question:

Situation

  • Apache Airflow 2.4.0 was released on 19 Sept 2022.
  • Since that release, Airflow supports the ExternalPythonOperator.
  • I have asked the main contributors as well, and I should be able to add 2 Python virtual environments to the base Airflow Docker 2.4.1 image and run single tasks inside a DAG with them.

Goal

  • My goal is to use multiple pre-built Python virtualenvs inside the image, each created from a local requirements.txt.
  • Use the ExternalPythonOperator to run tasks in them.
  • Each of my DAGs just executes a single scheduled Python function.

I would like to request

  • Example files showing how to create separate, persistently existing Python virtual environments on top of the base Airflow 2.4.1 Docker image, via the:
    • docker-compose.yml # best option, so I only need to run docker-compose against the official image
    • Dockerfile # second-best option, because I then still have to docker-compose the extended image together with my changes to the docker-compose.yml file

System

  • A working Airflow 2.4.1 Docker image
  • Ubuntu 20.04 LTS

Knowledge gaps

I don’t want this

  • PythonVirtualenvOperator creating those venvs dynamically. (I have performed this successfully, but my DAGs are either too lightweight or have too many imports, so rebuilding the venv on every run is not ideal; a rough sketch of that pattern is below.)
  • I have 1 Python function per DAG, so this limitation is fine for me -> "Note that the virtualenvs are per task, not per DAG. You cannot (for now) parse your DAGs and execute whole DAGs in a different virtualenv – you can only execute individual Python* tasks in them. A separate runtime environment for "whole DAGs" will likely be implemented in 2.4 or 2.6 as a result of https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-46+Runtime+isolation+for+airflow+tasks+and+dag+parsing"
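
For reference, this is roughly the dynamic-venv pattern I already have working with the @task.virtualenv decorator (dag_id, task_id, requirements and the function body here are placeholders I made up for this sketch). It rebuilds the venv on every run, which is exactly the overhead I want to avoid:

from __future__ import annotations

import pendulum

from airflow import DAG
from airflow.decorators import task

with DAG(
    dag_id="dynamic_venv_example",
    schedule=None,
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
) as dag:

    # The virtualenv is created from scratch every time this task runs.
    @task.virtualenv(
        task_id="dynamic_venv_task",
        requirements=["pandas==1.3.0", "numpy==1.20.3"],
        system_site_packages=False,
    )
    def dynamic_venv_task():
        # Import inside the function so it is resolved by the freshly built venv.
        import pandas as pd

        return pd.DataFrame({"col1": [1, 2]}).shape[0]

    dynamic_venv_task()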

Terminal Commands

docker build -t my-image-apache/airflow:2.4.1 .

I would run the following command afterwards, but the first step (the build) already fails:

docker-compose up

My Files

docker-compose.yml

https://airflow.apache.org/docs/apache-airflow/2.4.1/docker-compose.yaml

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
#

# Basic Airflow cluster configuration for CeleryExecutor with Redis and PostgreSQL.
#
# WARNING: This configuration is for local development. Do not use it in a production deployment.
#
# This configuration supports basic configuration using environment variables or an .env file
# The following variables are supported:
#
# AIRFLOW_IMAGE_NAME           - Docker image name used to run Airflow.
#                                Default: apache/airflow:2.4.1
# AIRFLOW_UID                  - User ID in Airflow containers
#                                Default: 50000
# Those configurations are useful mostly in case of standalone testing/running Airflow in test/try-out mode
#
# _AIRFLOW_WWW_USER_USERNAME   - Username for the administrator account (if requested).
#                                Default: airflow
# _AIRFLOW_WWW_USER_PASSWORD   - Password for the administrator account (if requested).
#                                Default: airflow
# _PIP_ADDITIONAL_REQUIREMENTS - Additional PIP requirements to add when starting all containers.
#                                Default: ''
#
# Feel free to modify this file to suit your needs.
---
version: '3'
x-airflow-common:
  &airflow-common
  # In order to add custom dependencies or upgrade provider packages you can use your extended image.
  # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml
  # and uncomment the "build" line below, Then run `docker-compose build` to build the images.
  image: ${AIRFLOW_IMAGE_NAME:-my-image-apache/airflow:2.4.1}
  # build: .
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    # For backward compatibility, with Airflow <2.3
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'true'
    AIRFLOW__API__AUTH_BACKENDS: 'airflow.api.auth.backend.basic_auth'
    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
  user: "${AIRFLOW_UID:-50000}:0"
  depends_on:
    &airflow-common-depends-on
    redis:
      condition: service_healthy
    postgres:
      condition: service_healthy

services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5
    restart: always

  redis:
    image: redis:latest
    expose:
      - 6379
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 30s
      retries: 50
    restart: always

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - 8080:8080
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    healthcheck:
      test: ["CMD-SHELL", 'airflow jobs check --job-type SchedulerJob --hostname "$${HOSTNAME}"']
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-worker:
    <<: *airflow-common
    command: celery worker
    healthcheck:
      test:
        - "CMD-SHELL"
        - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
      interval: 10s
      timeout: 10s
      retries: 5
    environment:
      <<: *airflow-common-env
      # Required to handle warm shutdown of the celery workers properly
      # See https://airflow.apache.org/docs/docker-stack/entrypoint.html#signal-propagation
      DUMB_INIT_SETSID: "0"
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-triggerer:
    <<: *airflow-common
    command: triggerer
    healthcheck:
      test: ["CMD-SHELL", 'airflow jobs check --job-type TriggererJob --hostname "$${HOSTNAME}"']
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-init:
    <<: *airflow-common
    entrypoint: /bin/bash
    # yamllint disable rule:line-length
    command:
      - -c
      - |
        function ver() {
          printf "%04d%04d%04d%04d" $${1//./ }
        }
        airflow_version=$$(AIRFLOW__LOGGING__LOGGING_LEVEL=INFO && gosu airflow airflow version)
        airflow_version_comparable=$$(ver $${airflow_version})
        min_airflow_version=2.2.0
        min_airflow_version_comparable=$$(ver $${min_airflow_version})
        if (( airflow_version_comparable < min_airflow_version_comparable )); then
          echo
          echo -e "33[1;31mERROR!!!: Too old Airflow version $${airflow_version}!e[0m"
          echo "The minimum Airflow version supported: $${min_airflow_version}. Only use this or higher!"
          echo
          exit 1
        fi
        if [[ -z "${AIRFLOW_UID}" ]]; then
          echo
          echo -e "33[1;33mWARNING!!!: AIRFLOW_UID not set!e[0m"
          echo "If you are on Linux, you SHOULD follow the instructions below to set "
          echo "AIRFLOW_UID environment variable, otherwise files will be owned by root."
          echo "For other operating systems you can get rid of the warning with manually created .env file:"
          echo "    See: https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#setting-the-right-airflow-user"
          echo
        fi
        one_meg=1048576
        mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg))
        cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat)
        disk_available=$$(df / | tail -1 | awk '{print $$4}')
        warning_resources="false"
        if (( mem_available < 4000 )) ; then
          echo
          echo -e "33[1;33mWARNING!!!: Not enough memory available for Docker.e[0m"
          echo "At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))"
          echo
          warning_resources="true"
        fi
        if (( cpus_available < 2 )); then
          echo
          echo -e "33[1;33mWARNING!!!: Not enough CPUS available for Docker.e[0m"
          echo "At least 2 CPUs recommended. You have $${cpus_available}"
          echo
          warning_resources="true"
        fi
        if (( disk_available < one_meg * 10 )); then
          echo
          echo -e "33[1;33mWARNING!!!: Not enough Disk space available for Docker.e[0m"
          echo "At least 10 GBs recommended. You have $$(numfmt --to iec $$((disk_available * 1024 )))"
          echo
          warning_resources="true"
        fi
        if [[ $${warning_resources} == "true" ]]; then
          echo
          echo -e "33[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!e[0m"
          echo "Please follow the instructions to increase amount of resources available:"
          echo "   https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#before-you-begin"
          echo
        fi
        mkdir -p /sources/logs /sources/dags /sources/plugins
        chown -R "${AIRFLOW_UID}:0" /sources/{logs,dags,plugins}
        exec /entrypoint airflow version
    # yamllint enable rule:line-length
    environment:
      <<: *airflow-common-env
      _AIRFLOW_DB_UPGRADE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
      _PIP_ADDITIONAL_REQUIREMENTS: ''
    user: "0:0"
    volumes:
      - .:/sources

  airflow-cli:
    <<: *airflow-common
    profiles:
      - debug
    environment:
      <<: *airflow-common-env
      CONNECTION_CHECK_MAX_COUNT: "0"
    # Workaround for entrypoint issue. See: https://github.com/apache/airflow/issues/16252
    command:
      - bash
      - -c
      - airflow

  # You can enable flower by adding "--profile flower" option e.g. docker-compose --profile flower up
  # or by explicitly targeted on the command line e.g. docker-compose up flower.
  # See: https://docs.docker.com/compose/profiles/
  flower:
    <<: *airflow-common
    command: celery flower
    profiles:
      - flower
    ports:
      - 5555:5555
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:5555/"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

volumes:
  postgres-db-volume:

Dockerfile (all the mess that I have tried)

FROM apache/airflow:2.4.1-python3.8

# https://pythonspeed.com/articles/activate-virtualenv-dockerfile/
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# Install dependencies:
COPY requirements.txt .
RUN pip install -r requirements.txt

# Run the application:
# COPY myapp.py .
# CMD ["python", "myapp.py"]


# RUN python3 -m venv /path/to/new/virtual/environment_1 && \
#     /path/to/new/virtual/environment_1/bin/python \
#     -m pip install requirements.txt
# RUN python3 -m venv /path/to/new/virtual/environment_2 && \
#     /path/to/new/virtual/environment_2/bin/python \
#     -m pip install my_requirements_2.txt

ERRORs

I have had Python venvs in Docker before; example Dockerfile:

FROM python:3.9-slim-bullseye

RUN python3 -m venv /opt/venv

# Install dependencies:
COPY requirements.txt .
RUN . /opt/venv/bin/activate && pip install -r requirements.txt

# Run the application:
COPY myapp.py .
CMD . /opt/venv/bin/activate && exec python myapp.py

Dockerfile: but with Airflow it just doesn't work:

FROM apache/airflow:2.4.1-python3.8
COPY requirements.txt .
RUN python3 -m venv /opt/airflow/virtual_1 && \
    /opt/airflow/virtual_1/bin/python \
    -m pip install requirements.txt

ERROR

 => ERROR [stage-1 2/2] RUN python3 -m venv /opt/airflow/virtual_1 && /opt/airflow/virtual_1/bin/python -m pip install requirements.txt 

Other things I have tried

1.)

FROM apache/airflow:2.4.1-python3.8
RUN python3 -m venv /opt/airflow
# Install dependencies:
COPY requirements.txt .
RUN /opt/airflow/venv/bin/pip install -r requirements.txt

command – docker build -t my-image-apache/airflow:2.4.1 .

error
=> ERROR [4/4] RUN /opt/airflow/venv/bin/pip install -r requirements.txt

2.)

FROM apache/airflow:2.4.1-python3.8
COPY requirements.txt .
RUN python3 -m venv && 
    /venv/bin/python install -m pip requirements.txt

error
=> ERROR [3/3] RUN python3 -m venv && /venv/bin/python install -m pip requirements.txt

Asked By: sogu


Answers:

Simpler Alternatives to Airflow

I would rather not recommend Airflow if you are not too invested in it yet; there are easier-to-use alternatives:

  1. Mage ai – https://github.com/mage-ai/mage-ai
  2. jupyter scheduler – https://www.google.com/search?client=firefox-b-d&q=jupyter+schduler
  3. !! PAID – https://docs.qubole.com/en/latest/user-guide/notebooks-and-dashboards/notebooks/jupyter-notebooks/scheduling-jupy-notebooks.html
  4. jupyterlab-scheduler 0.1.5 – https://pypi.org/project/jupyterlab-scheduler/
  5. notebooker 0.4.4 – https://pypi.org/project/notebooker/
  6. papermill – https://pypi.org/project/papermill/ (small usage sketch below)
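
For example, papermill can execute a parameterized notebook from a couple of lines of Python (the notebook paths and the parameter name below are only placeholders):

import papermill as pm

# Runs input.ipynb and writes the executed notebook (with outputs) to output.ipynb
pm.execute_notebook(
    "input.ipynb",
    "output.ipynb",
    parameters={"run_date": "2022-10-01"},
)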

How to do it with Airflow

1.) Original Dockerfile

[JUST TEXT, CHANGEABLE] – this is what the original image that you can pull is built from: https://hub.docker.com/r/apache/airflow/dockerfile

2.) Original image

[COMPILED, CHANGEABLE] – the image that is created from the original Dockerfile: https://hub.docker.com/layers/apache/airflow/latest/images/sha256-5015db92023bebb1e8518767bfa2e465b2f52270aca6a9cdef85d5d3e216d015?context=explore

3.) My requirements.txt

requirements.txt – it does not have to have Airflow installed in it.

pandas==1.3.0
numpy==1.20.3

4.) My Dockerfile

This pulls the original image and extends it:

FROM apache/airflow:2.4.1-python3.8

# Compulsory: switch this off so pip installs into the venv instead of forcing --user installs
ENV PIP_USER=false

# Python venv setup
RUN python3 -m venv /opt/airflow/venv1

# Install dependencies:
COPY requirements.txt .

# --user   <--- WRONG here; this flag is exactly what ENV PIP_USER=false disables
# RUN /opt/airflow/venv1/bin/pip install --user -r requirements.txt  <--- this is all wrong
RUN /opt/airflow/venv1/bin/pip install -r requirements.txt
RUN /opt/airflow/venv1/bin/pip install 'apache-airflow==2.4.1' --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.4.1/constraints-3.8.txt"

ENV PIP_USER=true
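
To sanity-check the extended image, you can run a small throwaway script with the venv interpreter (the file name check_venv.py is just an example I picked; nothing in the image expects it):

# check_venv.py – run it with /opt/airflow/venv1/bin/python check_venv.py,
# e.g. by docker exec-ing into a running container.
import sys

import numpy as np
import pandas as pd

# If these imports succeed, the venv was built from requirements.txt as intended.
print("interpreter:", sys.executable)
print("pandas:", pd.__version__)
print("numpy:", np.__version__)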

5.) Terminal Command

(be in the same directory as the file, which must be called "Dockerfile")

docker build -t my-image-apache/airflow:2.4.1 .

6.) Folders and .env file

mkdir -p ./dags ./logs ./plugins
echo -e "AIRFLOW_UID=$(id -u)" > .env
  • if you have generated an ".env" file, you can also set a new username and password there for the main user.

7.) Example test DAG

  • !!! dag_id and task_id have to be unique !!!
  • !! DAG files are automatically picked up by the running webserver if you drop them into the local dags folder; it just takes 5-10 minutes. If you modify an existing one, it is refreshed in the webserver almost immediately.
  • ! python=os.fspath(sys.executable) has to be replaced with '/opt/airflow/venv1/bin/python3', i.e. it has to point to an executable Python binary inside the Python virtual environment.
"""
Example DAG demonstrating the usage of the TaskFlow API to execute Python functions natively and within a
virtual environment.
"""
from __future__ import annotations

import logging
import os
import shutil
import sys
import tempfile
import time
from pprint import pprint

import pendulum

from airflow import DAG
from airflow.decorators import task

log = logging.getLogger(__name__)

PYTHON = sys.executable

BASE_DIR = tempfile.gettempdir()

with DAG(
    dag_id='test_external_python_venv_dag2',
    schedule=None,
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    tags=['my_test'],
) as dag:
    #@task.external_python(task_id="test_external_python_venv_task", python=os.fspath(sys.executable))
    # /opt/airflow/venv1/bin/python3  <-- has to point to an executable Python binary in the Python virtual environment
    @task.external_python(task_id="test_external_python_venv_task", python='/opt/airflow/venv1/bin/python3')
    def test_external_python_venv_def():
        """
        Example function that will be performed in a virtual environment.
        Importing at the module level ensures that it will not attempt to import the
        library before it is installed.
        """
        import sys
        from time import sleep

        print(f"Running task via {sys.executable}")
        print("Sleeping")
        for _ in range(4):
            print('Please wait...', flush=True)
            sleep(1)
        print('Finished')

        ########## MY CODE ##########
        import numpy as np
        import pandas as pd
        d = {'col1': [1, 2], 'col2': [3, 4]}
        df = pd.DataFrame(data=d)
        print(df)
        a = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
        print(a)
        # a = 10
        return a
        ########## XXXXX MY CODE XXXXX ##########

    external_python_task = test_external_python_venv_def()
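
If you prefer the classic operator style over the TaskFlow decorator, the same pre-built venv can also be used with ExternalPythonOperator directly. A minimal sketch, with dag_id, task_id and the callable chosen only for illustration:

from __future__ import annotations

import pendulum

from airflow import DAG
from airflow.operators.python import ExternalPythonOperator


def my_venv_callable():
    # Keep the imports inside the callable so they are resolved by the venv interpreter.
    import pandas as pd
    print("pandas version:", pd.__version__)


with DAG(
    dag_id="test_external_python_operator_dag",
    schedule=None,
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    ExternalPythonOperator(
        task_id="test_external_python_operator_task",
        python="/opt/airflow/venv1/bin/python3",
        python_callable=my_venv_callable,
    )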

8.) docker-compose.yml

Take the official docker-compose.yml file (https://airflow.apache.org/docs/apache-airflow/2.4.1/docker-compose.yaml) and modify this part:

## Feel free to modify this file to suit your needs.
---
version: '3'
x-airflow-common:
  &airflow-common
  # In order to add custom dependencies or upgrade provider packages you can use your extended image.
  # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml
  # and uncomment the "build" line below, Then run `docker-compose build` to build the images.
  image: ${AIRFLOW_IMAGE_NAME:-my-image-apache/airflow:2.4.1} # <- this matches the image name from the docker build command above
#  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.4.1} # <--- THIS WAS THE ORIGINAL
  environment:
    # .... many other settings are here originally in this environment section ....
    AIRFLOW__CORE__ENABLE_XCOM_PICKLING: 'true' # <-- ADD THIS. It lets Airflow pickle XCom values (needed because the task above returns a numpy array)
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
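
Side note: the XCom pickling flag is only needed because the example task above returns a numpy array. If you would rather keep the default JSON-based XCom serialization, you could return something JSON-serializable from the task instead; a tiny standalone illustration:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
a = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

# Instead of `return a` in test_external_python_venv_def(), return plain
# dicts/lists – they serialize as JSON, so no XCom pickling is required:
result = {"df": df.to_dict(), "array": a.tolist()}
print(result)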

9.) Start the containers

(be in the same directory as the file, which must be called "docker-compose.yml")

docker-compose up

or start detached from the terminal by

docker-compose up -d

10.) Logs

If you ever want to see the logs of your container: on macOS and Windows the Docker Desktop GUI lets you do that; on Linux you can use the following command

docker logs -f CONTAINER_ACTUAL_ID

You can quit it without closing the container by pressing

CTRL + c

11.) Shut down the containers:

  • normal way: docker-compose down
  • or, if docker-compose up is running attached in your terminal, press CTRL + C

!!! To stop and delete containers, delete volumes with database data and remove downloaded images, run:

docker-compose down --volumes --rmi all

Answered By: sogu