How do I include and install test files in a wheel and deploy to Databricks

Question:

I’m developing some code that runs on Databricks. Given that Databricks can’t be run locally, I need to run unit tests on a Databricks cluster. The problem is that when I install the wheel that contains my files, the test files are never installed. How do I install the test files?

Ideally I would like to keep src and tests in separate folders.


Here is my project’s (pyproject.toml only) folder structure:

project
├── src
│   └── mylib
│       ├── functions.py
│       └── __init__.py
├── pyproject.toml
├── poetry.lock
└── tests
    ├── conftest.py
    └── test_functions.py

My pyproject.toml:

[tool.poetry]
name = "mylib"
version = "0.1.0"
packages = [
    {include = "mylib", from = "src"},
    {include = "tests"}
]

[tool.poetry.dependencies]
python = "^3.8"
pytest = "^7.1.2"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

Without {include = "tests"} in pyproject.toml, poetry build doesn’t include tests.

After poetry build I can see that the tests are included in the produced wheel (python3 -m wheel unpack <mywheel.whl>). But after I deploy it as a library on a Databricks cluster, I do not see any tests folder (ls -r .../site-packages/mylib* in a Databricks notebook shell cell), though functions.py is installed.
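
For reference, a quick way to double-check what actually ships in the wheel is to list its entries (a minimal sketch; the wheel path below is a placeholder for whatever poetry build drops under dist/):

# a wheel is just a zip archive, so its entry list is exactly what
# pip will copy into site-packages on the cluster
import zipfile

wheel_path = "dist/mylib-0.1.0-py3-none-any.whl"  # placeholder, adjust to your build output

with zipfile.ZipFile(wheel_path) as whl:
    names = whl.namelist()

for name in names:
    print(name)

# the tests/ entries must show up here for them to ever reach the cluster
print("tests included:", any(n.startswith("tests/") for n in names))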

I also tried moving tests under src and updating the toml to {include = "tests", from = "src"}. The wheel produced then contains both mylib and tests with the appropriate files, but only mylib gets installed on Databricks.

project
├── src
│   ├── mylib
│   │   ├── functions.py
│   │   └── __init__.py
│   └── tests
│       ├── conftest.py
│       └── test_functions.py
├── pyproject.toml
└── poetry.lock

Since someone is pointing to dbx as the solution, I’ve tried to use it. It doesn’t work. It has a bunch of basic restrictions (e.g. you must use the ML runtime), which renders it useless, not to mention it expects that you use whatever toolset it recommends. Perhaps in a few years it will do what this post needs.

Asked By: Kashyap


Answers:

Author of dbx here.

I’ve updated the public doc; please take a look at this section for details on how to set up integration tests.

UPD. as per comment:

It has a bunch of basic restrictions (e.g. must use ML runtime)

This is not a requirement; you just need to use any Databricks Runtime 10+. We’ll change the project doc accordingly to point out that this is not a limitation anymore.

it expects that you use whatever toolset it recommends

This statement is simply incorrect.

Here is a step-by-step walkthrough for exactly the same setup as above (maybe this is unclear from the doc, but it contains exactly the same steps):

  1. Create a project dir and move into it:
mkdir mylib && cd mylib
  2. Initialise a poetry project in it:
poetry init -n
  3. Provide the following pyproject.toml:
[tool.poetry]
name = "mylib"
version = "0.1.0"
# without description and authors it won't be compiled to a wheel
description = "some description"
authors = []

packages = [
    {include = "mylib", from = "src"},
]

[tool.poetry.dependencies]
python = "^3.8"
pytest = "^7.1.2"
pytest-cov = "^3.0.0"
dbx = "^0.7.3"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
  4. Install dependencies locally to make dbx available:
poetry install
  5. Write some sample code, e.g. src/mylib/functions.py:
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

def cast_column_to_string(df: DataFrame, col_name: str) -> DataFrame:
    return df.withColumn(col_name, col(col_name).cast("string"))
  6. Write a test for it in tests/integration/sample_test.py:
from mylib.functions import cast_column_to_string
from pyspark.sql import SparkSession

def test_column_to_string():
    spark = SparkSession.builder.getOrCreate()
    df = spark.range(0,10)
    _converted = cast_column_to_string(df, "id")
    assert dict(_converted.dtypes)["id"] == "string"
  7. Create an entrypoint file tests/entrypoint.py:
import sys

import pytest

if __name__ == '__main__':
    pytest.main(sys.argv[1:])
  8. Configure the test workflow in conf/deployment.yml:

custom:
  basic-cluster-props: &basic-cluster-props
    spark_version: "10.4.x-cpu-ml-scala2.12"

  basic-static-cluster: &basic-static-cluster
    new_cluster:
      <<: *basic-cluster-props
      num_workers: 1
      node_type_id: "{{some-node-type}}"

build:
  commands:
    - "poetry build -f wheel" #safe to use inside poetry venv

environments:
  default:
    workflows:
      - name: "mylib-tests"
        tasks:
          - task_key: "main"
            <<: *basic-static-cluster
            spark_python_task:
                python_file: "file://tests/entrypoint.py"
                # this call supports all standard pytest arguments
                parameters: ["file:fuse://tests/integration", "--cov=mylib"]
  9. Configure dbx to use a specific profile:
dbx configure --profile=<your Databricks CLI profile name>

Checkpoint: at this point the final layout looks like this:

.
├── conf
│   └── deployment.yml
├── poetry.lock
├── pyproject.toml
├── src
│   └── mylib
│       └── functions.py
└── tests
    ├── entrypoint.py
    └── integration
        └── sample_test.py

  10. Launch the tests on an all-purpose cluster (non-ML clusters are also supported since Databricks Runtime 10+):
dbx execute mylib-tests --task=main --cluster-name=<some-all-purpose-cluster>
  11. [Optional] Launch the tests as a job on a job cluster:
dbx deploy mylib-tests --assets-only
dbx launch mylib-tests --from-assets
Answered By: renardeinside

If anyone else is suffering, here is what we finally ended up doing.

TL;DR:

  • Create a unit_test_runner.py that can install a wheel file and execute the tests inside it. The key is to install the wheel at "notebook scope".
  • Deploy/copy unit_test_runner.py to Databricks DBFS and create a job pointing to it. The job parameter is the wheel file to pytest.
  • Build a wheel of your code, copy it to DBFS, and run the unit_test_runner job with the location of the wheel file as the parameter.

Project structure:

root
├── dist
│   └── my_project-0.1.0-py3-none-any.whl
├── poetry.lock
├── poetry.toml
├── pyproject.toml
├── module1.py
├── module2.py
├── housekeeping.py
├── common
│   └── aws.py
├── tests
│   ├── conftest.py
│   ├── test_module1.py
│   ├── test_module2.py
│   └── common
│       └── test_aws.py
└── unit_test_runner.py

unit_test_runner.py

import importlib.util
import logging
import os
import shutil
import sys
from enum import IntEnum

import pip
import pytest


def main(args: list) -> int:
    coverage_opts = []
    if args and args[0] == '--cov':
        coverage_opts = ['--cov']
        wheels_to_test = args[1:]
    else:
        wheels_to_test = args

    logging.info(f'coverage_opts: {coverage_opts}, wheels_to_test: {wheels_to_test}')

    overall_rc = 0
    for wh_file in wheels_to_test:
        logging.info('pip install %s', wh_file)
        pip.main(['install', wh_file])
        # we assume a wheel name like <pkg name>-<version>-...,
        # e.g. my_module-0.1.0-py3-none-any.whl
        pkg_name = os.path.basename(wh_file).split('-')[0]
        # locate the installed package without importing it, to avoid any issues with coverage data
        pkg_root = os.path.dirname(importlib.util.find_spec(pkg_name).origin)
        os.chdir(pkg_root)

        pytest_opts = [f'--rootdir={pkg_root}']
        pytest_opts.extend(coverage_opts)

        logging.info(f'pytest_opts: {pytest_opts}')
        rc = pytest.main(pytest_opts)
        logging.info(f'pytest-status: {rc}, wheel: {wh_file}')
        generate_coverage_data(pkg_name, pkg_root, wh_file)

        # keep the first non-zero status, but still run tests for every wheel
        rc = rc.value if isinstance(rc, IntEnum) else rc
        overall_rc = overall_rc or rc

    return overall_rc


def generate_coverage_data(pkg_name, pkg_root, wh_file):
    if os.path.exists(f'{pkg_root}/.coverage'):
        shutil.rmtree(f'{pkg_root}/htmlcov', ignore_errors=True)
        output_tar = f'{os.path.dirname(wh_file)}/{pkg_name}-coverage.tar.gz'
        rc = os.system(f'coverage html --data-file={pkg_root}/.coverage && tar -cvzf {output_tar} htmlcov')
        logging.info('rc: %s, coverage data available at: %s', rc, output_tar)


if __name__ == "__main__":
    # make the INFO messages above visible in the job output
    logging.basicConfig(level=logging.INFO)
    # silence annoying py4j logging
    logging.getLogger("py4j").setLevel(logging.ERROR)
    logging.info('sys.argv[1:]: %s', sys.argv[1:])
    rc = main(sys.argv[1:])
    if rc != 0:
        raise Exception(f'Unit test execution failed. rc: {rc}, sys.argv[1:]: {sys.argv[1:]}')

Copy the runner to DBFS (adjust the workspace root and user name for your environment):

WORKSPACE_ROOT='/home/kash/workspaces'
USER_NAME='[email protected]'
cd $WORKSPACE_ROOT/my_project
echo 'copying runner..' && 
  databricks fs cp --overwrite unit_test_runner.py dbfs:/user/$USER_NAME/
  • Go to the Databricks GUI and create a job pointing to dbfs:/user/$USER_NAME/unit_test_runner.py. This can also be done using the CLI (see the sketch after this list).
    • Type of job: Python Script
    • Source: DBFS/S3
    • Path: dbfs:/user/$USER_NAME/unit_test_runner.py
  • Run databricks jobs list to find job id, e.g. 123456789
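
For reference, the job created through the GUI steps above corresponds roughly to the settings below; this is a sketch assuming the legacy databricks CLI and Jobs 2.0-style job settings, with placeholder cluster and path values:

# make_job_json.py -- writes a job spec equivalent to the GUI steps above
# (placeholder values; adjust spark_version, node_type_id and the DBFS path)
import json

job_settings = {
    "name": "unit-test-runner",
    "new_cluster": {
        "spark_version": "10.4.x-scala2.12",  # any DBR 10+ runtime
        "node_type_id": "<some-node-type>",   # placeholder
        "num_workers": 1,
    },
    # "Python Script" task sourced from DBFS, as configured in the GUI
    "spark_python_task": {
        "python_file": "dbfs:/user/<your-user-name>/unit_test_runner.py",
    },
    "max_concurrent_runs": 1,
}

with open("job.json", "w") as fh:
    json.dump(job_settings, fh, indent=2)

The resulting file can be passed to databricks jobs create --json-file job.json. Either way, build the wheel, copy it to DBFS, and launch the job; the wheel location is passed at run time via --python-params: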
cd $WORKSPACE_ROOT/my_project
poetry build -f wheel # could be replaced with any builder that creates a wheel file
whl_file=$(ls -1tr dist/my_project*-py3-none-any.whl | tail -1 | xargs basename)
echo 'copying wheel...' && databricks fs cp --overwrite dist/$whl_file dbfs:/user/$USER_NAME/wheels
echo 'launching job..' && 
  databricks jobs run-now --job-id 123456789 --python-params "[\"/dbfs/user/$USER_NAME/wheels/$whl_file\"]"
# OR with coverage
echo 'launching job with coverage..' && 
  databricks jobs run-now --job-id 123456789 --python-params "[\"--cov\", \"/dbfs/user/$USER_NAME/wheels/$whl_file\"]"

If you ran with the --cov option, here is how to fetch and open the coverage report (the archive name comes from generate_coverage_data above):

rm -rf htmlcov/ my_project-coverage.tar.gz
databricks fs cp dbfs:/user/$USER_NAME/wheels/my_project-coverage.tar.gz .
tar -xvzf my_project-coverage.tar.gz
firefox htmlcov/index.html
Answered By: Kashyap