How do I include and install test files in a wheel and deploy to Databricks
Question:
I’m developing some code that runs on Databricks. Given that Databricks can’t be run locally, I need to run unit tests on a Databricks cluster. The problem is that when I install the wheel that contains my files, the test files are never installed. How do I install the test files?
Ideally I would like to keep src and tests in separate folders.
Here is my project’s folder structure (pyproject.toml only):
project
├── src
│   └── mylib
│       ├── functions.py
│       └── __init__.py
├── pyproject.toml
├── poetry.lock
└── tests
    ├── conftest.py
    └── test_functions.py
My pyproject.toml:
[tool.poetry]
name = "mylib"
version = "0.1.0"
packages = [
    {include = "mylib", from = "src"},
    {include = "tests"}
]
[tool.poetry.dependencies]
python = "^3.8"
pytest = "^7.1.2"
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
Without {include = "tests"} in pyproject.toml, poetry build doesn’t include tests.
After poetry build I can see that the tests are included in the produced wheel (python3 -m wheel unpack <mywheel.whl>). But after I deploy it as a library on a Databricks cluster, I do not see any tests folder (ls -r .../site-packages/mylib* in a Databricks notebook shell cell), though functions.py is installed.
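For reference, here is a minimal sketch of that check from a notebook cell (assuming the package imports as mylib):

# Minimal sketch: locate site-packages via the installed mylib package
# and list anything that looks like it came from the wheel.
import importlib.util
import os

spec = importlib.util.find_spec("mylib")
site_packages = os.path.dirname(os.path.dirname(spec.origin))
print(sorted(p for p in os.listdir(site_packages)
             if p.startswith(("mylib", "tests"))))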
I also tried moving tests under src and updating the toml to {include = "tests", from = "src"}, but then the produced wheel file contains mylib and tests with the appropriate files, and still only mylib gets installed on Databricks.
project
├── src
│   ├── mylib
│   │   ├── functions.py
│   │   └── __init__.py
│   └── tests
│       ├── conftest.py
│       └── test_functions.py
├── pyproject.toml
└── poetry.lock
As someone has tried to point to dbx as the solution: I’ve tried to use it. It doesn’t work. It has a bunch of basic restrictions (e.g. you must use an ML runtime), which renders it useless, not to mention it expects you to use whatever toolset it recommends. Perhaps in a few years it will do what this post needs.
Answers:
Author of dbx here.
I’ve updated the public doc; please take a look at this section for details on how to set up integration tests.
UPD, as per the comment:
It has a bunch of basic restrictions (e.g. must use ML runtime)
This is not a requirement; you just need any Databricks Runtime 10+. We’ll update the project doc to point out that this is no longer a limitation.
it expects that you use whatever toolset it recommends
This statement is simply incorrect.
Here is a step-by-step walkthrough for exactly the same setup as above (maybe this is unclear from the doc, but it contains exactly the same steps):
- Create a project dir and move into it:
mkdir mylib && cd mylib
- Initialise a poetry project in it:
poetry init -n
- Provide the following pyproject.toml:
[tool.poetry]
name = "mylib"
version = "0.1.0"
# without description and authors it won't be compiled to a wheel
description = "some description"
authors = []
packages = [
    {include = "mylib", from = "src"},
]
[tool.poetry.dependencies]
python = "^3.8"
pytest = "^7.1.2"
pytest-cov = "^3.0.0"
dbx = "^0.7.3"
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
- Install dependencies locally to make dbx available:
poetry install
- Write some sample code, e.g. src/mylib/functions.py:
from pyspark.sql import DataFrame
from pyspark.sql.functions import col
def cast_column_to_string(df: DataFrame, col_name: str) -> DataFrame:
    return df.withColumn(col_name, col(col_name).cast("string"))
- Write a test for it in tests/integration/sample_test.py:
from mylib.functions import cast_column_to_string
from pyspark.sql import SparkSession
def test_column_to_string():
    spark = SparkSession.builder.getOrCreate()
    df = spark.range(0, 10)
    _converted = cast_column_to_string(df, "id")
    assert dict(_converted.dtypes)["id"] == "string"
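As a side note (my addition, not required by dbx): if you prefer a shared SparkSession across tests, a session-scoped fixture in a hypothetical tests/integration/conftest.py could look like this:

# Sketch: one SparkSession shared by all tests in this directory.
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark() -> SparkSession:
    return SparkSession.builder.getOrCreate()

The test above would then accept spark as a parameter instead of calling getOrCreate() itself.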
- Create an entrypoint file tests/entrypoint.py:
import sys
import pytest
if __name__ == '__main__':
    pytest.main(sys.argv[1:])
- Configure the test workflow in conf/deployment.yml:
custom:
  basic-cluster-props: &basic-cluster-props
    spark_version: "10.4.x-cpu-ml-scala2.12"

  basic-static-cluster: &basic-static-cluster
    new_cluster:
      <<: *basic-cluster-props
      num_workers: 1
      node_type_id: "{{some-node-type}}"

build:
  commands:
    - "poetry build -f wheel" # safe to use inside a poetry venv

environments:
  default:
    workflows:
      - name: "mylib-tests"
        tasks:
          - task_key: "main"
            <<: *basic-static-cluster
            spark_python_task:
              python_file: "file://tests/entrypoint.py"
              # this call supports all standard pytest arguments
              parameters: ["file:fuse://tests/integration", "--cov=mylib"]
- Configure dbx to use a specific profile:
dbx configure --profile=<your Databricks CLI profile name>
Checkpoint: at this point the project layout looks like this:
.
├── conf
│   └── deployment.yml
├── poetry.lock
├── pyproject.toml
├── src
│   └── mylib
│       └── functions.py
└── tests
    ├── entrypoint.py
    └── integration
        └── sample_test.py
- Launch the tests on an all-purpose cluster (non-ML clusters are also supported since Databricks Runtime 10+):
dbx execute mylib-tests --task=main --cluster-name=<some-all-purpose-cluster>
- [Optional] Launch tests as a job on a job cluster:
dbx deploy mylib-tests --assets-only
dbx launch mylib-tests --from-assets
If anyone else is struggling with this, here is what we finally ended up doing.
TL;DR:
- Create a unit_test_runner.py that can install a wheel file and execute the tests inside it. The key is to install the wheel at "notebook scope".
- Deploy/copy unit_test_runner.py to Databricks DBFS and create a job pointing to it. The job parameter is the wheel file to pytest.
- Build a wheel of your code, copy it to Databricks DBFS, and run the unit-test-runner job with the location of the wheel file as the parameter.
Project structure:
root
├── dist
│   └── my_project-0.1.0-py3-none-any.whl
├── poetry.lock
├── poetry.toml
├── pyproject.toml
├── module1.py
├── module2.py
├── housekeeping.py
├── common
│   └── aws.py
├── tests
│   ├── conftest.py
│   ├── test_module1.py
│   ├── test_module2.py
│   └── common
│       └── test_aws.py
└── unit_test_runner.py
unit_test_runner.py:
import importlib.util
import logging
import os
import shutil
import sys
from enum import IntEnum

import pip
import pytest


def main(args: list) -> int:
    coverage_opts = []
    if '--cov' == args[0]:
        coverage_opts = ['--cov']
        wheels_to_test = args[1:]
    else:
        wheels_to_test = args
    logging.info(f'coverage_opts: {coverage_opts}, wheels_to_test: {wheels_to_test}')
    for wh_file in wheels_to_test:
        logging.info('pip install %s', wh_file)
        # note: pip.main is not a public API, but it works here and keeps
        # the install at notebook/driver scope
        pip.main(['install', wh_file])
        # we assume a wheel name like <pkg name>-<version>-...
        # e.g. my_module-0.1.0-py3-none-any.whl
        pkg_name = os.path.basename(wh_file).split('-')[0]
        # locate the installed package without importing it,
        # to avoid any issues with coverage data
        pkg_root = os.path.dirname(importlib.util.find_spec(pkg_name).origin)
        os.chdir(pkg_root)
        pytest_opts = [f'--rootdir={pkg_root}']
        pytest_opts.extend(coverage_opts)
        logging.info(f'pytest_opts: {pytest_opts}')
        rc = pytest.main(pytest_opts)
        logging.info(f'pytest-status: {rc}, wheel: {wh_file}')
        generate_coverage_data(pkg_name, pkg_root, wh_file)
    return rc.value if isinstance(rc, IntEnum) else rc


def generate_coverage_data(pkg_name, pkg_root, wh_file):
    if os.path.exists(f'{pkg_root}/.coverage'):
        shutil.rmtree(f'{pkg_root}/htmlcov', ignore_errors=True)
        output_tar = f'{os.path.dirname(wh_file)}/{pkg_name}-coverage.tar.gz'
        rc = os.system(f'coverage html --data-file={pkg_root}/.coverage && tar -cvzf {output_tar} htmlcov')
        logging.info('rc: %s, coverage data available at: %s', rc, output_tar)


if __name__ == "__main__":
    # silence annoying py4j logging
    logging.getLogger("py4j").setLevel(logging.ERROR)
    logging.info('sys.argv[1:]: %s', sys.argv[1:])
    rc = main(sys.argv[1:])
    if rc != 0:
        raise Exception(f'Unit test execution failed. rc: {rc}, sys.argv[1:]: {sys.argv[1:]}')
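As a side note, the runner can be smoke-tested outside Databricks before wiring up the job. This is my own addition; it assumes a local venv with pytest and pip available and a freshly built wheel in dist/:

# Hypothetical local smoke test of the runner, equivalent to:
#   python unit_test_runner.py --cov dist/my_project-0.1.0-py3-none-any.whl
import unit_test_runner

rc = unit_test_runner.main(["--cov", "dist/my_project-0.1.0-py3-none-any.whl"])
print(f"runner exit code: {rc}")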
- Install and configure databricks-cli. See instructions here.
WORKSPACE_ROOT='/home/kash/workspaces'
USER_NAME='[email protected]'
cd $WORKSPACE_ROOT/my_project
echo 'copying runner..' &&
databricks fs cp --overwrite unit_test_runner.py dbfs:/user/$USER_NAME/
- Go to the Databricks GUI and create a job pointing to dbfs:/user/$USER_NAME/unit_test_runner.py. This can also be done using the CLI or the REST API (a sketch follows below).
  - Type of job: Python Script
  - Source: DBFS/S3
  - Path: dbfs:/user/$USER_NAME/unit_test_runner.py
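For the CLI/API route, here is a hedged sketch against the Jobs 2.0 create endpoint; the host, token, runtime and node type below are placeholders to adjust for your workspace:

# Sketch: create the runner job via the Jobs 2.0 REST API instead of the GUI.
# DATABRICKS_HOST (like https://<workspace>.cloud.databricks.com) and
# DATABRICKS_TOKEN are assumed environment variables; requests is third-party.
import os

import requests

payload = {
    "name": "unit-test-runner",
    "new_cluster": {
        "spark_version": "10.4.x-scala2.12",  # assumption: pick your runtime
        "node_type_id": "i3.xlarge",          # assumption: pick your node type
        "num_workers": 1,
    },
    "spark_python_task": {
        "python_file": "dbfs:/user/[email protected]/unit_test_runner.py"
    },
}
resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.0/jobs/create",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # {'job_id': ...}, used with run-now below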
- Run databricks jobs list to find the job id, e.g. 123456789.
cd $WORKSPACE_ROOT/my_project
poetry build -f wheel # could be replaced with any builder that creates a wheel file
whl_file=$(ls -1tr dist/my_project*-py3-none-any.whl | tail -1 | xargs basename)
echo 'copying wheel...' && databricks fs cp --overwrite dist/$whl_file dbfs:/user/$USER_NAME/wheels
echo 'launching job...' &&
databricks jobs run-now --job-id 123456789 --python-params "[\"/dbfs/user/$USER_NAME/wheels/$whl_file\"]"
# OR with coverage
echo 'launching job with coverage...' &&
databricks jobs run-now --job-id 123456789 --python-params "[\"--cov\", \"/dbfs/user/$USER_NAME/wheels/$whl_file\"]"
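run-now prints a run_id. If you want to block until the run finishes before pulling the coverage report, a small polling sketch against the Jobs 2.0 runs/get endpoint (same placeholder host/token environment variables as above) could look like this:

# Sketch: wait for a Databricks job run to finish before fetching artifacts.
import os
import time

import requests


def wait_for_run(run_id: int) -> str:
    while True:
        resp = requests.get(
            f"{os.environ['DATABRICKS_HOST']}/api/2.0/jobs/runs/get",
            headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
            params={"run_id": run_id},
        )
        resp.raise_for_status()
        state = resp.json()["state"]
        if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
            return state.get("result_state", "UNKNOWN")
        time.sleep(30)

# e.g. print(wait_for_run(12345))  # "SUCCESS" if all tests passed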
If you ran with the --cov option, then to fetch and open the coverage report:
rm -rf htmlcov/ my_project-coverage.tar.gz
databricks fs cp dbfs:/user/$USER_NAME/wheels/my_project-coverage.tar.gz .
tar -xvzf my_project-coverage.tar.gz
firefox htmlcov/index.html