How can I develop with Python libraries in editable mode on databricks?

Question:

On Databricks, it is possible to install Python packages directly from a Git repo or from DBFS:

%pip install git+https://github/myrepo
%pip install /dbfs/my-library-0.0.0-py3-none-any.whl 

Is there a way to enable a live package development mode, similar to pip install -e, such that the Databricks notebook references the library files as-is and it's possible to update the library files on the go?

E.g. something like

%pip install /dbfs/my-library/ -e

combined with a way to keep my-library up-to-date?

Thanks!

Asked By: elke


Answers:

I would recommend adopting the Databricks Repos functionality, which allows you to import Python code into a notebook as a normal package, including automatic reloading of the code when the package's code changes.

You need to add the following two lines to the notebook that uses the Python package you're developing:

%load_ext autoreload
%autoreload 2

Your library is recognized automatically because the main folders of Databricks Repos are added to sys.path. If your library lives in a Repo subfolder, you can add it manually:

import os, sys
sys.path.append(os.path.abspath('/Workspace/Repos/<username>/path/to/your/library'))
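Since notebook cells are often re-run, it can help to make the append idempotent. A small sketch (ensure_on_path is my own helper name, not a Databricks API):

```python
import os
import sys

def ensure_on_path(path):
    """Append path to sys.path exactly once, so re-running the
    notebook cell does not accumulate duplicate entries."""
    path = os.path.abspath(path)
    if path not in sys.path:
        sys.path.append(path)
    return path

# ensure_on_path('/Workspace/Repos/<username>/path/to/your/library')
```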

This works on the driver node, but not on the worker nodes.
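One common workaround for the worker nodes is to zip the package and ship it to the executors with SparkContext.addPyFile. A minimal sketch, assuming the Repo path from above (package_library is my own helper name):

```python
import shutil

def package_library(src_dir, archive_base='/tmp/my_library'):
    """Zip a package directory so it can be shipped to worker nodes.
    Returns the path of the created .zip archive."""
    return shutil.make_archive(archive_base, 'zip', src_dir)

# On Databricks (sketch, not run here): ship the zipped package to every
# executor so workers can import it too.
# archive = package_library('/Workspace/Repos/<username>/path/to/your/library')
# spark.sparkContext.addPyFile(archive)
```

Note that the archive is shipped once per SparkContext, so this does not give live reloading on the workers; you would need to re-zip and restart after changes.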

P.S. You can see examples in this Databricks cookbook and in this repository.

Answered By: Alex Ott

You can do %pip install -e at notebook scope, but you will need to redo it every time the notebook is reattached. Code changes do not seem to be picked up by autoreload either, since editable mode does not append to sys.path; instead it places a symlink-style hook in site-packages.
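To see what an editable install actually created, you can inspect site-packages for the hook files pip leaves behind (on recent pip versions these are __editable__*.pth / finder files rather than a true symlink). A sketch (find_editable_entries is my own name):

```python
import os
import site

def find_editable_entries():
    """List the .pth / __editable__ hook files that pip leaves in
    site-packages for editable installs."""
    entries = []
    for sp in site.getsitepackages():
        if not os.path.isdir(sp):
            continue
        for name in sorted(os.listdir(sp)):
            if name.startswith('__editable__') or name.endswith('.pth'):
                entries.append(os.path.join(sp, name))
    return entries
```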

Editable mode at cluster scope, however, does not seem to work for me.

Answered By: Carey

I did some more tests; here are my findings for pip install in editable mode:

(1) If I am currently working in /Workspace/xxx/Repo1 and run %pip install -e /Workspace/xxx/Repo2 at notebook scope, the package is only recognized on the driver node, not on the worker nodes when you run an RDD. A class or function from Repo2 called from Repo1 works fine as long as it is used only on the driver node, but it fails on the worker nodes, because they do not get /Workspace/xxx/Repo2 appended to sys.path. Apparently the worker node path is out of sync with the driver node after %pip install in editable mode.

(2) Manually appending /Workspace/xxx/Repo2 to sys.path while working in a notebook at /Workspace/xxx/Repo1: this also works only on the driver node, not on the worker nodes. To make it work on a worker node, you need to append the same sys.path entry inside every function you submit to the workers, which is not ideal.
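The per-function append described in (2) can at least be factored into a decorator, so each function submitted to the workers fixes up its own sys.path. A sketch, where with_repo_path is my own name and the Repo path is the one from the findings above:

```python
import functools
import sys

REPO_PATH = '/Workspace/xxx/Repo2'  # path from the findings above

def with_repo_path(func):
    """Ensure REPO_PATH is on sys.path on whichever node runs func,
    then call func. Meant to wrap functions submitted to workers."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if REPO_PATH not in sys.path:
            sys.path.append(REPO_PATH)
        return func(*args, **kwargs)
    return wrapper

# Usage sketch on an RDD:
# rdd.map(with_repo_path(my_repo2_function)).collect()
```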

(3) Installing /Workspace/xxx/Repo2 in editable mode from an init script: this works on both the driver node and the worker nodes, because the environment path is set up at cluster init time. This is the best option in my opinion, as it ensures consistency across all notebooks. The only downside is that /Workspace is not yet mounted at cluster init time, so it is not accessible; I could only make it work with pip install -e /dbfs/xxx/Repo2.
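For option (3), the init script itself can be as small as the following sketch; it assumes the cluster's Python lives at /databricks/python, which is the usual location on Databricks runtimes:

```shell
#!/bin/bash
# Cluster-scoped init script (sketch): install the library in editable mode
# from DBFS, because /Workspace is not mounted yet when init scripts run.
/databricks/python/bin/pip install -e /dbfs/xxx/Repo2
```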

Answered By: Carey