Using setuptools, how can I download external data upon installation?

Question:

I’d like to create some ridiculously-easy-to-use pip packages for loading common machine-learning datasets in Python. (Yes, some stuff already exists, but I want it to be even simpler.)

What I’d like to achieve is this:

  • User runs pip install dataset
  • pip downloads the dataset, say via wget http://mydata.com/data.tar.gz. Note that the data does not reside in the Python package itself, but is downloaded from somewhere else.
  • pip extracts the data from this file and puts it in the directory that the package is installed in. (This isn’t ideal, but the datasets are pretty small, so let’s assume storing the data here isn’t a big deal.)
  • Later, when the user imports my module, the module automatically loads the data from the specific location.

This question is about bullets 2 and 3. Is there a way to do this with setuptools?

Asked By: rd11


Answers:

The Python packaging guidelines state that installing a package should never execute arbitrary Python code. This means you may not be able to download anything during the installation process.

If you want to download additional data, do it after the package is installed: for example, when your package is imported you could download the data and cache it somewhere, so that it isn't re-downloaded on every import.
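
A minimal sketch of that idea, with a hypothetical per-user cache location (the URL is the one from the question):

import os
import urllib.request

# Hypothetical cache location; any writable per-user directory works.
CACHE_DIR = os.path.expanduser('~/.cache/dataset')
DATA_FILE = os.path.join(CACHE_DIR, 'data.tar.gz')

def ensure_data(url='http://mydata.com/data.tar.gz'):  # illustrative URL from the question
    # Download once; later imports reuse the cached copy.
    if not os.path.exists(DATA_FILE):
        os.makedirs(CACHE_DIR, exist_ok=True)
        urllib.request.urlretrieve(url, DATA_FILE)
    return DATA_FILE

# e.g. call ensure_data() from the package's __init__.py so it runs on import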

Answered By: sorin

“Note that the data does not reside in the Python package itself, but is downloaded from somewhere else.”

Please do not do this.

The whole point of Python packaging is to provide a completely deterministic, repeatable, and reusable means of installing exactly the same thing every time. Your proposal has the following problems at a minimum:

  • The end user might download your package on computer A, stick it on a thumb drive, and then install it on computer B which does not have internet.
  • The data on the web might change, meaning that two people who install the same exact package get different results.
  • The website that provides the data might cease to exist or unwisely change the URL, meaning people who still have the package won’t be able to use it.
  • The user could be behind an internet filter, and you might get a useless “this page is blocked” HTML file instead of the dataset you were expecting.

Instead, you should either include your data with the package (using the package_data or data_files arguments to setup()), or provide a separate top-level function in your Python code to download the data manually when the user is ready to do so.
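
For reference, the bundled-data option might look like this in setup.py; a minimal sketch, with hypothetical package and file names:

from setuptools import setup

setup(
    name='dataset',                # hypothetical package name
    version='0.1',
    packages=['dataset'],
    # Ship the data inside the package so installs are self-contained;
    # the pattern is relative to the 'dataset' package directory.
    package_data={'dataset': ['data/*.csv']},
)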

Answered By: Kevin

As alluded to by Kevin, Python package installs should be completely reproducible, and any potential external-download issues should be pushed to runtime. This therefore shouldn’t be handled with setuptools.

Instead, to avoid burdening the user, consider downloading the data lazily, upon first load. Example:

import os
import tarfile
import urllib.request

DATA_DIR = os.path.join(os.path.dirname(__file__), 'data')  # alongside the module

def download_data(url='http://...'):
    # Download the archive and extract the data to DATA_DIR.
    # urllib raises an exception if the link is bad or we can't connect.
    archive_path, _ = urllib.request.urlretrieve(url)
    with tarfile.open(archive_path) as archive:
        archive.extractall(DATA_DIR)

def load_data():
    if not os.path.exists(DATA_DIR):
        download_data()
    data = read_data_from_disk(DATA_DIR)  # your format-specific reader
    return data

We could then describe download_data in the docs, but the majority of users would never need to bother with it. This is somewhat similar to how the imageio module downloads the decoders it needs at runtime, rather than making the user manage those external downloads themselves.

Answered By: rd11

This question is rather old, but I want to add that downloading external data at installation time is of course much better than forcing a download of external content at runtime.

The original problem is that one cannot package arbitrary content into a Python package if it exceeds the maximum size limit of the package registry. This size limit effectively breaks up the relationship between the packaged Python code and the data it operates on. Suddenly, things that belong together have to be separated, and the package creator needs to take care of the versioning and availability of external data. If the size limits were met, everything would be installed at installation time and the discussion would be over here. I want to stress that data and algorithms belong together and are normally installed at the same time, not at some later date. That's the whole point of package integrity. If you cannot install a package because the external content cannot be downloaded, you want to know at installation time.

In light of Docker and friends, downloading data at runtime makes a container non-reproducible and forces the download of the external content at each start of the container, unless you additionally mount the path where the data is downloaded as a Docker volume. But then you need to know exactly where this content is downloaded, and the user or Dockerfile creator has to know more unnecessary details. There are further issues with using volumes in this regard.

Moreover, content fetched at runtime cannot be cached automatically by Docker, i.e. it has to be fetched again every time after a docker build.

Then again, one could argue that one should provide a function or executable script that downloads this external content, and that the user should execute this script directly after installation. Again, the user of the package needs to know more than necessary, because someone or some committee proclaims that executing Python code or downloading external content at installation time is not "recommended".
But forcing the user to run an extra script directly after installing a package is factually the same as downloading the content in a post-installation step, just more user-unfriendly. Considering how popular machine learning is today, the growing size of models, and the likely popularity of ML in the future, this argumentation means that in the near future there will be a lot of scripts to execute just to download models for a handful of Python package dependencies.

The only time I see a benefit in an extra script is when you can choose between several different versions of the external content to download, but then one intentionally involves the user in that decision.
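
If one does go the extra-script route, the script can at least be exposed as a console entry point so the user doesn't have to hunt for it. A minimal sketch, assuming a hypothetical dataset package with a download_data function:

from setuptools import setup

setup(
    name='dataset',                # hypothetical package name
    version='0.1',
    packages=['dataset'],
    entry_points={
        'console_scripts': [
            # After `pip install dataset`, the user can run `dataset-download`:
            'dataset-download = dataset:download_data',
        ],
    },
)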

But coming back to the runtime on-demand lazy model download, where the user doesn't need to be involved in executing an extra script: let's assume the user packages the container, all tests pass on CI, and he/she distributes it to Docker Hub or any other container registry and starts production. Nobody then wants random failures because a successfully installed package intermittently downloads content, e.g. after some maintenance task such as cleaning up Docker volumes, or when containers are distributed to new k8s nodes and the first request to a web app times out because the external content is always fetched at startup. Or the content isn't fetched at all, because the external URL is in maintenance mode. That's a nightmare!

If reasonably sized Python packages were allowed, the whole problem would be much less of an issue. By contrast, the biggest Ruby gems (i.e. packages in the Ruby ecosystem) are over 700 MB, and in that ecosystem it is of course allowed to download external content at installation time.

Answered By: lumpidu