PyPI is slow. How do I run my own server?

Question:

When a new developer joins the team, or Jenkins runs a complete build, I need to create a fresh virtualenv. I often find that setting up a virtualenv with Pip and a large number (more than 10) of requirements takes a very long time to install everything from PyPI.
Often it fails altogether with:

Downloading/unpacking Django==1.4.5 (from -r requirements.pip (line 1))
Exception:
Traceback (most recent call last):
  File "/var/lib/jenkins/jobs/hermes-web/workspace/web/.venv/lib/python2.6/site-packages/pip-1.2.1-py2.6.egg/pip/basecommand.py", line 107, in main
    status = self.run(options, args)
  File "/var/lib/jenkins/jobs/hermes-web/workspace/web/.venv/lib/python2.6/site-packages/pip-1.2.1-py2.6.egg/pip/commands/install.py", line 256, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/var/lib/jenkins/jobs/hermes-web/workspace/web/.venv/lib/python2.6/site-packages/pip-1.2.1-py2.6.egg/pip/req.py", line 1018, in prepare_files
    self.unpack_url(url, location, self.is_download)
  File "/var/lib/jenkins/jobs/hermes-web/workspace/web/.venv/lib/python2.6/site-packages/pip-1.2.1-py2.6.egg/pip/req.py", line 1142, in unpack_url
    retval = unpack_http_url(link, location, self.download_cache, self.download_dir)
  File "/var/lib/jenkins/jobs/hermes-web/workspace/web/.venv/lib/python2.6/site-packages/pip-1.2.1-py2.6.egg/pip/download.py", line 463, in unpack_http_url
    download_hash = _download_url(resp, link, temp_location)
  File "/var/lib/jenkins/jobs/hermes-web/workspace/web/.venv/lib/python2.6/site-packages/pip-1.2.1-py2.6.egg/pip/download.py", line 380, in _download_url
    chunk = resp.read(4096)
  File "/usr/lib64/python2.6/socket.py", line 353, in read
    data = self._sock.recv(left)
  File "/usr/lib64/python2.6/httplib.py", line 538, in read
    s = self.fp.read(amt)
  File "/usr/lib64/python2.6/socket.py", line 353, in read
    data = self._sock.recv(left)
timeout: timed out

I’m aware of Pip’s --use-mirrors flag, and people on my team have sometimes worked around the problem by using --index-url http://f.pypi.python.org/simple (or another mirror) until they found one that responded in a timely fashion. We’re in the UK, but there’s a PyPI mirror in Germany, and we have no trouble downloading data from other sites.
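When a particular mirror does respond well, the choice can be pinned in pip’s config file rather than passed on every command line. A sketch (the mirror URL is just the one mentioned above; any responsive index works, and the timeout value is an arbitrary example):

```ini
[global]
timeout = 60
index-url = http://f.pypi.python.org/simple
```

This at least saves every developer from remembering the flag, though it doesn’t fix the underlying slowness.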

So, I’m looking at ways to mirror PyPI internally for our team.

The options I’ve looked at are:

  1. Running my own PyPI instance. There’s the official PyPI implementation, CheeseShop, as well as several third-party implementations such as djangopypi and pypiserver (see footnote).

    The problem with this approach is that I’m not interested in full PyPI functionality such as file upload; I just want to mirror the content it provides.

  2. Running a PyPI mirror with pep381client or pypi-mirror.

    This looks like it could work, but it requires my mirror to download everything from PyPI first. I’ve set up a test instance of pep381client, but my download speed varies between 5 Kb/s and 200 Kb/s (bits, not bytes). Unless there’s a copy of the full PyPI archive somewhere, it will take me weeks to have a useful mirror.

  3. Using a PyPI round-robin proxy such as yopypi.

    This is irrelevant now that http://pypi.python.org itself consists of several geographically distinct servers.

  4. Copying around a virtualenv between developers, or hosting a folder of the current project’s dependencies.

    This doesn’t scale: we have several different Python projects whose dependencies change (slowly) over time. As soon as the dependencies of any project change, this central folder must be updated to add the new dependencies. Copying the virtualenv is worse than copying the packages though, since any Python packages with C modules need to be compiled for the target system. Our team has both Linux and OS X users.

    (This still looks like the best option of a bad bunch.)
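A middle ground for option 4 is to share source packages rather than built virtualenvs, so each platform compiles its own C extensions. A sketch, assuming a shared folder at the hypothetical path /mnt/packages (--download was the pip 1.x-era flag for fetching packages without installing them):

```shell
# Refresh the shared folder whenever a project's requirements change:
pip install --download /mnt/packages -r requirements.txt

# Developers then install entirely from the shared folder, never touching PyPI:
pip install --no-index --find-links=file:///mnt/packages -r requirements.txt
```

The second command fails fast if a new dependency is missing from the folder, which makes stale copies obvious rather than silently slow.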

  5. Using an intelligent PyPI caching proxy: collective.eggproxy

    This seems like it would be a very good solution, but the last version on PyPI is dated 2009 and discusses mod_python.

What do other large Python teams do? What’s the best solution for quickly installing the same set of Python packages?

Footnotes:

Asked By: Wilfred Hughes


Answers:

Set up your local server, then modify the local computer’s hosts file to override the PyPI hostname so that it points to the local server, bypassing the normal DNS lookup. Delete the line from the hosts file when you’re done.

Or I suppose you could find the URL pip uses and modify that directly.

Answered By: user2197172

Do you have a shared filesystem?

If so, I would use pip’s download cache setting. It’s pretty simple: make a folder called pip-cache in /mnt, for example.

mkdir /mnt/pip-cache

Then each developer would put the following lines into their pip config (Unix: $HOME/.pip/pip.conf, Windows: %HOME%\pip\pip.ini)

[global]
download-cache = /mnt/pip-cache

Pip still checks PyPI for the latest version, then checks whether that version is already in the cache. If it is, pip installs from the cache; if not, it downloads the package, stores it in the cache, and installs it. Each package is therefore downloaded only once per new version.
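The lookup described above can be pictured with a small sketch. This is illustrative only: fetch_package and the stubbed download function are hypothetical names, not pip internals.

```python
import os
import shutil
import tempfile

def fetch_package(name, version, cache_dir, download):
    """Reuse a cached sdist if this exact version was fetched before;
    otherwise download it into the cache and use it from there."""
    filename = "%s-%s.tar.gz" % (name, version)
    cached = os.path.join(cache_dir, filename)
    if os.path.exists(cached):
        return cached, "cache"
    download(cached)  # stands in for an HTTP fetch from PyPI
    return cached, "network"

# Demo with a stubbed download that just creates an empty file:
cache = tempfile.mkdtemp()
fake_download = lambda dest: open(dest, "wb").close()

path1, source1 = fetch_package("Django", "1.4.5", cache, fake_download)
path2, source2 = fetch_package("Django", "1.4.5", cache, fake_download)
print(source1, source2)  # first call downloads, second hits the cache
shutil.rmtree(cache)
```

Because the cache key includes the version, upgrading a pinned requirement still triggers exactly one new download for the whole team.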

Answered By: aychedee

While it doesn’t solve your PyPI problem, handing built virtualenvs to developers (or deployments) can be done with Terrarium.

Use terrarium to package up, compress, and save virtualenvs. You can store them locally or even store them on S3. From the documentation on GitHub:

$ pip install terrarium
$ terrarium --target testenv --storage-dir /mnt/storage install requirements.txt

After building a fresh environment, terrarium will archive and compress the environment, and then copy it to the location specified by storage-dir.

On subsequent installs for the same requirement set that specify the same storage-dir, terrarium will copy and extract the compressed archive from /mnt/storage.

To display exactly how terrarium will name the archive, you can run the following command:

$ terrarium key requirements.txt more_requirements.txt
x86_64-2.6-c33a239222ddb1f47fcff08f3ea1b5e1

Answered By: Kyle Kelley

Take a look at David Wolever’s pip2pi. You can just set up a cron job to keep a company- or team-wide mirror of the packages you need, and then point your pips towards your internal mirror.
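A sketch of that setup, assuming pip2pi’s documented usage (a target directory followed by the same arguments you would give pip install); the paths and hostname here are hypothetical:

```shell
pip install pip2pi

# Build or refresh a partial mirror containing only the packages we use:
pip2pi /var/www/pypi -r requirements.txt

# Serve /var/www/pypi over HTTP, then point pip at the internal mirror:
pip install --index-url http://pypi.internal/simple/ -r requirements.txt
```

A crontab entry along the lines of `0 2 * * * pip2pi /var/www/pypi -r /srv/requirements.txt` keeps the mirror current as requirements evolve.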

Answered By: gotgenes

I recently installed devpi into my development team’s Vagrant configuration such that its package cache lives on the host’s file system. This allows each VM to have its own devpi-server daemon that it uses as the index-url for virtualenv/pip. When the VMs are destroyed and reprovisioned, the packages don’t have to be downloaded over and over. Each developer downloads them one time to build their local cache for as long as they live on the host’s file system.
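For reference, a minimal devpi setup looks roughly like the following; the port and index path are devpi’s documented defaults at the time of writing, so treat the exact values as assumptions:

```shell
pip install devpi-server

# Start the caching server; its root/pypi index transparently proxies
# and caches the public PyPI:
devpi-server --start --port 3141

# Point pip at the caching index instead of PyPI:
pip install -i http://localhost:3141/root/pypi/+simple/ -r requirements.txt
```

The first install of each package goes out to PyPI; every later install, including from a rebuilt VM, is served from the local cache.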

We also have an internal PyPI index for our private packages that’s currently just a directory served by Apache. Ultimately, I’m going to convert that to a devpi proxy server as well, so our build server will also maintain a package cache for our Python dependencies in addition to hosting our private libraries. This creates an additional buffer between our development environment, production deployments, and the public PyPI.

This seems to be the most robust solution I’ve found to these requirements to date.

Answered By: Joe Holloway