Preventing namespace collisions between private and pypi-based Python packages
Question:
We have 100+ private packages, and so far we’ve been using s3pypi to set up a private PyPI in an S3 bucket. Our private packages have dependencies on each other (and on public packages), and it is (of course) important that our GitLab pipelines find the latest functional version of the packages they rely on. That is, we’re not interested in the latest checked-in code: we create new wheels only after tests and QA have run against a push to master (which is a long-winded way of explaining that -e <vcs> requirements will not work).
Our setup works really well until someone creates a new public package on the official PyPI that shadows one of our package names. We can force our private package to be chosen by increasing its version number so it is higher than the new package on pypi.org – or by renaming our package to something that hasn’t yet been taken on pypi.org.
This is obviously a hacky and fragile solution, but apparently pip behaves this way by design.
After the initial bucket setup, s3pypi has required no maintenance or administration. The above ticket suggests using devpi, but that seems like a very heavy solution that requires administration, monitoring, etc.
GitLab’s PyPI solution seems to work at the individual package level (meaning we’d have to list up to 100+ URLs – one for each package). This doesn’t seem practical, but maybe I’m misunderstanding something (I can see the package registry menu under our group as well, but the docs point to the "package-pypi" docs).
We can’t be the first small company that has faced this issue. Is there a better way than registering dummy versions of all our packages on pypi.org (with version=0.0.1, so the s3pypi version will be preferred)?
Answers:
It might not be the solution for you, but I’ll tell you what we do.
- We prefix our package names and use namespaces (e.g. company.product.tool).
- When we install our packages (including their in-house dependencies), we use a requirements.txt file that includes our PyPI URL. We run everything in containers, and we install all public dependencies while building the images.
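A minimal sketch of such a requirements.txt, assuming your private index lives at https://pypi.example.com/simple (the URL, package names, and versions here are illustrative):

```
--index-url https://pypi.example.com/simple
--extra-index-url https://pypi.org/simple
company.product.tool==1.4.0
requests==2.31.0
```

With namespaced names like company.product.tool, a public package is unlikely to ever shadow a private one, which is what makes this scheme workable.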
Your company could redirect all requests to PyPI through a service you control first (perhaps just via your build servers’ hosts file(s)).
This would potentially allow you to:
- prefer/override arbitrary packages with local ones
- detect such cases
- cache common/large upstream packages locally
- reject suspect/non-known versions/names of upstream packages
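A minimal sketch of such a filtering proxy, assuming the PEP 503 "simple" index layout and a local directory of private wheels (the directory name, port, and layout are all illustrative):

```python
# Sketch of an index proxy that prefers local packages over upstream PyPI.
# Assumptions: PEP 503 simple-index layout, private wheels stored under
# ./local-index/<project>/.
import os
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

LOCAL_ROOT = "local-index"
UPSTREAM = "https://pypi.org/simple"

def render_links(project, filenames):
    """Build a PEP 503-style HTML page linking the given local files."""
    links = "".join(f'<a href="/{project}/{f}">{f}</a><br>' for f in filenames)
    return f"<html><body>{links}</body></html>"

class IndexProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        project = self.path.strip("/").split("/")[0]
        local_dir = os.path.join(LOCAL_ROOT, project)
        if os.path.isdir(local_dir):
            # Local package exists: serve only the local files, so upstream
            # can never shadow it (you could also log/alert here).
            body = render_links(project, sorted(os.listdir(local_dir))).encode()
        else:
            # Unknown project: fall back to upstream (or reject, or cache).
            with urllib.request.urlopen(f"{UPSTREAM}/{project}/") as resp:
                body = resp.read()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

# To run: HTTPServer(("127.0.0.1", 8080), IndexProxy).serve_forever()
```

You would then point pip at this proxy via --index-url; caching and rejection policies are straightforward extensions of the else branch.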
We use VCS for this. I see you’ve explicitly ruled that out, but have you considered using branches to mark your latest stable builds in VCS?
If you aren’t interested in the latest version of master or the dev branch, but you are running test/QA against commits, then I would configure your test/QA suite to merge into a branch named something like "stable" or "pypi-stable"; your requirements then look like this:
pip install git+https://gitlab.com/yourorg/yourpackage.git@pypi-stable
The same configuration will work for setup.py requirements blocks (which allows for chained internal dependencies).
Am I missing something?
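For the setup.py case, the branch pin can be expressed as a PEP 508 direct reference. A small helper to build such entries (the package and repository names here are made up):

```python
def vcs_requirement(name, repo_url, branch):
    """Build a PEP 508 direct-reference requirement pinned to a VCS branch."""
    return f"{name} @ git+{repo_url}@{branch}"

# Example install_requires for setup.py (all names illustrative):
INSTALL_REQUIRES = [
    "requests>=2.0",  # public dependency, resolved from PyPI as usual
    vcs_requirement("internal-lib",
                    "https://gitlab.com/yourorg/internal-lib.git",
                    "pypi-stable"),
]
```

Because each internal package can declare its internal dependencies the same way, the chain resolves recursively against the stable branch.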
You could perhaps get the behavior you are looking for with a requirements.txt and two pip calls:
cat requirements.txt | xargs -n 1 pip install -i <your-s3pypi>
pip install -r requirements.txt
The first call tries to install what it can from your local repository, skipping a package if it fails. The second call installs everything that failed before from PyPI.
This works because --upgrade-strategy only-if-needed is the default (as of pip 10.x, I believe; don’t quote me on that). If you are using an older pip, you may have to specify it manually.
A limitation of this approach: if you expect a local package but it doesn’t exist, and a package with the same name exists on PyPI, you will get that package instead. Not sure if that is a concern.
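One way to guard against that limitation is to keep an explicit list of your private project names and split the requirements into two batches yourself, so a private name can never silently resolve from PyPI. A sketch (the private_projects set is something you would maintain):

```python
def split_requirements(requirements, private_projects):
    """Split requirement lines into (private, public) install batches.

    private_projects is a set of lowercase project names you own; anything
    not in it is assumed to come from the public index.
    """
    private, public = [], []
    for req in requirements:
        # Take the bare project name: drop markers, extras and version pins.
        name = req.split(";")[0]
        for sep in ("==", ">=", "<=", "~=", ">", "<", "["):
            name = name.split(sep)[0]
        name = name.strip().lower()
        (private if name in private_projects else public).append(req)
    return private, public
```

You would then pass the first batch to pip install -i <your-s3pypi> and fail the build loudly if anything in it cannot be found there, before installing the second batch from PyPI.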
The comment from @a_guest on my first answer got me thinking, and the "problem" is that pip doesn’t consider where a package originated when it sorts through candidates to satisfy requirements.
So here is a possible way to change this: monkey-patch pip and introduce a preference between indexes. (Note that this relies on pip internals, which can change between releases.)
from __future__ import absolute_import

import os
import sys

import pip
from pip._internal.index.package_finder import CandidateEvaluator


class MyCandidateEvaluator(CandidateEvaluator):

    def _sort_key(self, candidate):
        (has_allowed_hash, yank_value, binary_preference, version,
         build_tag, pri) = super()._sort_key(candidate)
        priority_index = "localhost"  # use your s3pypi host here
        if priority_index in candidate.link.comes_from:
            priority = 1
        else:
            priority = 0
        return (has_allowed_hash, yank_value, binary_preference, priority,
                version, build_tag, pri)


pip._internal.index.package_finder.CandidateEvaluator = MyCandidateEvaluator

# Remove '' and current working directory from the first entry of
# sys.path, if present, to avoid using the current directory in pip
# commands check, freeze, install, list and show, when invoked as
# python -m pip <command>
if sys.path[0] in ('', os.getcwd()):
    sys.path.pop(0)

# If we are running from a wheel, add the wheel to sys.path.
# This allows the usage python pip-*.whl/pip install pip-*.whl
if __package__ == '':
    # __file__ is pip-*.whl/pip/__main__.py
    # first dirname call strips off '/__main__.py', second strips off '/pip'
    # Resulting path is the name of the wheel itself
    # Add that to sys.path so we can import pip
    path = os.path.dirname(os.path.dirname(__file__))
    sys.path.insert(0, path)

from pip._internal.cli.main import main as _main  # isort:skip # noqa

if __name__ == '__main__':
    sys.exit(_main())
Set up a requirements.txt:
numpy
sampleproject
and call the above script with the same parameters you’d use for pip:
>python mypip.py install --no-cache --extra-index-url http://localhost:8000 -r requirements.txt
Looking in indexes: https://pypi.org/simple, http://localhost:8000
Collecting numpy
Downloading numpy-1.19.1-cp37-cp37m-win_amd64.whl (12.9 MB)
|████████████████████████████████| 12.9 MB 6.8 MB/s
Collecting sampleproject
Downloading http://localhost:8000/sampleproject/sampleproject-0.5.0-py2.py3-none-any.whl (4.3 kB)
Collecting peppercorn
Downloading peppercorn-0.6-py3-none-any.whl (4.8 kB)
Installing collected packages: numpy, peppercorn, sampleproject
Successfully installed numpy-1.19.1 peppercorn-0.6 sampleproject-0.5.0
Compare this to the default pip call:
>pip install --no-cache --extra-index-url http://localhost:8000 -r requirements.txt
Looking in indexes: https://pypi.org/simple, http://localhost:8000
Collecting numpy
Downloading numpy-1.19.1-cp37-cp37m-win_amd64.whl (12.9 MB)
|████████████████████████████████| 12.9 MB 6.4 MB/s
Collecting sampleproject
Downloading sampleproject-2.0.0-py3-none-any.whl (4.2 kB)
Collecting peppercorn
Downloading peppercorn-0.6-py3-none-any.whl (4.8 kB)
Installing collected packages: numpy, peppercorn, sampleproject
Successfully installed numpy-1.19.1 peppercorn-0.6 sampleproject-2.0.0
Notice that mypip prefers a package if it can be retrieved from localhost; of course, you can customize this behavior further.