Parallel Pip install
Question:
Our Django project is getting huge. We have hundreds of apps and use a ton of third-party Python packages, many of which need C extensions compiled. Our deployments take a long time when we need to create a new virtual environment for major releases. With that said, I’m looking to speed things up, starting with pip. Does anyone know of a fork of pip that will install packages in parallel?
Steps I’ve taken so far:
- I’ve looked for a project that does just this, with little success. I did find this GitHub Gist: https://gist.github.com/1971720 but the results are almost exactly the same as our single-threaded friend.
- I then found the pip project on GitHub and started looking through the network of forks to see if I could find any commits that mentioned doing what I’m trying to do. It’s a mess in there. I will fork it and try to parallelize it myself if I have to, I just want to avoid spending time doing that.
- I saw a talk at DjangoCon 2011 from ep.io explaining their deployment stuff, and they mentioned parallelizing pip, shipping .so files instead of compiling C, and mirroring PyPI, but they didn’t touch on how they did it or what they used.
Answers:
Have you analyzed the deployment process to see where the time really goes? It surprises me that running multiple parallel pip processes does not speed it up much.
If the time goes to querying PyPI and finding the packages (in particular when you also download from GitHub and other sources), then it may be beneficial to set up your own PyPI. You can host PyPI yourself and add the following to your requirements.txt file (docs):
--extra-index-url YOUR_URL_HERE
or the following if you wish to replace the official PyPI altogether:
--index-url YOUR_URL_HERE
This may speed up download times as all packages are now found on a nearby machine.
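For example, the top of a requirements.txt pointing at an in-house index might look like this (the URL and the version pins are placeholders for illustration, not from the original question):

```
--extra-index-url https://pypi.internal.example/simple/
Django==1.3.1
psycopg2==2.4.5
```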
A lot of time also goes into compiling packages with C code, such as PIL. If this turns out to be the bottleneck, then it’s worth looking into compiling that code in multiple processes. You may even be able to share the compiled binaries between your machines (but many things would need to match, such as operating system, CPU word length, et cetera).
Will it help if you have your build system (e.g. Jenkins) build and install everything into a build-specific virtual environment directory? When the build succeeds, you make the virtual environment relocatable, tarball it and push the resulting tarball to your "released-tarballs" storage. At deploy time you grab the latest tarball, unpack it on the destination host, and it is ready to execute. So if it takes 2 seconds to download the tarball and 0.5 seconds to unpack it on the destination host, your deployment will take 2.5 seconds.
The advantage of this approach is that all package installations happen at build time, not at deploy time.
Caveat: the build-system worker that builds/compiles/installs things into a virtual env must use the same architecture as the target hardware. Also, your production box provisioning system will need to take care of the various C library dependencies that some Python packages have (e.g. PIL requires that libjpeg is installed before it can compile JPEG-related code, and things will break if libjpeg is not installed on the target box).
It works well for us.
Making a virtual env relocatable:
virtualenv --relocatable /build/output/dir/build-1123423
In this example, build-1123423 is a build-specific virtual env directory.
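The whole build-tarball-deploy pipeline above can be sketched roughly like this (the paths and build id are made up for illustration, and python3 -m venv stands in for the virtualenv of the era; the network-bound pip step is elided):

```shell
set -e
BUILD=build-1123423                  # hypothetical build-specific id
OUT=/tmp/build-out
mkdir -p "$OUT"
# Build step: create the environment and install everything into it.
python3 -m venv "$OUT/$BUILD"
# "$OUT/$BUILD/bin/pip" install -r requirements.txt   # real installs go here
# Package step: tarball the finished environment.
tar -C "$OUT" -czf "$OUT/$BUILD.tar.gz" "$BUILD"
# Deploy step: on the destination host, just fetch and unpack.
mkdir -p /tmp/deploy
tar -C /tmp/deploy -xzf "$OUT/$BUILD.tar.gz"
```

All the slow work happens once, at build time; the deploy itself is a single download-and-untar.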
Parallel pip installation
This example uses xargs to parallelize the build process by approximately 4x. You can increase the parallelization factor with --max-procs below (keep it approximately equal to your number of cores).
If you’re trying to, e.g., speed up an imaging process that you repeat over and over, it might be easier, and would certainly consume less bandwidth, to image directly from the result rather than re-run this each time, or to build your image using pip -t or virtualenv.
Download and install packages in parallel, four at a time:
xargs --max-args=1 --max-procs=4 sudo pip install < requires.txt
Note: xargs has different parameter names on different Linux distributions. Check your distribution’s man page for specifics.
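Before pointing this at pip, you can dry-run the xargs plumbing with a harmless command to confirm how arguments are split across processes (package names here are just sample input):

```shell
# Each input line becomes one argument to one process, up to four at once.
printf '%s\n' awscli bottle paste boto \
  | xargs --max-args=1 --max-procs=4 echo would-install \
  > /tmp/dry-run.out
# Parallel output order is nondeterministic, so sort before inspecting.
sort /tmp/dry-run.out
```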
Same thing inlined using a here-doc:
cat << EOF | xargs --max-args=1 --max-procs=4 sudo pip install
awscli
bottle
paste
boto
wheel
twine
markdown
python-slugify
python-bcrypt
arrow
redis
psutil
requests
requests-aws
EOF
Warning: there is a remote possibility that the speed of this method might confuse package manifests (depending on your distribution) if multiple pip processes try to install the same dependency at exactly the same time, but it’s very unlikely if you’re only doing 4 at a time. It can be fixed pretty easily with pip install --force-reinstall depname.
Inspired by Jamieson Becker’s answer, I modified an install script to do parallel pip installs and it seems like an improvement. My bash script now contains a snippet like this:
# each entry ends with a space; the list is word-split in the loop below
requirements=''\
'numpy '\
'scipy '\
'Pillow '\
'feedgenerator '\
'jinja2 '\
'docutils '\
'argparse '\
'pygments '\
'Typogrify '\
'Markdown '\
'jsonschema '\
'pyzmq '\
'terminado '\
'pandas '\
'spyder '\
'matplotlib '\
'statlab '\
'ipython[all]>=3 '\
'ipdb '\
'tornado>=4 '\
'simplepam '\
'sqlalchemy '\
'requests '\
'Flask '\
'autopep8 '\
'python-dateutil '\
'pylibmc '\
'newrelic '\
'markdown '\
'elasticsearch '\
'docker-py==1.1.0 '\
'pycurl==7.19.5 '\
'futures==2.2.0 '\
'pytz==2014.7 '
echo requirements=${requirements}
for i in ${requirements}; do ( pip install $i > /tmp/$i.out 2>&1 & ); done
I can at least look for problems manually.
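One caveat (my addition, not part of the original script): the loop above fires each install in the background and returns immediately, so failures only surface in the /tmp logs. A small sketch that also waits for every job and propagates failures:

```shell
# Tiny example subset; the real script uses the long list above.
requirements='pip'
pids=''
for i in ${requirements}; do
  # no subshell parentheses here, so that `wait` can still see the jobs
  python3 -m pip install $i > "/tmp/$i.out" 2>&1 &
  pids="$pids $!"
done
status=0
for p in $pids; do
  wait "$p" || status=1   # remember if any install failed
done
echo "parallel install finished with status $status"
```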
Building on Fatal’s answer, the following code does parallel Pip download, then quickly installs the packages.
First, we download packages in parallel into a distribution (“dist”) directory. This runs in parallel easily, with no conflicts. Each package name is printed before download, which helps with debugging. For extra help, change the -P9 to -P1 to download sequentially.
After the download, the next command tells Pip to install/upgrade the packages. Nothing is downloaded at this step; the files come from the fast local directory.
This worked with the Pip versions of the time (1.5 and later); note that more recent Pip releases removed install --download in favor of the separate pip download command.
To install only a subset of packages, replace the cat requirements.txt command with your custom command, e.g. egrep -v github requirements.txt:
cat requirements.txt | xargs -t -n1 -P9 pip install -q --download ./dist
pip install --no-index --find-links=./dist -r ./requirements.txt
I came across a similar issue and ended up with the following:
cat requirements.txt | sed -e '/^\s*#.*$/d' -e '/^\s*$/d' | xargs -n 1 python -m pip install
This reads requirements.txt line by line, skipping comments and blank lines, and runs pip on each entry (note that as written it runs sequentially; add e.g. -P4 to the xargs call to actually parallelize it). I can’t find where I originally got this from, so apologies for that, but I found some justification below:
- How sed works: https://howto.lintel.in/truncate-empty-lines-using-sed/
- Another similar answer but with git: https://stackoverflow.com/a/46494462/7127519
Hope this helps with alternatives. I posted this solution here: https://stackoverflow.com/a/63534476/7127519, so maybe there is some more detail there.
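A quick way to see what the sed filter passes through, using a small stand-in for requirements.txt (GNU sed assumed, since \s is a GNU extension):

```shell
# The filter drops the comment line and both blank/whitespace-only lines.
printf '# a comment\n\n   \nrequests\nFlask==2.0\n' \
  | sed -e '/^\s*#.*$/d' -e '/^\s*$/d'
# prints: requests, then Flask==2.0
```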
The answer at hand is to use, for example, Poetry if you can, which downloads/installs in parallel by default. But the question is about pip, so:
If you need to install dependencies from a requirements.txt whose entries have --hash parameters and Python specifiers (or just hashes), you cannot pass them to a normal pip install <package> invocation, as it does not support them. Your only choice is pip install -r.
So the question is how to parallel-install from a requirements file where each dependency has a hash and a Python specifier defined. Here is how such a requirements file looks:
swagger-ui-bundle==0.0.9; python_version >= "3.8" and python_version < "4.0" \
    --hash=sha256:cea116ed81147c345001027325c1ddc9ca78c1ee7319935c3c75d3669279d575 \
    --hash=sha256:b462aa1460261796ab78fd4663961a7f6f347ce01760f1303bbbdf630f11f516
typing-extensions==4.0.1; python_version >= "3.8" and python_version < "4.0" \
    --hash=sha256:7f001e5ac290a0c0401508864c7ec868be4e701886d5b573a9528ed3973d9d3b \
    --hash=sha256:4ca091dea149f945ec56afb48dae714f21e8692ef22a395223bcd328961b6a0e
unicon.plugins==21.12; python_version >= "3.8" and python_version < "4.0" \
    --hash=sha256:07f21f36155ee0ae9040d810065f27b43526185df80d3cc4e3ede597da0a1c72
This is what I came up with:
# create temp directory where we store split requirements
mkdir -p pip_install
# join lines that are continued with a trailing `\` and split each resulting
# line into a separate requirements file (one dependency == one file),
# saving the files in the previously created temp directory
sed ':x; /\\$/ { N; s/\\\n//; tx }' requirements.txt | split -l 1 - pip_install/x
# collect all file paths from the temp directory and pipe them to xargs and pip
find pip_install -type f | xargs -t -L 1 -P$(nproc) /usr/bin/python3 -m pip install -r
# remove temp dir
rm -rf pip_install
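The join-and-split step can be checked in isolation on a toy file before running it against the real requirements (file names and package entries below are made up):

```shell
mkdir -p /tmp/split_demo
# Two-entry sample: the first entry continues over two lines via a trailing backslash.
printf 'pkg-a==1.0 \\\n    --hash=sha256:aaa\npkg-b==2.0\n' > /tmp/demo-req.txt
# Join continued lines, then write one joined requirement per file.
sed ':x; /\\$/ { N; s/\\\n//; tx }' /tmp/demo-req.txt | split -l 1 - /tmp/split_demo/x
# Expect two files: xaa holds pkg-a plus its hash, xab holds pkg-b.
ls /tmp/split_demo
```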