How do I reduce a python (docker) image size using a multi-stage build?
Question:
I am looking for a way to create multistage builds with python and Dockerfile:
For example, using the following images:
1st image: install all compile-time requirements, and install all needed python modules
2nd image: copy all compiled/built packages from the first image to the second, without the compilers themselves (gcc, postgres-dev, python-dev, etc.)
The final objective is to have a smaller image, running python and the python packages that I need.
In short: how can I ‘wrap’ all the compiled modules (site-packages / external libs) that were created in the first image, and copy them in a ‘clean’ manner to the 2nd image?
Answers:
The docs on this explain exactly how to do this.
https://docs.docker.com/engine/userguide/eng-image/multistage-build/#before-multi-stage-builds
Basically you do exactly what you’ve said. The magic of the multi-stage build feature, though, is that you can do it all from one Dockerfile.
For example:
FROM golang:1.7.3
WORKDIR /go/src/github.com/alexellis/href-counter/
RUN go get -d -v golang.org/x/net/html
COPY app.go .
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o app .
FROM alpine:latest
RUN apk --no-cache add ca-certificates
WORKDIR /root/
COPY --from=0 /go/src/github.com/alexellis/href-counter/app .
CMD ["./app"]
This builds a Go binary in the first image, then the next image runs that binary. The first image has all the build tools, and the second is just a base Linux image that can run a binary.
OK, so my solution uses wheel. It lets us compile in the first image, create wheel files for all dependencies, and install them in the second image without installing the compilers.
FROM python:2.7-alpine as base
RUN mkdir /svc
COPY . /svc
WORKDIR /svc
RUN apk add --update \
    postgresql-dev \
    gcc \
    musl-dev \
    linux-headers
RUN pip install wheel && pip wheel . --wheel-dir=/svc/wheels
FROM python:2.7-alpine
COPY --from=base /svc /svc
WORKDIR /svc
RUN pip install --no-index --find-links=/svc/wheels -r requirements.txt
You can see my answer regarding this in the following blog post
https://www.blogfoobar.com/post/2018/02/10/python-and-docker-multistage-build
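A variation on the wheel recipe above, as a rough sketch rather than a tested setup: build only the wheels in the first stage, then bind-mount that directory into the install step so neither the sources of the build stage nor the wheel files end up in a layer of the final image. The assumptions here are mine, not the original answer's: a newer python:3.12-alpine base, BuildKit being enabled (needed for RUN --mount), dependencies listed in requirements.txt, and libpq being the only runtime library the compiled packages need.

```dockerfile
# Build stage: compilers and headers live only here.
FROM python:3.12-alpine AS build
RUN apk add --no-cache \
    postgresql-dev \
    gcc \
    musl-dev \
    linux-headers
WORKDIR /src
COPY requirements.txt .
# Build a wheel for every dependency into /wheels.
RUN pip wheel --wheel-dir=/wheels -r requirements.txt

# Final stage: no compilers, only the runtime C library (assumed: libpq).
FROM python:3.12-alpine
RUN apk add --no-cache libpq
WORKDIR /svc
COPY requirements.txt .
# Mount the wheels from the build stage just for this step, so they
# are installed without being stored in the final image.
RUN --mount=type=bind,from=build,source=/wheels,target=/wheels \
    pip install --no-index --find-links=/wheels -r requirements.txt
COPY . .
```

If BuildKit is not available, a plain COPY --from=build /wheels /wheels followed by the same pip install should also work, at the cost of keeping the wheel files in a layer.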
I recommend the approach detailed in this article (section 2). He uses virtualenv so pip install stores all the python code, binaries, etc. under one folder instead of spread out all over the file system. Then it’s easy to copy just that one folder to the final “production” image. In summary:
Compile image
- Create a virtualenv at a path of your choosing.
- Prepend that virtualenv's bin directory to PATH with a Docker ENV instruction. This is all virtualenv needs to function for all later RUN and CMD instructions.
- Install system dev packages and run pip install xyz as usual.
Production image
- Copy the virtualenv folder from the Compile image.
- Prepend the virtualenv's bin directory to PATH, again with ENV.
This is a place where using a Python virtual environment inside Docker can be useful. Copying a virtual environment normally is tricky since it needs to be the exact same filesystem path on the exact same Python build, but in Docker you can guarantee that.
(This is the same basic recipe @mpoisot describes in their answer and it appears in other SO answers as well.)
Say you’re installing the psycopg PostgreSQL client library. Building its C extension requires the Python C development library plus the PostgreSQL C client library headers, but to run it you only need the PostgreSQL C runtime library. So here you can use a multi-stage build: the first stage installs the virtual environment using the full C toolchain, and the final stage copies the built virtual environment but only includes the minimum required libraries.
A typical Dockerfile could look like:
# Name the single Python image we're using everywhere.
ARG python=python:3.10-slim
# Build stage:
FROM ${python} AS build
# Install a full C toolchain and C build-time dependencies for
# everything we're going to need.
RUN apt-get update \
 && DEBIAN_FRONTEND=noninteractive \
    apt-get install --no-install-recommends --assume-yes \
      build-essential \
      libpq-dev
# Create the virtual environment.
RUN python3 -m venv /venv
ENV PATH=/venv/bin:$PATH
# Install the Python library dependencies, including those with
# C extensions. They'll get installed into the virtual environment.
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
# Final stage:
FROM ${python}
# Install the runtime-only C library dependencies we need.
RUN apt-get update \
 && DEBIAN_FRONTEND=noninteractive \
    apt-get install --no-install-recommends --assume-yes \
      libpq5
# Copy the virtual environment from the first stage.
COPY --from=build /venv /venv
ENV PATH=/venv/bin:$PATH
# Copy the application in.
COPY . .
CMD ["./main.py"]
If your application uses a Python entry point script then you can do everything in the first stage: RUN pip install . will copy the application into the virtual environment and create a wrapper script in /venv/bin for you. In the final stage you don’t need to COPY the application again. Set the CMD to run the wrapper script out of the virtual environment, which is already at the front of the $PATH.
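The entry point variant just described might look roughly like this. It is only a sketch: it assumes the project has a pyproject.toml or setup.py that declares its dependencies and a console script, and the script name myapp is a made-up placeholder.

```dockerfile
ARG python=python:3.10-slim

# Build stage: toolchain, virtual environment, and the application itself.
FROM ${python} AS build
RUN apt-get update \
 && DEBIAN_FRONTEND=noninteractive \
    apt-get install --no-install-recommends --assume-yes \
      build-essential \
      libpq-dev
RUN python3 -m venv /venv
ENV PATH=/venv/bin:$PATH
WORKDIR /app
# Copy the whole project, including its packaging metadata.
COPY . .
# Installs the dependencies and the application into /venv and
# writes the entry point wrapper script into /venv/bin.
RUN pip install .

# Final stage: runtime library plus the finished virtual environment.
FROM ${python}
RUN apt-get update \
 && DEBIAN_FRONTEND=noninteractive \
    apt-get install --no-install-recommends --assume-yes \
      libpq5
COPY --from=build /venv /venv
ENV PATH=/venv/bin:$PATH
# "myapp" is a hypothetical console script name; use your own.
CMD ["myapp"]
```

One trade-off to be aware of: because the whole source tree is copied before pip install, any source change invalidates the cached dependency install layer.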
Again, note that this approach only works because it is the same Python base image in both stages, and because the virtual environment is on the exact same path. If it is a different Python or a different container path the transplanted virtual environment may not work correctly.
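If you want the build itself to catch a broken transplant, one option (my suggestion, not part of the recipe above) is a small smoke test in the final stage. The sketch below is a stripped-down Dockerfile with an empty virtual environment; it checks that the copied interpreter reports /venv as its prefix and that the pip wrapper script still runs.

```dockerfile
ARG python=python:3.10-slim

FROM ${python} AS build
RUN python3 -m venv /venv

FROM ${python}
COPY --from=build /venv /venv
ENV PATH=/venv/bin:$PATH
# Fail the build if "python" does not resolve to the copied virtual
# environment, or if a wrapper script in /venv/bin cannot start.
RUN test "$(python -c 'import sys; print(sys.prefix)')" = /venv \
 && pip --version
```

In a real image you would probably also import one of your compiled dependencies in that RUN step, so a missing runtime C library shows up at build time instead of at container start.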