How do I reduce a python (docker) image size using a multi-stage build?

Question:

I am looking for a way to create multi-stage builds with Python and a Dockerfile:

For example, using the following images:

1st image: install all compile-time requirements, and install all needed python modules

2nd image: copy all compiled/built packages from the first image to the second, without the compilers themselves (gcc, postgres-dev, python-dev, etc.)

The final objective is to have a smaller image, running python and the python packages that I need.

In short: how can I ‘wrap’ all the compiled modules (site-packages / external libs) that were created in the first image and copy them, in a ‘clean’ manner, to the 2nd image?

Asked By: gCoh


Answers:

The docs on this explain exactly how to do this.

https://docs.docker.com/engine/userguide/eng-image/multistage-build/#before-multi-stage-builds

Basically you do exactly what you’ve said. The magic of the multi-stage build feature, though, is that you can do all of this from one Dockerfile.

ie:

FROM golang:1.7.3
WORKDIR /go/src/github.com/alexellis/href-counter/
RUN go get -d -v golang.org/x/net/html  
COPY app.go .
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o app .

FROM alpine:latest  
RUN apk --no-cache add ca-certificates
WORKDIR /root/
COPY --from=0 /go/src/github.com/alexellis/href-counter/app .
CMD ["./app"]  

This builds a Go binary, then the next image runs the binary. The first image has all the build tools, and the second is just a base Linux image that can run a binary.
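The same idea carries over to Python. As a rough sketch only (the image tag, requirements.txt, and main.py are illustrative, not from the original answer), the first stage can install everything under an isolated prefix while the compilers are available, and the second stage copies just that prefix:

FROM python:3.10-slim AS build
# Compile-time tools live only in this stage.
RUN apt-get update && apt-get install -y --no-install-recommends build-essential
COPY requirements.txt .
# Install the dependencies under an isolated prefix so they are easy to copy out.
RUN pip install --prefix=/install -r requirements.txt

FROM python:3.10-slim
# Copy only the installed packages; the compilers stay behind in the build stage.
COPY --from=build /install /usr/local
WORKDIR /app
COPY . .
CMD ["python", "main.py"]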

Answered By: Pandelis

OK, so my solution is to use wheel: it lets us compile on the first image, create wheel files for all dependencies, and install them in the second image without installing the compilers.

FROM python:2.7-alpine as base

RUN mkdir /svc
COPY . /svc
WORKDIR /svc

RUN apk add --update \
    postgresql-dev \
    gcc \
    musl-dev \
    linux-headers

RUN pip install wheel && pip wheel . --wheel-dir=/svc/wheels

FROM python:2.7-alpine

COPY --from=base /svc /svc

WORKDIR /svc

RUN pip install --no-index --find-links=/svc/wheels -r requirements.txt
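One caveat: if any of the wheels are linked against C libraries at runtime (psycopg2 against libpq, for example, which the postgresql-dev build dependency above suggests), the second stage still needs the runtime (non-dev) library plus a command to start the service. A sketch of the extra lines, where the package name and start command are assumptions rather than part of the original answer:

RUN apk add --no-cache postgresql-libs   # runtime client library; the package name varies by Alpine release (libpq on newer ones)
CMD ["python", "app.py"]                 # illustrative start command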

You can see my answer regarding this in the following blog post

https://www.blogfoobar.com/post/2018/02/10/python-and-docker-multistage-build

Answered By: gCoh

I recommend the approach detailed in this article (section 2). He uses virtualenv, so pip install stores all the Python code, binaries, etc. under one folder instead of spreading them out all over the file system. Then it’s easy to copy just that one folder to the final “production” image. In summary (a minimal sketch follows the list):

Compile image

  • Create a virtualenv at some path of your choosing.
  • Prepend the virtualenv’s bin directory to PATH with a Docker ENV instruction. This is all virtualenv needs to function for all future docker RUN and CMD actions.
  • Install system dev packages and pip install xyz as usual.

Production image

  • Copy the virtualenv folder from the Compile Image.
  • Prepend the virtualenv’s bin directory to PATH again (another ENV instruction).
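
A minimal sketch of those steps (the image tag, venv path, and requirements file are illustrative, not taken from the article):

# Compile image
FROM python:3.10-slim AS compile
RUN python3 -m venv /opt/venv
# Prepending the venv's bin directory is the in-Docker equivalent of "activate".
ENV PATH=/opt/venv/bin:$PATH
RUN apt-get update && apt-get install -y --no-install-recommends build-essential
COPY requirements.txt .
RUN pip install -r requirements.txt

# Production image
FROM python:3.10-slim
COPY --from=compile /opt/venv /opt/venv
ENV PATH=/opt/venv/bin:$PATH
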
Answered By: mpoisot

This is a place where using a Python virtual environment inside Docker can be useful. Copying a virtual environment normally is tricky since it needs to be the exact same filesystem path on the exact same Python build, but in Docker you can guarantee that.

(This is the same basic recipe @mpoisot describes in their answer and it appears in other SO answers as well.)

Say you’re installing the psycopg PostgreSQL client library. The extended form of this requires the Python C development library plus the PostgreSQL C client library headers; but to run it you only need the PostgreSQL C runtime library. So here you can use a multi-stage build: the first stage installs the virtual environment using the full C toolchain, and the final stage copies the built virtual environment but only includes the minimum required libraries.

A typical Dockerfile could look like:

# Name the single Python image we're using everywhere.
ARG python=python:3.10-slim

# Build stage:
FROM ${python} AS build

# Install a full C toolchain and C build-time dependencies for
# everything we're going to need.
RUN apt-get update \
 && DEBIAN_FRONTEND=noninteractive \
    apt-get install --no-install-recommends --assume-yes \
      build-essential \
      libpq-dev

# Create the virtual environment.
RUN python3 -m venv /venv
ENV PATH=/venv/bin:$PATH

# Install the Python library dependencies, including those with
# C extensions.  They'll get installed into the virtual environment.
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

# Final stage:
FROM ${python}

# Install the runtime-only C library dependencies we need.
RUN apt-get update \
 && DEBIAN_FRONTEND=noninteractive \
    apt-get install --no-install-recommends --assume-yes \
      libpq5

# Copy the virtual environment from the first stage.
COPY --from=build /venv /venv
ENV PATH=/venv/bin:$PATH

# Copy the application in.
COPY . .
CMD ["./main.py"]

If your application uses a Python entry point script then you can do everything in the first stage: RUN pip install . will copy the application into the virtual environment and create a wrapper script in /venv/bin for you. In the final stage you don’t need to COPY the application again. Set the CMD to run the wrapper script out of the virtual environment, which is already at the front of the $PATH.
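As a sketch of that variation (the console-script name myapp is hypothetical, not something defined in this answer), only the tails of the two stages change:

# Build stage additions: install the application itself into the venv,
# which also generates its console-script wrapper under /venv/bin.
COPY . .
RUN pip install .

# Final stage: no COPY of the application is needed any more.
COPY --from=build /venv /venv
ENV PATH=/venv/bin:$PATH
CMD ["myapp"]   # the hypothetical entry point wrapper pip created in /venv/bin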

Again, note that this approach only works because it is the same Python base image in both stages, and because the virtual environment is on the exact same path. If it is a different Python or a different container path the transplanted virtual environment may not work correctly.

Answered By: David Maze