constrained linear regression / quadratic programming python

Question:

I have a dataset like this:

import numpy as np

a = np.array([1.2, 2.3, 4.2])
b = np.array([1, 5, 6])
c = np.array([5.4, 6.2, 1.9])

m = np.vstack([a,b,c])
y = np.array([5.3, 0.9, 5.6])

and want to fit a constrained linear regression

y = b1*a + b2*b + b3*c

where all b’s are non-negative and sum to one: b1 + b2 + b3 = 1, b1, b2, b3 >= 0
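
In optimization terms, this is the constrained least-squares problem: minimize ||b1*a + b2*b + b3*c - y||^2 subject to b1 + b2 + b3 = 1 and b1, b2, b3 >= 0.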

A similar problem in R is specified here:

https://stats.stackexchange.com/questions/21565/how-do-i-fit-a-constrained-regression-in-r-so-that-coefficients-total-1

How can I do this in python?

Asked By: spore234


Answers:

EDIT:
These two approaches are very general and work for small to medium-scale instances. For a more efficient approach, check chthonicdaemon’s answer below (which uses customized preprocessing and scipy’s optimize.nnls).

Using scipy

Code

import numpy as np
from scipy.optimize import minimize

a = np.array([1.2, 2.3, 4.2])
b = np.array([1, 5, 6])
c = np.array([5.4, 6.2, 1.9])

m = np.vstack([a,b,c])
y = np.array([5.3, 0.9, 5.6])

def loss(x):
    # sum of squared residuals; the rows of m are a, b, c, so
    # np.dot(x, m) = x[0]*a + x[1]*b + x[2]*c
    return np.sum(np.square(np.dot(x, m) - y))

cons = ({'type': 'eq',
         'fun' : lambda x: np.sum(x) - 1.0})

x0 = np.zeros(m.shape[0])
res = minimize(loss, x0, method='SLSQP', constraints=cons,
               bounds=[(0, np.inf) for i in range(m.shape[0])], options={'disp': True})

print(res.x)
print(np.dot(res.x, m))  # fitted values; matches the orientation used in loss()
print(np.sum(np.square(np.dot(res.x, m) - y)))

Output

Optimization terminated successfully.    (Exit mode 0)
        Current function value: 18.817792344
        Iterations: 5
        Function evaluations: 26
        Gradient evaluations: 5
[ 0.7760881  0.         0.2239119]
[ 2.14042998  3.17325641  3.68500263]
18.817792344

Evaluation

  • It’s clear that the model’s capacity / complexity is not sufficient to obtain a good fit (the loss is high!)

Using general-purpose QP/SOCP-optimization modelled by cvxpy

Advantages:

  • cvxpy proves that the problem is convex
  • convergence is guaranteed for convex optimization problems (this might also hold for the approach above)
  • in general: better accuracy
  • in general: more robust against numerical instabilities (the solver only accepts SOCPs, not general non-convex models like those the SLSQP approach above tackles!)

Code

import numpy as np
from cvxpy import *  # old (pre-1.0) cvxpy API; see the updated answer below

a = np.array([1.2, 2.3, 4.2])
b = np.array([1, 5, 6])
c = np.array([5.4, 6.2, 1.9])

m = np.vstack([a,b,c])
y = np.array([5.3, 0.9, 5.6])

X = Variable(m.shape[0])
constraints = [X >= 0, sum_entries(X) == 1.0]

product = m.T * diag(X)                   # scales column i of m.T (regressor i) by X[i]
diff = sum_entries(product, axis=1) - y   # row sums yield m.T * X; residual vector
problem = Problem(Minimize(norm(diff)), constraints)
problem.solve(verbose=True)

print(problem.value)
print(X.value)

Output

ECOS 2.0.4 - (C) embotech GmbH, Zurich Switzerland, 2012-15. Web: www.embotech.com/ECOS

It     pcost       dcost      gap   pres   dres    k/t    mu     step   sigma     IR    |   BT
 0  +0.000e+00  -0.000e+00  +2e+01  5e-01  1e-01  1e+00  4e+00    ---    ---    1  1  - |  -  - 
 1  +2.451e+00  +2.539e+00  +4e+00  1e-01  2e-02  2e-01  8e-01  0.8419  4e-02   2  2  2 |  0  0
 2  +4.301e+00  +4.306e+00  +2e-01  5e-03  7e-04  1e-02  4e-02  0.9619  1e-02   2  2  2 |  0  0
 3  +4.333e+00  +4.334e+00  +2e-02  4e-04  6e-05  1e-03  4e-03  0.9326  2e-02   2  1  2 |  0  0
 4  +4.338e+00  +4.338e+00  +5e-04  1e-05  2e-06  4e-05  1e-04  0.9698  1e-04   2  1  1 |  0  0
 5  +4.338e+00  +4.338e+00  +3e-05  8e-07  1e-07  3e-06  7e-06  0.9402  7e-03   2  1  1 |  0  0
 6  +4.338e+00  +4.338e+00  +7e-07  2e-08  2e-09  6e-08  2e-07  0.9796  1e-03   2  1  1 |  0  0
 7  +4.338e+00  +4.338e+00  +1e-07  3e-09  4e-10  1e-08  3e-08  0.8458  2e-02   2  1  1 |  0  0
 8  +4.338e+00  +4.338e+00  +7e-09  2e-10  2e-11  9e-10  2e-09  0.9839  5e-02   1  1  1 |  0  0

OPTIMAL (within feastol=1.7e-10, reltol=1.5e-09, abstol=6.5e-09).
Runtime: 0.000555 seconds.

4.337947939  # needs to be squared to be compared to scipy's output!
             #  as we are using l2-norm (outer sqrt) instead of sum-of-squares
             #  which is nicely converted to SOCP-form and easier to
             #  tackle by SOCP-based solvers like ECOS
             #  -> does not change the solution-vector x, only the obj-value
[[  7.76094262e-01]
 [  7.39698388e-10]
 [  2.23905737e-01]]
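
Squaring the reported objective confirms the match with scipy’s result above: 4.337947939**2 ≈ 18.8178.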
Answered By: sascha

You can get a good solution to this with a little bit of math and scipy.optimize.nnls:

First we do the math:

If

y = b1*a + b2*b + b3*c and b1 + b2 + b3 = 1, then b3 = 1 - b1 - b2.

If we substitute b3 = 1 - b1 - b2 into the model and collect the terms multiplying c, we end up with

y - c = b1*(a - c) + b2*(b - c)

Now, we don’t have any equality constraints and nnls can solve directly:

import numpy as np
import scipy.optimize

# a, b, c and y as defined in the question
A = np.vstack([a - c, b - c]).T
(b1, b2), norm = scipy.optimize.nnls(A, y - c)
b3 = 1 - b1 - b2

This recovers the same solution as obtained in the other answer using cvxpy.

b1 = 0.77608809648662802
b2 = 0.0
b3 = 0.22391190351337198
norm = 4.337947941595865

This approach can be generalised to an arbitrary number of regressors as follows. Assume we have a matrix B constructed with a, b, c from the original question arranged as its columns; any additional regressors become additional columns.

Now, we can do

A = B[:, :-1] - B[:, -1:]
bb, norm = scipy.optimize.nnls(A, y - B[:, -1])
bi = np.append(bb, 1 - sum(bb))
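
For example, with the data from the question (a minimal sketch; B simply stacks a, b, c as columns):

import numpy as np
import scipy.optimize

a = np.array([1.2, 2.3, 4.2])
b = np.array([1, 5, 6])
c = np.array([5.4, 6.2, 1.9])
y = np.array([5.3, 0.9, 5.6])

B = np.vstack([a, b, c]).T                       # regressors as columns
A = B[:, :-1] - B[:, -1:]                        # eliminate the last coefficient
bb, norm = scipy.optimize.nnls(A, y - B[:, -1])
bi = np.append(bb, 1 - sum(bb))
print(bi)                                        # ~[0.7761  0.      0.2239]
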
Answered By: chthonicdaemon

One comment regarding sascha’s scipy implementation: be aware that with scipy’s minimize, the iterative nature of SLSQP may leave you with a solution that is slightly off unless you tune some other settings, namely the maximum number of iterations (maxiter) and the convergence tolerance (ftol), as detailed in the scipy docs here.

The default values are: maxiter=100 and ftol=1e-06.

Here is an example to illustrate, using matrix notation: first get rid of the constraints and bounds, and assume for simplicity that the intercept is 0. In that case, the coefficients of any multiple regression, as covered here on page 4, are given exactly by the normal equations, b = (X'X)^(-1) X'y:

def betas(y, x):
    # y and x are ndarrays -- the response vector & design matrix
    # normal equations: b = (X'X)^(-1) X'y
    return np.dot(np.linalg.inv(np.dot(x.T, x)), np.dot(x.T, y))

Now, given that the goal of least squares regression is to minimize the sum of squared residuals, take sascha’s loss function (re-written slightly):

def resids(b, y, x):
    resid = y - np.dot(x, b)
    return np.dot(resid.T, resid)

Given your actual Y and X vectors, you can plug the resulting “true” betas from the first function above into the second to get a “benchmark” for the minimum loss. Compare this benchmark to the .fun attribute of res (what scipy minimize reports): even tiny differences there can translate into meaningful changes in the resulting coefficients.
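
A minimal sketch of that comparison, using hypothetical random data (with the question’s square 3-by-3 system the unconstrained fit is exact, so the benchmark loss would be ~0):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))       # hypothetical design matrix: 50 observations, 3 regressors
y = rng.normal(size=50)            # hypothetical response vector

true_b = betas(y, x)               # closed-form OLS coefficients
benchmark = resids(true_b, y, x)   # the unconstrained minimum of the sum of squared residuals
print(benchmark)                   # compare against res.fun from an unconstrained SLSQP run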

So to make a long story short, it will sacrifice speed but improve accuracy to use something like

options={'maxiter' : 1000, 'ftol' : 1e-07}

within sascha’s code.
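
Concretely, the minimize call from sascha’s code would then become:

res = minimize(loss, x0, method='SLSQP', constraints=cons,
               bounds=[(0, np.inf) for i in range(m.shape[0])],
               options={'maxiter': 1000, 'ftol': 1e-07, 'disp': True})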

Answered By: Brad Solomon

Your problem is a linear least-squares problem; you can solve it directly with a quadratic programming solver, using the solve_ls function in qpsolvers. Here is a snippet adapted from this post on linear regression in Python:

import numpy as np
from qpsolvers import solve_ls

# m and y as defined in the question

# Objective (|| R x - s ||^2): || [a b c] x - y ||^2
R = m.T
s = y

# Linear constraint (A * x == b): sum(x) = 1
A = np.ones((1, 3))
b = np.array([1.0])

# Box constraint (lb <= x): x >= 0
lb = np.zeros(3)

x = solve_ls(R, s, A=A, b=b, lb=lb, solver="quadprog")

On my machine this code finds the solution x = array([0.7760881, 0.0, 0.2239119]). I’ve uploaded the full code to constrained_linear_regression.py; feel free to try it out.

Answered By: Tastalian

Thanks @sascha for the great answer! The cvxpy example there is pretty out of date, so I thought I would provide a slightly different version based on the current API, with some light editing for clarity:

import numpy as np
import cvxpy as cp

x1 = np.arange(1000)
x2 = np.random.normal(size=(1000,))
x = np.vstack([x1, x2]).T

y = np.random.random_sample(size=(1000,))

print("x shape", x.shape)
print("y shape", y.shape)

weights = cp.Variable(x.shape[1])

objective = cp.sum_squares(x @ weights - y)
minimize = cp.Minimize(objective)

constraints = [weights >= 0, cp.sum(weights) == 1.0]
problem = cp.Problem(minimize, constraints)

problem.solve(verbose=True)

print(problem.value)
print(weights.value)

I’ll add that cvxpy is actively maintained and the team there seems very responsive.

Answered By: abe