In Scipy how and why does curve_fit calculate the covariance of the parameter estimates

Question:

I have been using scipy.optimize.leastsq to fit some data. I would like to get some confidence intervals on these estimates, so I looked into the cov_x output, but the documentation is very unclear about what this is and how to get the covariance matrix of my parameters from it.

First of all, it says that it is a Jacobian, but in the Notes it also says that “cov_x is a Jacobian approximation to the Hessian”, so it is not actually a Jacobian but a Hessian approximated using the Jacobian. Which of these statements is correct?

Secondly this sentence to me is confusing:

This matrix must be multiplied by the residual variance to get the covariance of the parameter estimates – see curve_fit.

So I went and looked at the source code for curve_fit, where they do:

s_sq = (func(popt, *args)**2).sum()/(len(ydata)-len(p0))
pcov = pcov * s_sq

which corresponds to multiplying cov_x by s_sq, but I cannot find this equation in any reference. Can someone explain why this equation is correct?
My intuition tells me that it should be the other way around, since cov_x is supposed to be a derivative (Jacobian or Hessian), so I was thinking:
cov_x * covariance(parameters) = sum of errors (residuals), where covariance(parameters) is the thing I want.

How do I connect what curve_fit is doing with what I see at, e.g., Wikipedia:
http://en.wikipedia.org/wiki/Propagation_of_uncertainty#Non-linear_combinations

Asked By: HansHarhoff


Answers:

OK, I think I found the answer. First the solution:
cov_x * s_sq is simply the covariance of the parameters, which is what you want. Taking the square root of the diagonal elements will give you the standard deviation of each parameter (but be careful about the covariances!).

Residual variance = reduced chi square = s_sq = sum[(f(x) − y)^2] / (N − n), where N is the number of data points and n is the number of fitting parameters; this is the reduced chi-square statistic.
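
To make this concrete, here is a minimal sketch of this recipe applied to leastsq output; the straight-line model, the synthetic data and the residuals function are assumptions of mine, not part of the question:

import numpy as np
from scipy.optimize import leastsq

# hypothetical data from a straight-line model y = a*x + b plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

def residuals(p, x, y):
    return y - (p[0] * x + p[1])

# full_output=True is needed to get cov_x back from leastsq
popt, cov_x, infodict, mesg, ier = leastsq(residuals, [1.0, 0.0],
                                           args=(x, y), full_output=True)

N, n = len(y), len(popt)
s_sq = (residuals(popt, x, y) ** 2).sum() / (N - n)  # residual variance
pcov = cov_x * s_sq                                  # covariance of the parameters
perr = np.sqrt(np.diag(pcov))                        # 1-sigma errors on a and b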

The reason for my confusion is that cov_x as given by leastsq is not actually what is called cov(x) in other places; rather, it is a reduced cov(x), or fractional cov(x). The reason it does not show up in any of the other references is that it is a simple rescaling which is useful in numerical computations but not relevant for a textbook.

About Hessian versus Jacobian: the documentation is poorly worded. It is the Hessian that is involved in both cases, as it must be, since the gradient (the Jacobian of the scalar cost function) is zero at a minimum. What they mean is that an approximation to the Jacobian of the residuals, J, is used to build the Gauss–Newton approximation J^T J to the Hessian.
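
As a sanity check on this reading, one can compare cov_x with the inverse of J^T J, where J is the Jacobian of the residual vector at the solution; the model and data below are again made up, and the Jacobian is approximated by crude forward differences:

import numpy as np
from scipy.optimize import leastsq

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

def residuals(p):
    return y - (p[0] * x + p[1])

popt, cov_x, infodict, mesg, ier = leastsq(residuals, [1.0, 0.0], full_output=True)

# forward-difference Jacobian of the residuals at the optimum
eps = 1e-7
J = np.empty((x.size, popt.size))
for j in range(popt.size):
    step = np.zeros(popt.size)
    step[j] = eps
    J[:, j] = (residuals(popt + step) - residuals(popt)) / eps

# The Hessian of the sum of squares is approximately 2 * J.T @ J (Gauss-Newton),
# and cov_x is the inverse of J.T @ J:
print(np.allclose(cov_x, np.linalg.inv(J.T @ J), rtol=1e-3))   # True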

A further note: it seems that the curve_fit result does not actually account for the absolute size of the errors, but only takes into account the relative sizes of the sigmas provided. This means that the pcov returned doesn’t change even if the errorbars change by a factor of a million. This is of course not right, but it seems to be standard practice; e.g. Matlab does the same thing in its Curve Fitting Toolbox. The correct procedure is described here: https://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)#Parameter_errors_and_correlation
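
The errorbar-scaling claim is easy to verify with a small sketch (hypothetical model and data); in current SciPy, where absolute_sigma defaults to False, multiplying every sigma by the same constant leaves pcov unchanged:

import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

def f(x, a, b):
    return a * x + b

sigma = np.full(x.size, 0.5)

_, pcov_small = curve_fit(f, x, y, sigma=sigma)
_, pcov_huge = curve_fit(f, x, y, sigma=sigma * 1e6)  # errorbars a million times larger

print(np.allclose(pcov_small, pcov_huge))  # True: only the relative sizes matter here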

It seems fairly straightforward to apply that correct procedure once the optimum has been found, at least for linear least squares; a sketch follows below.
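
For a linear model with known per-point errorbars, that procedure boils down to the weighted normal equations; this is a sketch under assumed data (the design matrix, errorbars and model are mine):

import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 50)
sigma = np.full(x.size, 0.5)                      # known 1-sigma errorbars
y = 2.0 * x + 1.0 + rng.normal(scale=sigma)

X = np.column_stack([x, np.ones_like(x)])         # design matrix for y = a*x + b
W = np.diag(1.0 / sigma**2)                       # weights 1 / sigma_i^2

beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # weighted least-squares estimate
cov_beta = np.linalg.inv(X.T @ W @ X)             # parameter covariance from the known
                                                  # errorbars, with no rescaling by s_sq
perr = np.sqrt(np.diag(cov_beta))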

Answered By: HansHarhoff

I found this solution during my search for a similar question, and I have only a small improvement on HansHarhoff’s answer. The full output from leastsq (full_output=True) provides a return value infodict, which contains infodict['fvec'] = f(x) - y, i.e. the residuals at the solution. Thus, to compute the reduced chi square (in the above notation):

s_sq = (infodict['fvec']**2).sum() / (N - n)

BTW. Thanks HansHarhoff for doing most of the heavy lifting to solve this.

Answered By: Jim Parker

Math

First we start with linear regression. In many statistical problems, we assume the variables have some underlying distributions with some unknown parameters, and we estimate these parameters. In linear regression, we assume the dependent variables y_i have a linear relationship with the independent variables x_{ij}:

y_i = x_{i1} β_1 + … + x_{ip} β_p + σ ε_i,    i = 1, …, n,

where the ε_i are independent standard normal variables, the β_j are p unknown parameters, and σ is also unknown. We can write this in matrix form:

Y = X β + σε,

where Y, β, and ε are column vectors. To find the best β, we minimize the sum of squares

S = (Y − X β)^T (Y − X β).

I just write out the solution, which is

β̂ = (X^T X)^{-1} X^T Y.

If we see Y as the specific observed data, β̂ is the estimate of β under that observation. On the other hand, if we see Y as a random variable, the estimator β̂ becomes a random variable too. In this way, we can see what the covariance of β̂ is.

Because Y has a multivariate normal distribution and β̂ is a linear transformation of Y, β̂ also has a multivariate normal distribution. The covariance matrix of β̂ is

Cov(β̂) = (X^T X)^{-1} X^T Cov(Y) ((X^T X)^{-1} X^T)^T = (X^T X)^{-1} σ^2.

But here σ is unknown, so we also need to estimate it. If we let

Q = (Y − X β̂)^T (Y − X β̂),

it can be proved that Q / σ^2 has the chi-square distribution with n − p degrees of freedom (moreover, Q is independent of β̂). This makes

σ̂^2 = Q / (n − p)

an unbiased estimator of σ^2. So the final estimator of Cov(β̂) is

(X^T X)^{-1} Q / (n − p).
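
Putting these formulas directly into code, as a sketch with made-up data (the design matrix, the true parameters and the noise level are assumptions):

import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 2
x = np.linspace(0, 10, n)
X = np.column_stack([x, np.ones(n)])                  # design matrix with p = 2 columns
Y = X @ np.array([2.0, 1.0]) + 0.5 * rng.standard_normal(n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)          # (X^T X)^{-1} X^T Y
Q = ((Y - X @ beta_hat) ** 2).sum()                   # residual sum of squares
sigma2_hat = Q / (n - p)                              # unbiased estimate of sigma^2
cov_beta_hat = np.linalg.inv(X.T @ X) * sigma2_hat    # (X^T X)^{-1} Q / (n - p)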

SciPy API

curve_fit is the most convenient. Its second return value pcov is just the estimate of the covariance of β̂, that is, the final result (X^T X)^{-1} Q / (n − p) above.

In leastsq, the second return value cov_x is (X^T X)^{-1}. From the expression for S, we see that X^T X is the Hessian of S (half of the Hessian, to be precise), which is why the documentation says cov_x is the inverse of the Hessian. To get the covariance, you need to multiply cov_x by Q / (n − p).
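
Here is a sketch tying both APIs back to these formulas for a straight-line model with made-up data:

import numpy as np
from scipy.optimize import curve_fit, leastsq

rng = np.random.default_rng(5)
n, p = 100, 2
x = np.linspace(0, 10, n)
X = np.column_stack([x, np.ones(n)])
Y = X @ np.array([2.0, 1.0]) + 0.5 * rng.standard_normal(n)

def f(x, a, b):
    return a * x + b

def residuals(params):
    return Y - f(x, *params)

popt, cov_x, infodict, mesg, ier = leastsq(residuals, [1.0, 0.0], full_output=True)
Q = (infodict['fvec'] ** 2).sum()

popt_cf, pcov_cf = curve_fit(f, x, Y, p0=[1.0, 0.0])

print(np.allclose(cov_x, np.linalg.inv(X.T @ X), rtol=1e-4))   # cov_x is (X^T X)^{-1}
print(np.allclose(pcov_cf, cov_x * Q / (n - p), rtol=1e-4))    # pcov is the rescaled version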

Non-Linear Regression

In non-linear regression, the y_i depend on the parameters non-linearly:

y_i = f(x_i, β_1, …, β_p) + σ ε_i.

We can calculate the partial derivatives of f with respect to the β_j, so the problem becomes approximately linear. Then the calculation is basically the same as in linear regression, except that we need to approximate the minimum iteratively. In practice, the algorithm can be a more sophisticated one such as the Levenberg–Marquardt algorithm, which is the default of curve_fit.
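
For example, here is a sketch of a genuinely non-linear fit (the exponential model, starting values and data are assumptions of mine), with the standard errors read off from pcov:

import numpy as np
from scipy.optimize import curve_fit

def f(x, a, b, c):
    return a * np.exp(-b * x) + c

rng = np.random.default_rng(6)
x = np.linspace(0, 4, 60)
y = f(x, 2.5, 1.3, 0.5) + rng.normal(scale=0.05, size=x.size)

# For an unconstrained problem, curve_fit uses Levenberg-Marquardt (via leastsq);
# p0 is the starting point of the iterations.
popt, pcov = curve_fit(f, x, y, p0=[1.0, 1.0, 1.0])
perr = np.sqrt(np.diag(pcov))   # 1-sigma uncertainties on a, b, c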

More About Providing Sigma

This section is about the sigma and absolute_sigma parameters of curve_fit. For basic usage of curve_fit, when you have no prior knowledge about the covariance of Y, you can ignore this section.

Absolute Sigma

In the linear regression above, the standard deviation of y_i is σ and is unknown. If you do know it, you can provide it to curve_fit through the sigma parameter and set absolute_sigma=True.

Suppose the sigma you provide corresponds to a covariance matrix Σ, i.e.

Y ~ N(X β, Σ).

That is, Y has a multivariate normal distribution with mean X β and covariance Σ. We want to maximize the likelihood of Y. From the probability density function of Y, this is equivalent to minimizing

S = (Y − X β)^T Σ^{-1} (Y − X β).

The solution is

β̂ = (X^T Σ^{-1} X)^{-1} X^T Σ^{-1} Y.

And

Cov(β̂) = (X^T Σ^{-1} X)^{-1}.

The β̂ and Cov(β̂) above are the return values of curve_fit with absolute_sigma=True.
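
A sketch checking this for a straight-line model with a made-up diagonal Σ built from per-point errorbars:

import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(7)
n = 80
x = np.linspace(0, 10, n)
X = np.column_stack([x, np.ones(n)])
sigma = np.linspace(0.2, 1.0, n)                  # known, heteroscedastic errorbars
Y = X @ np.array([2.0, 1.0]) + rng.normal(scale=sigma)

def f(x, a, b):
    return a * x + b

popt, pcov = curve_fit(f, x, Y, sigma=sigma, absolute_sigma=True)

Sigma_inv = np.diag(1.0 / sigma**2)               # Σ^{-1} for this diagonal case
cov_formula = np.linalg.inv(X.T @ Sigma_inv @ X)  # (X^T Σ^{-1} X)^{-1}

print(np.allclose(pcov, cov_formula, rtol=1e-4))  # True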

Relative Sigma

In some cases, you don’t know the exact variance of the y_i, but you know the relative relationship between the different y_i; for example, the variance of y_2 is 4 times the variance of y_1. Then you can pass sigma and set absolute_sigma=False.

This time

Y ~ N(X β, σ^2 Σ)

with a known matrix Σ provided and an unknown number σ. The objective function to minimize is the same as in the absolute-sigma case, since σ is a constant factor, and thus the estimator β̂ is the same. But the covariance

Cov(β̂) = (X^T Σ^{-1} X)^{-1} σ^2,

has the unknown σ in it. To estimate σ, let

Q = (Y − X β̂)^T Σ^{-1} (Y − X β̂).

Again, Q / σ^2 has the chi-square distribution with n − p degrees of freedom.

The estimate of Cov(β̂) is

(X^T Σ^{-1} X)^{-1} Q / (n − p).

And this is the second return value of curve_fit with absolute_sigma=False.
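
And a companion sketch for the relative-sigma case (same kind of made-up set-up), comparing curve_fit’s pcov with (X^T Σ^{-1} X)^{-1} Q / (n − p):

import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(8)
n, p = 80, 2
x = np.linspace(0, 10, n)
X = np.column_stack([x, np.ones(n)])
rel_sigma = np.linspace(0.2, 1.0, n)                               # only the ratios matter here
Y = X @ np.array([2.0, 1.0]) + rng.normal(scale=0.7 * rel_sigma)   # unknown overall scale

def f(x, a, b):
    return a * x + b

popt, pcov = curve_fit(f, x, Y, sigma=rel_sigma, absolute_sigma=False)

Sigma_inv = np.diag(1.0 / rel_sigma**2)
resid = Y - f(x, *popt)
Q = resid @ Sigma_inv @ resid                     # (Y - X beta_hat)^T Sigma^{-1} (Y - X beta_hat)
cov_formula = np.linalg.inv(X.T @ Sigma_inv @ X) * Q / (n - p)

print(np.allclose(pcov, cov_formula, rtol=1e-4))  # True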

Answered By: Cosyn