scipy curve_fit incorrect for large X values

Question:

To determine trends over time, I use scipy curve_fit with X values from time.time(), for example 1663847528.7147126 (1.6 billion).
Fitting a linear function sometimes produces erroneous results, and providing approximate initial p0 values doesn't help. I found the magnitude of X to be a crucial factor in this error, and I wonder why.

Here is a simple snippet that shows working and non-working X offset:

import scipy.optimize

def fit_func(x, a, b):
    return a + b * x

y = list(range(5))

x = [1e8 + a for a in range(5)]
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0]))
# Result is correct:
#   (array([-1.e+08,  1.e+00]), array([[ 0., -0.],
#          [-0.,  0.]]))

x = [1e9 + a for a in range(5)]
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0.0]))
# Result is not correct:
#   OptimizeWarning: Covariance of the parameters could not be estimated
#   warnings.warn('Covariance of the parameters could not be estimated',
#   (array([-4.53788811e+08,  4.53788812e-01]), array([[inf, inf],
#          [inf, inf]]))

# Almost perfect p0 for b removes the warning, but curve_fit still doesn't work:
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0.99]))
# Result is not correct:
#   (array([-7.60846335e+10,  7.60846334e+01]), array([[-1.97051972e+19,  1.97051970e+10],
#          [ 1.97051970e+10, -1.97051968e+01]]))
   
# ...but perfect p0 works
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 1.0]))
#(array([-1.e+09,  1.e+00]), array([[inf, inf],
#       [inf, inf]]))

As a side question, perhaps there’s a more efficient method for a linear fit? Sometimes I want to find the second-order polynomial fit, though.

Tested with Python 3.9.6 and SciPy 1.7.1 under Windows 10.

Asked By: Anders Petersson


Answers:

If you just need to compute a linear fit, curve_fit is not necessary; I would just use the linregress function from SciPy instead:

>>> from scipy import stats

>>> y = list(range(5))

>>> x = [1e8 + a for a in range(5)]
>>> stats.linregress(x, y)
LinregressResult(slope=1.0, intercept=-100000000.0, rvalue=1.0, pvalue=1.2004217548761408e-30, stderr=0.0, intercept_stderr=0.0)

>>> x2 = [1e9 + a for a in range(5)]
>>> stats.linregress(x2, y)
LinregressResult(slope=1.0, intercept=-1000000000.0, rvalue=1.0, pvalue=1.2004217548761408e-30, stderr=0.0, intercept_stderr=0.0)

In general, if you need a polynomial fit I would use NumPy polyfit.
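For example, a minimal sketch with numpy.polyfit (centring x first is my own addition; with offsets this large the fit can otherwise be poorly conditioned):

import numpy as np

x = np.array([1e9 + a for a in range(5)])
y = np.arange(5.0)

# Shift x so the Vandermonde matrix stays well conditioned for large offsets.
x0 = x.mean()
coeffs = np.polyfit(x - x0, y, deg=2)   # highest power first
print(coeffs)                           # ~[0., 1., 2.]: no quadratic term, slope 1
print(np.polyval(coeffs, x - x0))       # ~[0., 1., 2., 3., 4.]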

Answered By: blunova

Root cause

You are facing two problems:

  • Fitting procedures are scale sensitive. The units chosen for a variable (e.g. µA instead of kA) can artificially prevent an algorithm from converging properly (e.g. one variable is several orders of magnitude bigger than another and dominates the regression);
  • Floating-point arithmetic error. When switching from 1e8 to 1e9 you simply hit the magnitude at which this kind of error becomes predominant.

The second point is important to realize. Say you are limited to 8 significant digits: then 1 000 000 000 and 1 000 000 001 are the same number, because both are constrained to the representation 1.0000000e9, and we cannot accurately represent 1.0000000_e9, which would require one more digit (_). This is why your second example fails.
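A quick way to see this effect in Python is to drop to float32, which carries roughly 7 significant decimal digits (the 8-digit figure above is just an illustration):

import numpy as np

a = np.float32(1_000_000_000)
b = np.float32(1_000_000_001)
print(a == b)   # True: both round to the same representable float32 value
print(b - a)    # 0.0

# float64 (~15-16 significant digits) still distinguishes them:
print(np.float64(1_000_000_001) - np.float64(1_000_000_000))  # 1.0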

Additionally, you are using a Non-Linear Least Squares algorithm to solve a Linear Least Squares problem, which is also related to your problem.

You have three solutions:

  • Normalize;
  • Normalize and change the methodology/algorithm;
  • Increase the machine precision.

I'll go with the first option as it is the most generic; the second has been proposed by @blunova and totally makes sense; the last is probably an inherent machine limitation.

Normalization

To mitigate both problems, a common solution is normalization. In your case a simple standardization is enough:

import numpy as np
import scipy.optimize

y = np.arange(5)
x = 1e9 + y

def fit_func(x, a, b):
    return a + b * x

xm = np.mean(x)         # 1000000002.0
xs = np.std(x)          # 1.4142135623730951

result = scipy.optimize.curve_fit(fit_func, (x - xm)/xs, y)

# (array([2.        , 1.41421356]),
# array([[0., 0.],
#        [0., 0.]]))

# Back transformation (see the derivation below); note that result[0]
# holds [intercept, slope] of the normalized fit:
a = result[0][1]/xs                    # slope in original units: 1.0
b = result[0][0] - xm*result[0][1]/xs  # intercept in original units: -1000000000.0

Or the same result using the sklearn interface:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression())
])

pipe.fit(x.reshape(-1, 1), y)

pipe.named_steps["scaler"].mean_          # array([1.e+09])
pipe.named_steps["scaler"].scale_         # array([1.41421356])
pipe.named_steps["regressor"].coef_       # array([1.41421356])
pipe.named_steps["regressor"].intercept_  # 2.0

Back transformation

Indeed, when normalizing, the fit result is expressed in terms of the normalized variable. To get the required fit parameters, you just need a bit of algebra to convert the regressed parameters back to the original variable scale.

Simply write down and solve the transformation:

 y = x'*a' + b'
x' = (x - m)/s
 y = x*a + b

Which gives you the following solution:

a = a'/s
b = b' - m/s*a'
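As a quick numerical check of these formulas, here is a small sketch that re-runs the normalized fit from above; the primed names follow the equations, with a' the slope and b' the intercept of the fit on the scaled data (which correspond to fit_func's b and a respectively):

import numpy as np
import scipy.optimize

def fit_func(x, a, b):
    return a + b * x

y = np.arange(5)
x = 1e9 + y

m, s = np.mean(x), np.std(x)
# fit_func's a is the intercept (= b' here) and its b is the slope (= a' here)
(b_p, a_p), _ = scipy.optimize.curve_fit(fit_func, (x - m) / s, y)

a = a_p / s             # slope in the original scale, ~1.0
b = b_p - m / s * a_p   # intercept in the original scale, ~-1e9
print(a, b)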

Precision addendum

NumPy's default float precision is float64, as you expected, with about 15 significant digits:

x.dtype                            # dtype('float64')
np.finfo(np.float64).precision     # 15

But scipy.optimize.curve_fit relies on scipy.optimize.least_squares, which uses a squared metric to drive the optimization.

Without digging into the details, I suspect this is where the problem happens: when dealing with values that are all close to 1e9, you reach the threshold where floating-point arithmetic error becomes predominant.

So the 1e9 threshold you have hit is not about distinguishing the values of your variable x from one another (float64 has sufficient precision to keep them exactly distinct), but about how those values are used when solving:

minimize F(x) = 0.5 * sum(rho(f_i(x)**2), i = 0, ..., m - 1)
subject to lb <= x <= ub
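As a rough illustration of where the squaring is likely to hurt (this is my reading of the problem, not a trace of the solver internals): quantities built from x ≈ 1e9, such as the entries of J^T J in the normal equations, end up around 1e18, where float64 can no longer resolve unit differences:

import numpy as np

# Around 1e9, float64 resolves a difference of 1 without trouble:
print(np.float64(1e9 + 1) - np.float64(1e9))     # 1.0

# But squared quantities of that magnitude sit near 1e18, where the gap
# between adjacent float64 values is already 128:
print(np.spacing(1e18))                          # 128.0
print(1e18 + 1 == 1e18)                          # True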

You can also check in its signature that the default tolerances are on the order of 1e-8:

scipy.optimize.least_squares(fun, x0, jac='2-point', bounds=(- inf, inf),
    method='trf', ftol=1e-08, xtol=1e-08, gtol=1e-08, x_scale=1.0,
    loss='linear', f_scale=1.0, diff_step=None, tr_solver=None, 
    tr_options={}, jac_sparsity=None, max_nfev=None, verbose=0,
    args=(), kwargs={})

These may let you tweak the algorithm to take extra steps before convergence is declared (if it is reached at all), but that will not replace or beat the usefulness of normalization.
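For completeness, here is how such tolerances can be forwarded through curve_fit (extra keyword arguments are passed on to the underlying solver); whether this rescues the large-offset fit is not guaranteed, and normalization remains the more robust fix:

import scipy.optimize

def fit_func(x, a, b):
    return a + b * x

y = list(range(5))
x = [1e9 + a for a in range(5)]

popt, pcov = scipy.optimize.curve_fit(
    fit_func, x, y, p0=[-x[0], 0.0],
    ftol=1e-15, xtol=1e-15, gtol=1e-15,   # tightened tolerances, forwarded to the solver
)
print(popt)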

Methods comparison

What is interesting about the scipy.stats.linregress method is that scale tolerance is handled by design. The method uses variable normalization, pure linear algebra and a numerical stability trick (see the TINY variable in its source) to solve the LS problem even in problematic conditions.

This of course contrasts with the scipy.optimize.curve_fit method, which is an NLLS solver implemented as an iterative minimization algorithm (see the Levenberg–Marquardt algorithm).

If you stick with linear least squares problems (linear in terms of the parameters, not the variables, so a second-order polynomial is still LLS), then an LLS solver is probably the simpler option to choose, as implementations such as linregress handle the scaling for you.
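A concrete LLS option that handles the scaling by design is numpy.polynomial.Polynomial.fit, which maps x onto [-1, 1] internally (a small sketch; deg=2 covers the second-order case mentioned in the question):

import numpy as np
from numpy.polynomial import Polynomial

x = np.array([1e9 + a for a in range(5)])
y = np.arange(5.0)

p = Polynomial.fit(x, y, deg=2)

print(p.domain)   # the x-interval that was mapped onto [-1, 1]
print(p(x))       # approximately [0., 1., 2., 3., 4.]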

Answered By: jlandercy