# scipy curve_fit incorrect for large X values

## Question:

To determine trends over time, I use `scipy curve_fit` with X values from `time.time()`, for example `1663847528.7147126` (1.6 billion). Doing a linear fit sometimes produces erroneous results, and providing approximate initial `p0` values doesn’t help. I found the magnitude of X to be a crucial factor in this error, and I wonder why.

Here is a simple snippet that shows working and non-working X offset:

```
import scipy.optimize
def fit_func(x, a, b):
    return a + b * x
y = list(range(5))
x = [1e8 + a for a in range(5)]
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0]))
# Result is correct:
# (array([-1.e+08, 1.e+00]), array([[ 0., -0.],
# [-0., 0.]]))
x = [1e9 + a for a in range(5)]
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0.0]))
# Result is not correct:
# OptimizeWarning: Covariance of the parameters could not be estimated
# warnings.warn('Covariance of the parameters could not be estimated',
# (array([-4.53788811e+08, 4.53788812e-01]), array([[inf, inf],
# [inf, inf]]))
# Almost perfect p0 for b removes the warning but still curve_fit doesn't work
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0.99]))
# Result is not correct:
# (array([-7.60846335e+10, 7.60846334e+01]), array([[-1.97051972e+19, 1.97051970e+10],
# [ 1.97051970e+10, -1.97051968e+01]]))
# ...but perfect p0 works
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 1.0]))
#(array([-1.e+09, 1.e+00]), array([[inf, inf],
# [inf, inf]]))
```

As a side question, perhaps there’s a more efficient method for a linear fit? Sometimes I want to find the second-order polynomial fit, though.

Tested with Python 3.9.6 and SciPy 1.7.1 under Windows 10.

## Answers:

If you just need to compute a linear fit, I believe `curve_fit` is not necessary and I would just use the `linregress` function from SciPy instead:

```
>>> from scipy import stats
>>> y = list(range(5))
>>> x = [1e8 + a for a in range(5)]
>>> stats.linregress(x, y)
LinregressResult(slope=1.0, intercept=-100000000.0, rvalue=1.0, pvalue=1.2004217548761408e-30, stderr=0.0, intercept_stderr=0.0)
>>> x2 = [1e9 + a for a in range(5)]
>>> stats.linregress(x2, y)
LinregressResult(slope=1.0, intercept=-1000000000.0, rvalue=1.0, pvalue=1.2004217548761408e-30, stderr=0.0, intercept_stderr=0.0)
```

In general, if you need a polynomial fit I would use NumPy polyfit.
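With timestamps as large as the ones in the question, it probably helps to shift `x` before calling `np.polyfit`, since powers of `1e9` tend to make the underlying least-squares problem badly conditioned. A minimal sketch, assuming synthetic quadratic data and a hypothetical offset `x0` (neither comes from the question):

```python
import numpy as np

# Synthetic data (an assumption for illustration): a quadratic trend
# sampled at timestamps close to 1e9.
t = np.arange(5.0)
x = 1e9 + t
y = 3.0 + 2.0 * t + 0.5 * t**2

# Centre x so that the fitted powers stay small.
x0 = x.mean()
coeffs = np.polyfit(x - x0, y, deg=2)  # highest power first

# Evaluate the fit at a new timestamp, applying the same shift.
x_new = 1e9 + 2.5
y_new = np.polyval(coeffs, x_new - x0)
print(coeffs, y_new)
```

The coefficients are expressed in the shifted variable `x - x0`, so the same offset has to be applied whenever the polynomial is evaluated. `numpy.polynomial.Polynomial.fit` does a comparable rescaling of the domain internally and may be a convenient alternative.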

### Root cause

You are facing two problems:

- Fitting procedures are scale sensitive. The units chosen for a variable (e.g. µA instead of kA) can artificially prevent an algorithm from converging properly (e.g. one variable is several orders of magnitude larger than another and dominates the regression);
- Floating-point arithmetic error. When switching from `1e8` to `1e9` you hit the magnitude at which this kind of error becomes predominant.

The second one is very important to realize. Let’s say you are limited to a representation with 8 significant digits. Then `1 000 000 000` and `1 000 000 001` are the same number, as they are both written `1.0000000e9`, and we cannot accurately represent `1.000000001e9`, which requires more digits than are available. This is why your second example fails.
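The limited-digits thought experiment can be reproduced with `float32`, which keeps roughly 7 significant decimal digits. This is only a stand-in for illustration; the data in the question is `float64`, and the variable names below are hypothetical:

```python
import numpy as np

a, b = 1_000_000_000, 1_000_000_001

# float32 keeps roughly 7 significant decimal digits, so both integers
# round to the same value.
same_in_float32 = np.float32(a) == np.float32(b)

# float64 keeps roughly 15 digits and still tells them apart.
same_in_float64 = np.float64(a) == np.float64(b)

# Gap between adjacent representable numbers near 1e9 in each format.
gap32 = np.spacing(np.float32(1e9))
gap64 = np.spacing(np.float64(1e9))
print(same_in_float32, same_in_float64, gap32, gap64)
```

In `float32` the two values collapse into one, while in `float64` the gap near `1e9` is still far smaller than 1. The input values themselves are therefore stored accurately; the loss has to come from the arithmetic performed on them.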

Additionally, you are using a nonlinear least squares (NLLS) algorithm to solve a linear least squares (LLS) problem, and this is also somewhat related to your problem.

You have three solutions:

- Normalize;
- Normalize and change the methodology/algorithm;
- Increase the machine precision.

I’ll choose the first one as it is more generic; the second one has been proposed by `@blunova` and totally makes sense; the last one is probably an inherent limitation.

### Normalization

To mitigate both problems, a common solution is normalization. In your case a simple standardization is enough:

```
import numpy as np
import scipy.optimize
y = np.arange(5)
x = 1e9 + y
def fit_func(x, a, b):
    return a + b * x
xm = np.mean(x) # 1000000002.0
xs = np.std(x) # 1.4142135623730951
result = scipy.optimize.curve_fit(fit_func, (x - xm)/xs, y)
# (array([2. , 1.41421356]),
# array([[0., 0.],
# [0., 0.]]))
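# Back transformation. Note: here a is the slope and b the intercept
# (y = x*a + b), the reverse of the parameter order in fit_func.
a = result[0][1]/xs # 1.0
b = result[0][0] - xm*result[0][1]/xs # -1000000000.0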
```

Or the same result using the `sklearn` interface:

```
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression())
])
pipe.fit(x.reshape(-1, 1), y)
pipe.named_steps["scaler"].mean_ # array([1.e+09])
pipe.named_steps["scaler"].scale_ # array([1.41421356])
pipe.named_steps["regressor"].coef_ # array([1.41421356])
pipe.named_steps["regressor"].intercept_ # 2.0
```

### Back transformation

When normalizing, the fit result is expressed in terms of the normalized variable. To get the required fit parameters, you just need a bit of math to convert the regressed parameters back to the original variable scale.

Simply write down and solve the transformation:

```
y = x'*a' + b'
x' = (x - m)/s
y = x*a + b
```

Which gives you the following solution:

```
a = a'/s
b = b' - m/s*a'
```
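The formulas above can be wrapped in a small helper. This is a sketch only: `denormalize_line` is a hypothetical name, and the closed-form `np.polyfit` is used here merely to produce `a'` and `b'` on the standardized variable.

```python
import numpy as np


def denormalize_line(a_n, b_n, m, s):
    """Map slope a' and intercept b' fitted on x' = (x - m)/s back to x."""
    a = a_n / s
    b = b_n - m / s * a_n
    return a, b


y = np.arange(5.0)
x = 1e9 + y
m, s = x.mean(), x.std()

# Closed-form linear fit on the standardized variable; polyfit returns
# the slope first, then the intercept.
a_n, b_n = np.polyfit((x - m) / s, y, deg=1)

a, b = denormalize_line(a_n, b_n, m, s)
print(a, b)
```

The recovered slope should be close to 1 and the intercept close to `-1e9`, up to rounding in the last digits of the intercept.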

### Precision addendum

NumPy’s default float precision is `float64`, as you expected, which has about 15 significant digits:

```
x.dtype # dtype('float64')
np.finfo(np.float64).precision # 15
```

But `scipy.optimize.curve_fit` relies on `scipy.optimize.leastsq` (the default `lm` method) or `scipy.optimize.least_squares` (the `trf` and `dogbox` methods), both of which use a squared metric to drive the optimization.

Without digging into the details, I suspect this is where the problem happens: when dealing with values that are all close to `1e9`, you reach the threshold where floating-point arithmetic error becomes predominant.

So the `1e9` threshold you have hit is not related to the distinction between numbers in your variable `x` (`float64` has sufficient precision to tell them apart) but to the use that is made of them when solving:

```
minimize F(x) = 0.5 * sum(rho(f_i(x)**2), i = 0, ..., m - 1)
subject to lb <= x <= ub
```

You can also check that in the `least_squares` signature the tolerances are about 8 decades wide:

```
scipy.optimize.least_squares(fun, x0, jac='2-point', bounds=(-inf, inf),
method='trf', ftol=1e-08, xtol=1e-08, gtol=1e-08, x_scale=1.0,
loss='linear', f_scale=1.0, diff_step=None, tr_solver=None,
tr_options={}, jac_sparsity=None, max_nfev=None, verbose=0,
args=(), kwargs={})
```

These tolerances let you tweak the algorithm to take extra steps before it declares convergence (if it converges at all), but that will not replace or beat the usefulness of normalization.
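One rough way to see the scale sensitivity is to compare the condition number of the design matrix of `a + b*x` with raw and with standardized `x`. This is a diagnostic sketch with hypothetical variable names, not a description of what `curve_fit` does internally:

```python
import numpy as np

x = 1e9 + np.arange(5.0)
ones = np.ones_like(x)

# Design matrix of the model a + b*x with raw and with standardized x.
A_raw = np.column_stack([ones, x])
A_std = np.column_stack([ones, (x - x.mean()) / x.std()])

cond_raw = np.linalg.cond(A_raw)
cond_std = np.linalg.cond(A_std)
print(f"raw: {cond_raw:.3g}, standardized: {cond_std:.3g}")
```

With raw `x` the two columns are almost parallel and the condition number is enormous, whereas the standardized columns are orthogonal and the condition number is close to 1. Any computation that forms products such as `A.T @ A` squares that condition number, which is consistent with the suspicion that the squared metric is where precision is lost.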

### Methods comparison

What is interesting about the `scipy.stats.linregress` method is its scale tolerance, which is handled by design. The method uses variable normalization, pure linear algebra and a numerical stability trick (see the `TINY` variable) to solve the LS problem even in problematic conditions.

This of course contrasts with the `scipy.optimize.curve_fit` method, which is an NLLS solver implemented as an optimized gradient descent algorithm (see Levenberg–Marquardt algorithm).

If you stick with linear least squares problems (linear in terms of the parameters, not the variables, so a second-order polynomial is LLS), then an LLS method is probably the simpler option to choose, as it handles normalization for you.
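To illustrate "linear in terms of the parameters", here is a sketch of a second-order polynomial fit written as a plain linear least squares problem with `np.linalg.lstsq`. The data are synthetic and the names hypothetical. Unlike `linregress`, plain `lstsq` does not rescale anything, so the shift is done by hand:

```python
import numpy as np

# Synthetic data (an assumption for illustration).
t = np.arange(5.0)
x = 1e9 + t
y = 1.0 - 3.0 * t + 0.25 * t**2

# Plain lstsq does not rescale anything, so shift x by hand.
u = x - x.mean()

# The model c0 + c1*u + c2*u**2 is nonlinear in u but linear in (c0, c1, c2).
A = np.column_stack([np.ones_like(u), u, u**2])
coef, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)

y_fit = A @ coef
print(coef, rank)
```

Because the unknowns enter only through the matrix product `A @ coef`, no initial guess or iteration is needed, which is the main practical difference from `curve_fit`.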