How to correctly plot a linear regression on a log10 scale?

Question:

I am plotting two lists of data against each other, namely freq and data. Freq stands for frequency, and data are the numeric observations for each frequency.

In the next step, I apply the ordinary linear least-squared regression between freq and data, using stats.linregress on the logarithmic scale. My aim is applying the linear regression inside the log-log scale, not on the normal scale.

Before doing so, I transform both freq and data into np.log10, since I plan to plot a straight linear regression line on the logarithmic scale, using plt.loglog.

Problem:
The problem is that the regression line, plotted in red color, is plotted far from the actual data, plotted in green color. I assume that there is a problem in combination with plt.loglog in my code, hence the visual distance between the green data and the red regression line. How can I fix this problem, so that the regression line plots on top of the actual data?

Here is my reproducible code:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Data
freq = [0.0102539, 0.0107422, 0.0112305, 0.0117188, 0.012207, 0.0126953,
        0.0131836]
data = [4.48575,  4.11893,  3.69591,  3.34766,  3.18452,  3.23554,  3.43357]

# Plot log10 of freq vs. data
plt.loglog(freq, data, c="green")

# Linear regression
log_freq = np.log10(freq)
log_data = np.log10(data)

reg = stats.linregress(log_freq, log_data)
slope = reg[0]
intercept = reg[1]

plt.plot(freq, slope*log_freq + intercept, color="red")

And here is a screenshot of the code’s result:
enter image description here

Asked By: Philipp

||

Answers:

You can convert your data sets to log base 10 first, then do linear regression and plot them accordingly.

Note that after the log transformation, the numbers inlog_freq will all be negative; therefore x-axis cannot be log-scaled.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Data
freq = np.array([0.0102539, 0.0107422, 0.0112305, 0.0117188, 0.012207, 0.0126953,
                 0.0131836])
data = np.array([4.48575, 4.11893, 3.69591, 3.34766, 3.18452, 3.23554, 3.43357])

# transform date to log base 10
log_freq = np.log10(freq)
log_data = np.log10(data)

# Plot freq vs. data
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(log_freq, log_data, c="green", label='Original data (log 10 base)')

# Linear regression
reg = stats.linregress(log_freq, log_data)

# Plot fitted freq vs. data
ax.plot(log_freq, reg.slope * log_freq + reg.intercept, color="red",
        label='Fitted line on the original data (log 10 base)')

plt.legend()
plt.tight_layout()
plt.show()

enter image description here

References:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html

https://numpy.org/doc/stable/reference/generated/numpy.log10.html#

Answered By: コリン

enter image description here

First of all, I question the necessity of log-log axes, because the ranges of the data, or at least the ranges of the data that you’ve shown us, are limited on both coordinates.

In the code below, I have

  • computed the logarithms in base 10 of your arrays,
  • used the formulas for linear regression but using the logarithms of data to obtain the equation of a straight line:
                     y = a + b·x
    in, so to say, the logarithmic space.

Because a straight line in log-space corresponds, in data-space, to a power law, y = pow(10, a)·pow(x, b), I have plotted

  • the original data, in log-log, and
  • the power law, also in log-log,

obtaining a straight line in the log-log representation.

import matplotlib.pyplot as plt
from math import log10
freq = [.0102539, .0107422, .0112305, .0117188, .012207, .0126953, .0131836]
data = [4.48575, 4.11893, 3.69591, 3.34766, 3.18452, 3.23554, 3.43357]
n = len(freq)

# the following block of code is the unfolding of the formulas in
# https://mathworld.wolfram.com/LeastSquaresFittingPowerLaw.html

# START ##############################################    
lx, ly = [[log10(V) for V in v] for v in (freq, data)]
sum_x = sum(x for x in lx)
sum_y = sum(y for y in ly)
sum_x2 = sum(x**2 for x in lx)
sum_y2 = sum(y**2 for y in ly)
sum_xy = sum(x*y for x, y in zip(lx, ly))

# coefficients of a straight line "y = a + b x" in log-log space
b = (n*sum_xy - sum_x*sum_y)/(n*sum_x2-sum_x**2)
a = (sum_y - b*sum_x)/n
A = pow(10, a)
# END  ##############################################    

plt.loglog(freq, data)
plt.loglog(freq, [A*pow(x, b) for x in freq])
Answered By: gboffi