linear regression: my plotting doesn't show the line
Question:
I am implementing a linear regression model from scratch, that is, without using the sklearn package.
Everything was working fine until I tried plotting the result.
I looked at a number of solutions, but none of them addressed my problem.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
data = pd.read_csv(r'C:Salary.csv')
x=data['Salary']
y=data['YearsExperience']
#y= mx+b
m = 0
b = 0
Learning_Rate = .01
epochs = 5000
n = np.float(x.shape[0])
error = []
for i in range(epochs):
    Y_hat = m*x+b
    #error
    mse= (1/n)*np.sum((y-Y_hat)**2)
    error.append(mse)
    #gradient descend
    db = (-2/n) * np.sum(x*(y-Y_hat))
    dm = (-2/n) * np.sum((y-Y_hat))
    m = m - Learning_Rate * dm
    b = b - Learning_Rate * db
#tracing x and y line
x_line = np.linspace(0, 15, 100)
y_line = (m*x_line)+ b
#ploting result
plt.figure(figsize=(8,6))
plt.title('LR result')
plt.plot(x_line,y_line) #the problem is apparently here
# i just don't know what to do
plt.scatter(x,y)
plt.show()
Apart from that, there is no problem with the code.
Answers:
The problem is not in the plotting itself but in the values passed to plt.plot(x_line,y_line). I tested your code and found that y_line is all NaN values, so double-check the calculations (y_line, m, dm).
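A quick way to confirm this is to track when m and b stop being finite. The sketch below is only an illustration: Salary.csv is not available here, so it assumes synthetic data in a similar range (salaries of roughly 40000 to 140000, experience of roughly 1 to 13 years), and the names first_bad_epoch and learning_rate are made up for the example. The update rule is copied from the question unchanged.

```python
import numpy as np

# Synthetic stand-in for Salary.csv (an assumption, not the asker's real data)
np.random.seed(42)
x = np.random.randint(low=38000, high=145000, size=40).astype(float)
y = (13 - 1) / (140000 - 40000) * (x - 40000) + 1 + 0.5 * np.random.randn(40)

m, b = 0.0, 0.0
learning_rate = 0.01
n = float(x.shape[0])
first_bad_epoch = None  # first epoch at which m or b is inf/NaN

with np.errstate(all="ignore"):  # silence overflow warnings while the fit diverges
    for epoch in range(5000):
        y_hat = m * x + b
        # same updates as in the question, including the swapped gradients
        db = (-2 / n) * np.sum(x * (y - y_hat))
        dm = (-2 / n) * np.sum(y - y_hat)
        m = m - learning_rate * dm
        b = b - learning_rate * db
        if first_bad_epoch is None and not (np.isfinite(m) and np.isfinite(b)):
            first_bad_epoch = epoch

    x_line = np.linspace(0, 15, 100)
    y_line = m * x_line + b

print("first epoch with non-finite m or b:", first_bad_epoch)
print("finite points in y_line:", np.isfinite(y_line).sum())
```

With no finite points in y_line, matplotlib has nothing to draw, which is why the scatter shows up but the line does not.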
Your code has multiple problems:
- You are plotting the line from 0 to 15, while the data range from about 40000 to 140000. Even if you compute the line correctly, you will plot it in a region far away from your data.
- In the loop there is a mistake in the computation of dm and db: they are swapped. The corrected expressions are:
  dm = (-2/n)*np.sum(x*(y - Y_hat))
  db = (-2/n)*np.sum((y - Y_hat))
- Your x and y data are on very different scales: x is of order 10⁴ to 10⁵, while y is of order 10¹. For this reason, m and b will likely also differ by several orders of magnitude. This is why you should use two different learning rates for the two quantities you are optimizing: Learning_Rate_m for m and Learning_Rate_b for b.
- Finally, gradient descent is sensitive to the initial guess. In general it can settle in a local minimum instead of the global one. For this least-squares problem the loss is convex and has a single minimum, but a starting point far from the solution can still need many more iterations to converge. For this reason, you should try different initial guesses for m and b, possibly close to their estimated values:
  m = 0
  b = -2
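The point about the swapped gradients can be checked numerically. The sketch below compares the analytic expressions with central finite differences of the MSE. It assumes synthetic data in the same range as the question, and the helper name mse_loss and the step size h are choices made for this example only.

```python
import numpy as np

# Synthetic data in the same range as the question (an assumption)
np.random.seed(42)
x = np.random.randint(low=38000, high=145000, size=40).astype(float)
y = (13 - 1) / (140000 - 40000) * (x - 40000) + 1 + 0.5 * np.random.randn(40)
n = float(x.shape[0])

def mse_loss(m, b):
    """Mean squared error of the line y = m*x + b on the data above."""
    return np.sum((y - (m * x + b)) ** 2) / n

m0, b0 = 0.0, 0.0              # point at which the gradients are evaluated
residual = y - (m0 * x + b0)

# Analytic gradients, as in the corrected expressions
dm = (-2 / n) * np.sum(x * residual)
db = (-2 / n) * np.sum(residual)

# Central finite differences of the loss
h = 1e-6
dm_num = (mse_loss(m0 + h, b0) - mse_loss(m0 - h, b0)) / (2 * h)
db_num = (mse_loss(m0, b0 + h) - mse_loss(m0, b0 - h)) / (2 * h)

print("dm analytic vs numeric:", dm, dm_num)
print("db analytic vs numeric:", db, db_num)
```

The factor x belongs in dm because m multiplies x in the model. Using the expression without x for dm, as the question does, gives a value that differs from the numeric slope derivative by several orders of magnitude.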
Complete Code
import numpy as np
import matplotlib.pyplot as plt
N = 40
np.random.seed(42)
x = np.random.randint(low = 38000, high = 145000, size = N)
y = (13 - 1)/(140000 - 40000)*(x - 40000) + 1 + 0.5*np.random.randn(N)
# initial guess
m = 0
b = -2
Learning_Rate_m = 1e-10
Learning_Rate_b = 1e-2
epochs = 5000
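n = float(x.shape[0])  # np.float was removed in NumPy 1.24; the built-in float works
error = []
for i in range(epochs):
    Y_hat = m*x + b
    # mean squared error, stored to monitor convergence
    mse = 1/n*np.sum((y - Y_hat)**2)
    error.append(mse)
    # gradients of the MSE with respect to m and b
    dm = -2/n*np.sum(x*(y - Y_hat))
    db = -2/n*np.sum((y - Y_hat))
    # separate learning rates because m and b live on different scales
    m = m - Learning_Rate_m*dm
    b = b - Learning_Rate_b*db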
x_line = np.linspace(x.min(), x.max(), 100)
y_line = (m*x_line) + b
plt.figure(figsize=(8,6))
plt.title('LR result')
plt.plot(x_line,y_line, 'red')
plt.scatter(x,y)
plt.show()
Plot: the scattered data points with the fitted line drawn in red.
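An alternative to tuning two learning rates, not part of the answer above, is to standardize x before fitting so that a single learning rate works, and then convert the parameters back to the original units. The sketch below does that on the same synthetic data and compares the result with np.polyfit as a cross-check. The names mu, sigma, xs, ms, bs and the value lr = 0.1 are assumptions made for this example.

```python
import numpy as np

# Same synthetic data as in the complete code
np.random.seed(42)
N = 40
x = np.random.randint(low=38000, high=145000, size=N).astype(float)
y = (13 - 1) / (140000 - 40000) * (x - 40000) + 1 + 0.5 * np.random.randn(N)

# Standardize x to zero mean and unit variance
mu, sigma = x.mean(), x.std()
xs = (x - mu) / sigma

# Plain gradient descent on y = ms*xs + bs with one learning rate
ms, bs = 0.0, 0.0
lr = 0.1
for _ in range(1000):
    residual = y - (ms * xs + bs)
    ms -= lr * (-2 / N) * np.sum(xs * residual)
    bs -= lr * (-2 / N) * np.sum(residual)

# Convert back to the original units: y = m*x + b
m = ms / sigma
b = bs - ms * mu / sigma

# Least-squares reference from NumPy (highest-degree coefficient first)
m_ref, b_ref = np.polyfit(x, y, 1)

print("gradient descent:", m, b)
print("np.polyfit:      ", m_ref, b_ref)
```

Because the standardized feature has mean 0 and variance 1, both parameters see gradients of similar size, so the same step size suits both.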