Linear Regression: How to find the distance between the points and the prediction line?

Question:

I’m trying to find the distance between each point and the prediction line. Ideally, the result would be stored in a new column called ‘Distance’.

My Imports:

import os.path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
%matplotlib inline 

Sample of my data:

idx  Exam Results  Hours Studied
0       93          8.232795
1       94          7.879095
2       92          6.972698
3       88          6.854017
4       91          6.043066
5       87          5.510013
6       89          5.509297

My code so far:

x = df['Hours Studied'].values[:,np.newaxis]
y = df['Exam Results'].values

model = LinearRegression()
model.fit(x, y)

plt.scatter(x, y,color='r')
plt.plot(x, model.predict(x),color='k')
plt.show()

My plot

Any help would be greatly appreciated. Thanks

Asked By: Mark Kennedy


Answers:

You simply need to assign the difference between y and model.predict(x) to a new column (or take the absolute value if you only want the magnitude of the difference):

#df["Distance"] = abs(y - model.predict(x))  # if you only want magnitude
df["Distance"] = y - model.predict(x)
print(df)
#   Exam Results  Hours Studied  Distance
#0            93       8.232795 -0.478739
#1            94       7.879095  1.198511
#2            92       6.972698  0.934043
#3            88       6.854017 -2.838712
#4            91       6.043066  1.714063
#5            87       5.510013 -1.265269
#6            89       5.509297  0.736102

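This works because the model predicts a y value (the dependent variable) for each x value (the independent variable). Each point and its prediction share the same x coordinate, so the difference in y is the vertical distance from the point to the line, also known as the residual.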
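For anyone who wants to try this without the original DataFrame, here is a minimal self-contained sketch. It assumes the seven sample rows from the question are the whole dataset, and it adds a second column, 'Perpendicular', which is not part of the question: that column name is hypothetical and only illustrates how the shortest (perpendicular) distance to the line could be derived from the vertical one, using |residual| / sqrt(1 + slope^2).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Sample rows copied from the question (assumed here to be the full dataset).
df = pd.DataFrame({
    "Exam Results": [93, 94, 92, 88, 91, 87, 89],
    "Hours Studied": [8.232795, 7.879095, 6.972698, 6.854017,
                      6.043066, 5.510013, 5.509297],
})

# scikit-learn expects a 2-D feature array, hence the double brackets.
X = df[["Hours Studied"]].to_numpy()
y = df["Exam Results"].to_numpy()

model = LinearRegression().fit(X, y)

# Vertical distance (residual): positive when the point lies above the line.
df["Distance"] = y - model.predict(X)

# Perpendicular distance from each point to the fitted line y = slope*x + intercept.
# 'Perpendicular' is a hypothetical column name used only for this illustration.
slope = model.coef_[0]
df["Perpendicular"] = df["Distance"].abs() / np.sqrt(1 + slope**2)

print(df)
```

For regression diagnostics the vertical residual is usually the quantity of interest; the perpendicular distance depends on the units of both axes, so it is mostly useful when the two variables are on comparable scales.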

Answered By: pault