High metrics value for linear regression model

Question:

I’m working with a dataset of cars, containing 10k rows and 5 columns:

id mileage_per_year model_year price sold

There are two negative values in the price column. I didn’t know whether I should set them to 0 or leave them alone, so I left them untouched. I don’t know how much they affect the training.

id 4200 price -270.77 mileage_per_year 17000 model_year 1998
id 4796 price -840.36 mileage_per_year 13277 model_year 1998
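One common option, sketched here under the assumption that a negative price is a data-entry error rather than a meaningful value, is to inspect those rows and exclude them before training. Only the two negative rows come from the question; the other rows in this small frame are made-up placeholders.

```python
import pandas as pd

# Illustrative frame: the two negative rows are quoted from the question,
# the remaining two rows are invented placeholders.
df = pd.DataFrame({
    "id": [4200, 4796, 1, 2],
    "price": [-270.77, -840.36, 30000.0, 45000.0],
    "mileage_per_year": [17000, 13277, 12000, 9000],
    "model_year": [1998, 1998, 2005, 2010],
})

# Look at the suspicious rows first.
negative = df[df["price"] < 0]

# Assumption: a negative price is invalid, so keep only non-negative prices.
clean = df[df["price"] >= 0].reset_index(drop=True)
```

With 2 bad rows out of 10k, dropping them is unlikely to change a linear fit much, but it avoids feeding obviously invalid targets to the model.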

The max price in the dataset is 118,929.72 and the mean price is 64,842.

The challenge was to perform an exploratory analysis, convert the dataset from the imperial to the metric system, translate it from English to Portuguese, and create a model to predict the price of a 2005 car with 172,095.3 total kilometers.

KM por ano = mileage_per_year
Ano do modelo = model_year
Vendido = sold

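novo = ['Indice', 'KM por ano', 'Ano do modelo', 'Preço', 'Vendido']
# Rename the five original columns in order.
df = df.rename(columns=dict(zip(df.columns, novo)))

# Convert miles per year to kilometers per year (truncated to an integer).
df['KM por ano'] = (df['KM por ano'] * 1.60934).astype(int)

# Translate the sold flag: 'yes' -> 'Sim', anything else -> 'Nao'.
df['Vendido'] = df['Vendido'].map(lambda v: 'Sim' if v == 'yes' else 'Nao')

# Total kilometers = yearly kilometers * age of the car in 2023.
# Whole-column operations avoid the chained assignment df[col][i] = ...,
# which is slow and may not write back to df.
df['KM total'] = df['KM por ano'] * (2023 - df['Ano do modelo'])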

I created a new column containing the total KM the car has so it would be, in my mind, easier to train the model.

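from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Features and target
X = df[['KM por ano', 'KM total', 'Ano do modelo']]
y = df[['Preço']]

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

lr = LinearRegression()
lr.fit(X_train, y_train)

# Target car: model year 2005 with 172,095.3 total km
km_ano = 172095.3 / (2023 - 2005)
preds = lr.predict([[km_ano, 172095.3, 2005]])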

The model predicts a price of 66,347 USD.

My problem is with the evaluation of the model.

MAE = 20,849.715
MSE(squared=False) = 24,571.520
RMSE(squared=False) = 156.753

At this point I thought the model was bogus and that something was wrong. However, the coefficient of determination seemed encouraging (at least to me):

r2 = 0.04946931214392558

Am I doing something wrong? Multiple linear regression felt like the obvious choice for this, but with these metrics it feels wrong.

Sorry if the question isn’t clear; I tried my best to explain.

The main task here was to predict the price of a car with 172,095.3 total km and model year 2005. I think I did everything right, but as far as I know the metrics should be between 0 and 1, and instead they are off the charts. Only r2 seems to corroborate the prediction, but is it enough to trust the model?

Asked By: Filipe Santos


Answers:

None of your error metrics will be between 0 and 1 if you’re trying to predict the price of a car: MAE and RMSE are expressed in the same unit as the target. However, if you were trying to predict the size of an ant, values like these would mean you have a problem 🙂
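To make the units point concrete, here is a minimal sketch of how these metrics are defined, in plain Python on made-up numbers (the `y_true` and `y_pred` values are illustrative, not taken from the dataset). MAE and RMSE come out in the unit of the price, while R² is unitless and at most 1. An R² of about 0.05 would mean the model explains only around 5% of the variance in price. Also, 156.753 in the question is the square root of 24,571.520, which suggests the square root was applied twice to get that last figure.

```python
import math

def mae(y_true, y_pred):
    # Mean absolute error, in the unit of the target.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # Root mean squared error, also in the unit of the target.
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

def r2(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot (unitless).
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up prices, for illustration only.
y_true = [50000.0, 60000.0, 70000.0, 80000.0]
y_pred = [55000.0, 58000.0, 66000.0, 85000.0]

mae_value = mae(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
r2_value = r2(y_true, y_pred)

# The third figure in the question is the square root of the second one.
double_root = round(math.sqrt(24571.520), 3)
```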

The price you predict has a mean absolute error of about $20.8k. So, as a rough range rather than a formal confidence interval, you can read your prediction of 66,347 as lying somewhere in [45,497.285, 87,196.715], that is 66,347 ± 20,849.715. The smaller that spread, the more confidence you can have in your prediction. It is better to use the MAPE (Mean Absolute Percentage Error) to get a percentage error. Here the MAE is roughly 31% of the predicted price (20,849.715 / 66,347), so you can probably expect an error of that order.

So with MAPE, you can compare the accuracy of two models: one for car price and other for ant size.
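A minimal sketch of that comparison, in plain Python with made-up values for both the car prices and the ant sizes: the two models have very different MAE values because their targets live on different scales, yet the same MAPE. The helper names `mape` and `mae` are defined here for illustration; scikit-learn also ships `sklearn.metrics.mean_absolute_percentage_error`, which returns a fraction rather than a percentage.

```python
def mae(y_true, y_pred):
    # Mean absolute error, in the unit of the target.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    # Mean absolute percentage error, returned as a percentage.
    # Undefined when a true value is 0, and hard to interpret for
    # negative true values such as the two negative prices.
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up car prices in USD.
car_true, car_pred = [20000.0, 40000.0], [22000.0, 36000.0]

# Made-up ant sizes in millimeters.
ant_true, ant_pred = [2.0, 4.0], [2.2, 3.6]

car_mae, ant_mae = mae(car_true, car_pred), mae(ant_true, ant_pred)
car_mape, ant_mape = mape(car_true, car_pred), mape(ant_true, ant_pred)
```

Here `car_mae` is in the thousands and `ant_mae` is a fraction of a millimeter, while both MAPE values are about 10%.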

Answered By: Corralien