Merging results from model.predict() with original pandas DataFrame?

Question:

I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable
df['class'] = data.target

X = np.matrix(df.loc[:, [0, 1, 2, 3]])
y = np.array(df['class'])

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)

To merge these predictions back with the original df, I try this:

df['y_hats'] = y_hats

But that raises:

ValueError: Length of values does not match length of index

I know I could split the df into train_df and test_df and this problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my df, since the y_hats array is zero-indexed and seemingly all information about which rows were included in the X_test and y_test is lost? Or will I be relegated to splitting dataframes into train-test first, and then building feature matrices? I’d like to just fill the rows included in train with np.nan values in the dataframe.

Asked By: blacksite

||

Answers:

Your y_hats array only has as many entries as the test data (20%) because you predicted on X_test. Once your model is validated and you’re happy with the test predictions (by comparing the predictions on X_test against the true values in y_test), you can rerun predict on the full dataset (X). Add these two lines to the bottom:

y_hats2 = model.predict(X)

df['y_hats'] = y_hats2

EDIT: per your comment, here is an updated version that returns the dataset with the predictions appended to the rows that were in the test dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable
df_class = pd.DataFrame(data = data.target)

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)

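# y_test still carries the original row index; copy it before adding a
# column to avoid pandas' SettingWithCopyWarning
y_test = y_test.copy()
y_test['preds'] = y_hats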

df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
Answered By: flyingmeatball

You can also use

y_hats = model.predict(X)

# predict returns a NumPy array with one entry per row of X,
# so it can be assigned to the DataFrame directly
df['y_hats'] = y_hats
Answered By: ambar003

You can probably make a new DataFrame that holds the test data and add the predicted values to it (data below stands for that test DataFrame):

data['y_hats'] = y_hats
data.to_csv('data1.csv')
Answered By: Nidhi Garg

I had almost the same problem and fixed it this way:

...
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

y_hats = model.predict(X_test)

y_hats = pd.DataFrame(y_hats)

# put the actual and predicted values next to the test features, row by row
df_out = X_test.reset_index()
df_out["Actual"] = y_test.reset_index()["Columns_Name"]
df_out["Prediction"] = y_hats.reset_index()[0]
Answered By: asmgx

You can create a y_hat dataframe copying indices from X_test then merge with the original data.

y_hats_df = pd.DataFrame(data = y_hats, columns = ['y_hats'], index = X_test.index.copy())
df_out = pd.merge(df, y_hats_df, how = 'left', left_index = True, right_index = True)

Note that the left join keeps the training rows too, with NaN in the y_hats column. Omitting the how parameter falls back to the default inner join, which keeps only the test rows.
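
If the features have already been converted to a NumPy array, as in the question, X_test has no index to copy. One possible workaround, sketched here on the assumption that the array rows are still in the same order as df, is to pass the index labels through train_test_split as a third array so the split hands back the row labels as well. The names idx_train and idx_test and the random_state values are illustrative choices, not part of the original code.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

data = load_iris()
df = pd.DataFrame(data=data.data)
df['class'] = data.target

# Plain arrays: no pandas index survives this step
X = df.loc[:, [0, 1, 2, 3]].to_numpy()
y = df['class'].to_numpy()

# train_test_split splits every array it is given in the same way,
# so the index labels travel with their rows
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, df.index.to_numpy(), train_size=0.8, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)
y_hats = model.predict(X_test)

# A Series keyed by the test labels aligns on the index when assigned,
# leaving NaN in the rows that were used for training
df['y_hats'] = pd.Series(y_hats, index=idx_test)
```

Rows that went into training keep NaN in y_hats, which is what the question asks for.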

Answered By: Adam Milecki

Try this:

y_hats2 = model.predict(X)
df[['y_hats']] = y_hats2
Answered By: PATRICK KANYI

predicted = m.predict(X_valid)
predicted_df = pd.DataFrame(data=predicted, columns=['y_hat'], 
                            index=X_valid.index.copy())
df_out = pd.merge(X_valid, predicted_df, how ='left', left_index=True, 
                 right_index=True)
Answered By: Reshma2k

This worked well for me. It maintains the indexing positions.

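# assumes predict returns probabilities (e.g. a Keras model);
# scikit-learn classifiers return class labels from predict
pred_prob = model.predict(X_test)
pred_class = np.where(pred_prob > 0.5, "Yes", "No")  # for a binary (Yes/No) category
# reuse the test index (X_test assumed to be a DataFrame) so that concat
# lines the rows up by label
predictions = pd.DataFrame(pred_class, columns=['Prediction'], index=X_test.index)
my_new_df = pd.concat([my_old_df, predictions], axis=1)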
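
One thing to watch: pd.concat with axis=1 matches rows by index label, not by position, so the predictions frame needs the same index as the frame it is joined to. Here is a small sketch with made-up data (the feature values, probabilities and index labels are all hypothetical) showing the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical test split whose index labels are not 0..n-1
X_test = pd.DataFrame({'feature': [0.1, 0.9, 0.4]}, index=[7, 12, 5])
pred_prob = np.array([0.2, 0.8, 0.6])  # made-up probabilities
pred_class = np.where(pred_prob > 0.5, "Yes", "No")

# Default RangeIndex (0, 1, 2) shares no labels with X_test,
# so concat produces the union of both indexes, padded with NaN
misaligned = pd.concat(
    [X_test, pd.DataFrame(pred_class, columns=['Prediction'])], axis=1)

# Same index as X_test: the rows line up one-to-one
aligned = pd.concat(
    [X_test,
     pd.DataFrame(pred_class, columns=['Prediction'], index=X_test.index)],
    axis=1)
```

Here misaligned ends up with six rows, half of them without a prediction, while aligned keeps the three test rows with their predictions attached.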
Answered By: user115916

Here is a solution that worked for me:

It consists of building, for each of your folds/iterations, one dataframe which includes observed and predicted values for your test set; this way, you make use of the index (ID) contained in y_true, which should correspond to your subjects’ IDs (in my code: ‘SubjID’).

You then concatenate the DataFrames that you generated (through 5 folds of test data in my case) and paste them back into your original dataset.

I hope this helps!

fold_frames = []
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # [...] your model

    # performance is measured on test set
    y_true, y_pred = y_test, clf.predict(X_test)

    # Save observed and predicted values for each test set, indexed by SubjID
    a = pd.DataFrame(y_true).reset_index()
    b = pd.Series(y_pred, name = 'y_pred')
    fold_frames.append(a.join(b).set_index('SubjID'))

# Create dataframe with observed and predicted values for all subjects
ObsPred_Concat = pd.concat(fold_frames)

original_df['y_pred'] = ObsPred_Concat['y_pred']
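
For anyone who wants to try the fold-by-fold idea end to end, here is a self-contained sketch on the iris data. The 'S000'-style subject IDs, the decision tree and the random_state values are stand-ins I made up for illustration; only the overall pattern (collect one frame per fold, concatenate, assign back by index) follows the description above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = pd.DataFrame(iris.data)
y = pd.Series(iris.target, name='observed')
# Stand-in subject IDs so the index is not simply 0..n-1
X.index = y.index = pd.Index(
    [f"S{i:03d}" for i in range(len(y))], name='SubjID')

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_frames = []
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    # Keep the test rows' SubjID labels on the predictions
    fold_frames.append(pd.DataFrame(
        {'observed': y_test, 'y_pred': clf.predict(X_test)},
        index=y_test.index))

# Every subject is in exactly one test fold, so this covers all rows once
obs_pred = pd.concat(fold_frames)

original_df = X.copy()
original_df['y_pred'] = obs_pred['y_pred']  # aligned on SubjID
```

Because the assignment aligns on SubjID, the order in which the folds were concatenated does not matter.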
Answered By: IreneF

First, convert the y_val or y_test data into a DataFrame, naming the column explicitly so it can be referenced later:

compare_df = pd.DataFrame({'y_val': y_val})

Then create a new column with the predicted data.

compare_df['predicted_res'] = y_pred_val

After that, a simple condition filters for the rows where the prediction matches the actual value.

test_df = compare_df[compare_df['y_val'] == compare_df['predicted_res'] ]
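
A quick way to check this on toy data is sketched below. The labels, predictions and index values are invented; note that pd.DataFrame(y_val) on a Series would name the column after the Series, so the column is named explicitly here.

```python
import numpy as np
import pandas as pd

# Made-up labels and predictions, for illustration only
y_val = pd.Series([0, 1, 2, 1], index=[10, 11, 12, 13], name='class')
y_pred_val = np.array([0, 2, 2, 1])

compare_df = pd.DataFrame({'y_val': y_val})
# A NumPy array is assigned by position, so it must be in the same order as y_val
compare_df['predicted_res'] = y_pred_val

is_match = compare_df['y_val'] == compare_df['predicted_res']
matches = compare_df[is_match]    # correctly predicted rows
misses = compare_df[~is_match]    # misclassified rows
accuracy = is_match.mean()        # share of matching rows
```

The original index labels survive in compare_df, so the matching or missed rows can be looked up in the source DataFrame afterwards.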
Answered By: Kushal Bhavsar