How to merge predicted values back into the original pandas test data frame when X_test was converted with CountVectorizer before splitting

Question:

I want to merge the predictions for my test data with my X_test. I was able to merge them with y_test, but since X_test comes from a vectorized corpus, I'm not sure how to identify the row indexes to merge on.
My code is below:

def lr_model(df):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    import pandas as pd
    import numpy as np  # needed for np.concatenate below

    # Create corpus as a list
    corpus = df['text'].tolist()
    cv = CountVectorizer()
    X = cv.fit_transform(corpus).toarray()
    y = df.iloc[:, -1].values

    # Splitting to testing and training
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

    # Train Logistic Regression on Training set
    classifier = LogisticRegression(random_state = 0)
    classifier.fit(X_train, y_train)

    # Predicting the Test set results
    y_pred = classifier.predict(X_test)

    # Merge true vs predicted labels
    true_vs_pred = pd.DataFrame(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

    return true_vs_pred

This gives me y_test and y_pred, but I'm not sure how to attach the original rows behind X_test (the ids of the X_test rows) to the result.
Any guidance is much appreciated. Thanks.

Asked By: Jessie

||

Answers:

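Using a pipeline lets you link the original X_test to the predictions. Split the raw text column first and let the pipeline do the vectorizing, so X_test stays a pandas Series that keeps the original row index: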

def lr_model(df):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    import pandas as pd
    from sklearn.pipeline import Pipeline

    # Defining X and y
    cv = CountVectorizer()
    X = df['text']
    y = df.iloc[:, -1].values

    # Splitting to testing and training
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

    # Create a pipeline
    pipeline = Pipeline([
        ('CountVectorizer', cv),
        ('LogisticRegression', LogisticRegression(random_state = 0)),
    ])

    # Train pipeline on Training set
    pipeline.fit(X_train, y_train)

    # Predicting the Test set results
    y_pred = pipeline.predict(X_test)

    return X_test, y_test, y_pred
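
To end up with a single data frame rather than three separate objects, one option is to rebuild it on X_test's index. This is only a minimal sketch: the function name `predictions_frame`, the output column names `y_true` and `y_pred`, and the toy data are made up for illustration, and it assumes (as in the question) that the text lives in a `text` column and the label is the last column.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


def predictions_frame(df):
    # Split pandas objects (not arrays), so X_test and y_test keep df's row labels.
    X = df["text"]
    y = df.iloc[:, -1]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    pipeline = Pipeline([
        ("CountVectorizer", CountVectorizer()),
        ("LogisticRegression", LogisticRegression(random_state=0)),
    ])
    pipeline.fit(X_train, y_train)

    # Reuse X_test.index so each prediction lines up with its original row.
    return pd.DataFrame(
        {
            "text": X_test,
            "y_true": y_test,
            "y_pred": pipeline.predict(X_test),
        },
        index=X_test.index,
    )


# Toy data, invented purely for illustration.
toy = pd.DataFrame({
    "text": ["good film", "bad film", "great plot", "awful plot", "good cast",
             "bad cast", "great fun", "awful mess", "good fun", "bad mess"],
    "label": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})
result = predictions_frame(toy)

# Join back onto the full frame by index if the other columns are needed.
merged = toy.join(result[["y_true", "y_pred"]], how="inner")
```

Because the result carries the original index, `toy.loc[result.index]` (or the `join` above) recovers every other column of the test rows.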
Answered By: Mattravel
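
If the vectorizing has to stay before the split, as in the question, another possibility is to pass an array of row positions through train_test_split alongside X and y, since it splits every array it receives in the same way. A rough sketch under the same assumptions (a `text` column, label in the last column); the function name `lr_model_keep_index`, the `y_pred` column, and the toy data are hypothetical, and the features are left as a sparse matrix instead of calling `.toarray()`:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def lr_model_keep_index(df):
    # Vectorize the whole corpus first, as in the question.
    X = CountVectorizer().fit_transform(df["text"])  # sparse matrix
    y = df.iloc[:, -1].values

    # Row positions travel through the split together with X and y.
    positions = np.arange(len(df))
    X_train, X_test, y_train, y_test, pos_train, pos_test = train_test_split(
        X, y, positions, test_size=0.2, random_state=0
    )

    classifier = LogisticRegression(random_state=0)
    classifier.fit(X_train, y_train)

    # Pull the original test rows by position and attach the predictions.
    out = df.iloc[pos_test].copy()
    out["y_pred"] = classifier.predict(X_test)
    return out


# Toy data, invented purely for illustration.
toy_df = pd.DataFrame({
    "text": ["good film", "bad film", "great plot", "awful plot", "good cast",
             "bad cast", "great fun", "awful mess", "good fun", "bad mess"],
    "label": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})
test_rows = lr_model_keep_index(toy_df)
```

The returned frame holds the original test rows, with their original index labels and true labels, plus a `y_pred` column.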