Logistic regression and cross-validation

Question:

I am trying to solve a classification problem on a given dataset, through logistic regression (and this is not the problem). To avoid overfitting I’m trying to implement it through cross-validation (and here’s the problem): there’s something that I’m missing to complete the program. My purpose here is to determine accuracy.

But let me be specific. This is what I’ve done:

  1. I split the set into train set and test set
  2. I defined the logregression prediction model to be used
  3. I used the cross_val_predict method (in sklearn.cross_validation) to make predictions
  4. Lastly, I measured accuracy

Here is the code:

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.cross_validation import train_test_split
from sklearn import metrics, cross_validation
from sklearn.linear_model import LogisticRegression
 
# read training data in pandas dataframe
data = pd.read_csv("./dataset.csv", delimiter=';')
# last column is target, store in array t
t = data['TARGET']
# list of features, including target
features = data.columns
# item feature matrix in X
X = data[features[:-1]].as_matrix()
# remove first column because it is not necessary in the analysis
X = np.delete(X,0,axis=1)
# divide in training and test set
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.2, random_state=0)

# define method
logreg=LogisticRegression()

# cross-validation prediction
predicted = cross_validation.cross_val_predict(logreg, X_train, t_train, cv=10)
print(metrics.accuracy_score(t_train, predicted)) 

My problems:

  • From what I understand, the test set should not be considered until the very end, and cross-validation should be done on the training set. That’s why I passed X_train and t_train to the cross_val_predict method. However, I get an error saying:

    ValueError: Found input variables with inconsistent numbers of samples: [6016, 4812]

    where 6016 is the number of samples in the whole dataset, and 4812 is the number of samples in the training set after the dataset has been split

  • After this, I don’t know what to do. I mean: when do the X_test and t_test come into play? I don’t get how I should use them after cross-validating and how to get the final accuracy.

Bonus question: I’d also like to perform scaling and reduction of dimensionality (through feature selection or PCA) within each step of the cross-validation. How can I do this? I’ve seen that defining a pipeline can help with scaling, but I don’t know how to apply this to the second problem.

Asked By: Harnak


Answers:

Here is working code tested on a sample dataframe. The first issue in your code is that the target array is not an np.array. You also shouldn’t have target data in your features. Below I illustrate how to manually split the training and testing data using train_test_split. I also show how to use the wrapper cross_val_score to automatically split, fit, and score.

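import string

import numpy as np
import pandas as pd
from sklearn import linear_model, model_selection

np.random.seed(42)  # seed NumPy's generator, which produces the data below
# Create example df with alphabetic col names.
alphabet_cols = list(string.ascii_uppercase)[:26]
df = pd.DataFrame(np.random.randint(1000, size=(1000, 26)),
                  columns=alphabet_cols)
df['Target'] = df['A']
df.drop(['A'], axis=1, inplace=True)
print(df.head())
y = df.Target.values  # df['Target'] is not an np.array.
feature_cols = [i for i in list(df.columns) if i != 'Target']
# .ix and .as_matrix() have been removed from pandas; index by column names instead.
X = df[feature_cols].values
# Illustrated here for manual splitting of training and testing data.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=0)

# Initialize model.
# Note: this is LinearRegression, so the scores printed below are R^2 values.
# For the classification problem in the question, use
# linear_model.LogisticRegression() to get accuracy instead.
logreg = linear_model.LinearRegression()

# Use cross_val_score to automatically split, fit, and score.
scores = model_selection.cross_val_score(logreg, X, y, cv=10)
print(scores)
print('average score: {}'.format(scores.mean()))

Output (from the original run; exact values will differ)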

     B    C    D    E    F    G    H    I    J    K   ...    Target
0   20   33  451    0  420  657  954  156  200  935   ...    253
1  427  533  801  183  894  822  303  623  455  668   ...    421
2  148  681  339  450  376  482  834   90   82  684   ...    903
3  289  612  472  105  515  845  752  389  532  306   ...    639
4  556  103  132  823  149  974  161  632  153  782   ...    347

[5 rows x 26 columns]
[-0.0367 -0.0874 -0.0094 -0.0469 -0.0279 -0.0694 -0.1002 -0.0399  0.0328
 -0.0409]
average score: -0.04258093018969249


Answered By: user4322543

Please read the scikit-learn documentation on cross-validation to understand it better.

Also, you are using cross_val_predict incorrectly. Internally, it uses the cv you supplied (cv=10) to split the supplied data (i.e. X_train, t_train in your case) again into train and test folds, fits the estimator on each train fold, and predicts on the data that remains in the corresponding test fold.

Now, as for your X_test and t_test: first fit your estimator on the training data (cross_val_predict fits internal clones and leaves the estimator you pass in unfitted), then use it to predict on the test data, and then calculate the accuracy.

Here is a simple code snippet illustrating the above (borrowing from your code). Do read the comments, and ask if anything is unclear:

import numpy as np
from sklearn import metrics, model_selection  # model_selection replaces the removed sklearn.cross_validation
from sklearn.linear_model import LogisticRegression

# 'data', 'features' and 't' are defined as in your code
# item feature matrix in X
X = data[features[:-1]].values  # as_matrix() has been removed from pandas
# remove first column because it is not necessary in the analysis
X = np.delete(X, 0, axis=1)
# divide in training and test set
X_train, X_test, t_train, t_test = model_selection.train_test_split(
    X, t, test_size=0.2, random_state=0)

# Until here everything is good
# You keep away 20% of data for testing (test_size=0.2)
# This test data should be unseen by any of the below methods

# define method
logreg = LogisticRegression()

# What you are doing here should be correct, unless something went wrong in the dataframe operations (which apparently has been solved)
# cross-validation prediction
# cross_val_predict returns the predicted values for 't_train'
predicted = model_selection.cross_val_predict(logreg, X_train, t_train, cv=10)
# internal working of cross_val_predict:
  #1. Get the data and estimator (logreg, X_train, t_train)
  #2. From here on, we will use X_train as X_cv and t_train as t_cv (because cross_val_predict doesn't know that it's our training data)
  #3. Split X_cv, t_cv into X_cv_train, X_cv_test, t_cv_train, t_cv_test by using its internal cv
  #4. Use X_cv_train, t_cv_train for fitting 'logreg'
  #5. Predict on X_cv_test (No use of t_cv_test)
  #6. Repeat steps 3 to 5 for cv=10 iterations, each time using different data for training and different data for testing.

# So here you are correctly comparing 'predicted' and 't_train'
print(metrics.accuracy_score(t_train, predicted))

# The above metric shows how the estimator 'logreg' performs on 'X_train' data. If the accuracies are very high it may be because of overfitting.

# Now what to do about the X_test and t_test above.
# Actually the correct pair for the final metric is X_test and t_test
# If you are satisfied by the accuracies on the training data then you should fit the estimator on the entire training data and then predict on X_test

logreg.fit(X_train, t_train)
t_pred = logreg.predict(X_test)

# Here is the final accuracy
print(metrics.accuracy_score(t_test, t_pred))
# If this accuracy is good, then your model is good.
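
For reference, the same workflow can be run end to end on synthetic data. This is a minimal sketch under stated assumptions: make_classification replaces the asker's dataset.csv, and the names X_all, t_all, X_tr, X_te, t_tr, t_te, model, cv_pred, cv_acc and test_acc are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic stand-in for the real data (an assumption).
X_all, t_all = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out 20% as a test set that cross-validation never sees.
X_tr, X_te, t_tr, t_te = train_test_split(X_all, t_all, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)

# One out-of-fold prediction per training sample.
cv_pred = cross_val_predict(model, X_tr, t_tr, cv=10)
cv_acc = accuracy_score(t_tr, cv_pred)

# cross_val_predict fits clones, so `model` itself is still unfitted here.
was_fitted_by_cv = hasattr(model, "coef_")

# Final step: fit on the whole training set, then evaluate once on the test set.
model.fit(X_tr, t_tr)
test_acc = accuracy_score(t_te, model.predict(X_te))

print('cross-validated accuracy on training data: {:.3f}'.format(cv_acc))
print('accuracy on held-out test data: {:.3f}'.format(test_acc))
```

The cross-validated figure is what you use while choosing a model; the test figure is computed once, at the very end.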

If you have little data or don’t want to split the data into training and testing sets, then you should use the approach suggested by @fuzzyhedge:

# Use cross_val_score on all your data
scores = model_selection.cross_val_score(logreg, X, y, cv=10)

# 'cross_val_score' works almost the same as cross_val_predict for steps 1 to 4; then:
  #5. t_cv_pred = logreg.predict(X_cv_test) and calculate accuracy with t_cv_test.
  #6. Repeat steps 3 to 5 for cv_iterations = 10
  #7. Return array of accuracies calculated in step 5.

# Average the returned accuracies to see the model performance
mean_score = scores.mean()
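
The steps listed in the comments above can be written out as an explicit loop. This is a sketch rather than scikit-learn's actual implementation: it assumes synthetic data and a StratifiedKFold splitter, and the names X_cv, t_cv, estimator, folds, manual_scores and library_scores are illustrative only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data (an assumption).
X_cv, t_cv = make_classification(n_samples=300, n_features=8, random_state=1)
estimator = LogisticRegression(max_iter=1000)
folds = StratifiedKFold(n_splits=10)

manual_scores = []
for train_idx, test_idx in folds.split(X_cv, t_cv):
    # Fit a fresh copy on the training part of this fold.
    fold_model = clone(estimator)
    fold_model.fit(X_cv[train_idx], t_cv[train_idx])
    # Score (accuracy) on the part that was held out.
    manual_scores.append(fold_model.score(X_cv[test_idx], t_cv[test_idx]))
manual_scores = np.array(manual_scores)

# The library call with the same splitter yields the same per-fold accuracies.
library_scores = cross_val_score(estimator, X_cv, t_cv, cv=folds)
print(manual_scores.mean(), library_scores.mean())
```

Passing the splitter object as cv, instead of the integer 10, makes the folds explicit so the two results can be compared directly.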

Note: cross-validation is best used together with grid search to find the estimator parameters that perform best for the given data.
For example, LogisticRegression has many parameters, but

logreg = LogisticRegression()

initializes the model with the default parameters only. A different parameter setting, such as

logreg = LogisticRegression(penalty='l1', solver='liblinear')

may perform better for your data. Searching for such better parameters is what grid search does.
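
A grid search over one parameter might look like the sketch below. The candidate values for C, the synthetic data and the names X_gs, t_gs, param_grid and search are assumptions chosen for illustration; other parameters can be added to the grid in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (an assumption).
X_gs, t_gs = make_classification(n_samples=600, n_features=15, random_state=2)
X_gs_train, X_gs_test, t_gs_train, t_gs_test = train_test_split(
    X_gs, t_gs, test_size=0.2, random_state=0)

# Candidate values for the inverse regularization strength C (illustrative).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Every candidate is evaluated with 5-fold cross-validation on the training data only.
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_gs_train, t_gs_train)

print('best parameters:', search.best_params_)
print('best cross-validated accuracy: {:.3f}'.format(search.best_score_))
# By default the best model is refitted on all training data; score it once on the test set.
print('test accuracy: {:.3f}'.format(search.score(X_gs_test, t_gs_test)))
```

The test set stays untouched during the search and is used only for the final score.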

Now, as for the second part of your question (scaling, dimensionality reduction, etc. within each cross-validation step): use a pipeline. Refer to the scikit-learn documentation on Pipeline for details.
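
A minimal sketch of scaling plus PCA inside cross-validation, assuming synthetic data and an arbitrary choice of 5 principal components. The step names "scale", "pca" and "clf" and the variable names X_pl, t_pl, pipe and pipe_scores are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption).
X_pl, t_pl = make_classification(n_samples=500, n_features=20,
                                 n_informative=5, random_state=3)

# The pipeline behaves like a single estimator: in each fold the scaler and
# PCA are fitted on that fold's training part only, then applied to its test part.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])

pipe_scores = cross_val_score(pipe, X_pl, t_pl, cv=10)
print('mean accuracy: {:.3f}'.format(pipe_scores.mean()))
```

Because the preprocessing is refitted inside every fold, no information from the held-out part leaks into the scaler or the PCA. A pipeline can also be passed to GridSearchCV, with parameters addressed as step name, two underscores, then parameter name (for example pca__n_components).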

Answered By: Vivek Kumar