What is the difference between X_test, X_train, y_test, y_train in sklearn?

Question:

I’m learning sklearn, and I didn’t understand very well the difference between the four outputs of the function train_test_split(), or why there are four of them.

In the documentation I found some examples, but they weren’t enough to resolve my doubts.

Does the code use X_train to predict X_test, or use X_train to predict y_test?

What is the difference between train and test? Do I use the train set to predict the test set, or something similar?

I’m very confused about it. Below is the example provided in the documentation.

>>> import numpy as np  
>>> from sklearn.model_selection import train_test_split  
>>> X, y = np.arange(10).reshape((5, 2)), range(5)  
>>> X
array([[0, 1], 
       [2, 3],  
       [4, 5],  
       [6, 7],  
       [8, 9]])  
>>> list(y)  
[0, 1, 2, 3, 4] 
>>> X_train, X_test, y_train, y_test = train_test_split(  
...     X, y, test_size=0.33, random_state=42)  
...  
>>> X_train  
array([[4, 5], 
       [0, 1],  
       [6, 7]])  
>>> y_train  
[2, 0, 3]  
>>> X_test  
array([[2, 3], 
       [8, 9]])  
>>> y_test  
[1, 4]  
>>> train_test_split(y, shuffle=False)  
[[0, 1, 2], [3, 4]]
Asked By: Jancer Lima


Answers:

You’re supposed to train your classifier/regressor on your training set, and test/evaluate it on your testing set.

Your classifier/regressor uses X_train to predict y_pred, and uses the difference between y_pred and y_train (through a loss function) to learn. Then you evaluate it by computing the loss between its predictions on X_test (which could also be called y_pred) and y_test.
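The train → predict → evaluate cycle described above can be sketched as follows (a minimal example on synthetic data, with illustrative variable names; here the target is simply the sum of the two features):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = X.sum(axis=1)                  # target: sum of the two features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)            # learn from the training pair (X_train, y_train)
y_pred = model.predict(X_test)         # predict targets for the unseen test inputs
loss = mean_squared_error(y_test, y_pred)  # compare predictions against y_test
```

Since the target is an exact linear function of the features, the test loss here comes out essentially zero; with real data it measures how well the model generalizes.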

Answered By: Thomas Schillaci

Below is a dummy pandas.DataFrame for example:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

df = pd.DataFrame({'X1': [100,120,140,200,230,400,500,540,600,625],
                   'X2': [14,15,22,24,23,31,33,35,40,40],
                   'Y':  [0,0,0,0,1,1,1,1,1,1]})

Here we have three columns: X1, X2, and Y.
Suppose X1 and X2 are your independent variables and the 'Y' column is your dependent variable.

X = df[['X1','X2']]
y = df['Y']

With sklearn.model_selection.train_test_split you create four portions of data, which will be used for fitting and predicting.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

Now

1). X_train – This includes all your independent variables used to train the model. Since we specified test_size=0.4, 60% of the observations from your complete data will be used to train/fit the model, and the remaining 40% will be used to test it.

2). X_test – This is the remaining 40% of the independent variables, which is not used in the training phase and is used only to make predictions and test the accuracy of the model.

3). y_train – This is your dependent variable, the one the model needs to learn to predict. It contains the category labels corresponding to your X_train observations; we need to supply it while training/fitting the model.

4). y_test – This contains the category labels for your test data. These labels are used to compare actual and predicted categories and measure the accuracy of the model.
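The sizes of the four pieces can be checked directly (continuing with the DataFrame above; with test_size=0.4 on 10 rows, the split is 6 training rows and 4 test rows):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'X1': [100,120,140,200,230,400,500,540,600,625],
                   'X2': [14,15,22,24,23,31,33,35,40,40],
                   'Y':  [0,0,0,0,1,1,1,1,1,1]})
X = df[['X1','X2']]
y = df['Y']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

print(X_train.shape, X_test.shape)  # (6, 2) (4, 2) -> 60% / 40% split
print(y_train.shape, y_test.shape)  # (6,) (4,)
```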

Now you can fit a model on this data; let’s fit sklearn.linear_model.LogisticRegression.

logreg = LogisticRegression()
logreg.fit(X_train, y_train) #This is where the training is taking place
y_pred_logreg = logreg.predict(X_test) #Making predictions to test the model on test data
print('Logistic Regression Train accuracy %s' % logreg.score(X_train, y_train)) #Train accuracy
#Logistic Regression Train accuracy 0.8333333333333334
print('Logistic Regression Test accuracy %s' % accuracy_score(y_test, y_pred_logreg)) #Test accuracy
#Logistic Regression Test accuracy 0.5
print(confusion_matrix(y_test, y_pred_logreg)) #Confusion matrix
print(classification_report(y_test, y_pred_logreg)) #Classification Report

You can read more about metrics and about data splitting in the scikit-learn documentation.

Hope this helps:)

Answered By: ManojK

Consider X as 1000 data points and Y as the integer class labels (the class to which each data point belongs).

Eg:
X = [1.24, 2.36, 3.24, … (1000 terms)]
Y = [1, 0, 0, 1, … (1000 terms)]

We are splitting in a 600:400 ratio

X_train => will have 600 data points

X_test => will have 400 data points

Y_train => will have the class labels corresponding to the 600 training points

Y_test => will have the class labels corresponding to the 400 test points
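This 600:400 split can be verified in code (a sketch on random data, purely to show the sizes; with sklearn, a 600:400 ratio corresponds to test_size=0.4):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 1))              # 1000 data points
Y = rng.integers(0, 2, 1000)           # 1000 class labels (0 or 1)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.4, random_state=0)  # 600:400 ratio

print(len(X_train), len(Y_train))  # 600 600
print(len(X_test), len(Y_test))    # 400 400
```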

Let’s say we have this data

Age    Sex       Disease
----  ------ |  ---------
  
  X_train    |   y_train   )
                           )
 5       F   |  A Disease  )
 15      M   |  B Disease  ) 
 23      M   |  B Disease  ) training
 39      M   |  B Disease  ) data
 61      F   |  C Disease  )
 55      M   |  F Disease  )
 76      F   |  D Disease  )
 88      F   |  G Disease  )
-------------|------------
   
  X_test     |    y_test

 63      M   |  C Disease  )
 46      F   |  C Disease  ) test
 28      M   |  B Disease  ) data
 33      F   |  B Disease  )

X_train contains the values of the features (age and sex => training data)

y_train contains the target output corresponding to the X_train values (disease => training data), i.e. the values the model should learn to produce during training.

The training process also generates predicted values, which should be very close to (or the same as) the y_train values if the model is a successful one.

X_test contains the values of the features to be tested after training (age and sex => test data)

y_test contains the target output (disease => test data) corresponding to X_test (age and sex => test data), and is compared against the model’s predictions on X_test after training in order to determine how successful the model is.
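A toy sketch of the table above: train a classifier on the 8 training rows, then compare its predictions for the 4 test rows against y_test (the 0/1 encoding of Sex and the choice of DecisionTreeClassifier are illustrative assumptions, not part of the original answer):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Sex encoded as 0 = F, 1 = M (an assumed encoding for this sketch)
X_train = pd.DataFrame({'Age': [5, 15, 23, 39, 61, 55, 76, 88],
                        'Sex': [0, 1, 1, 1, 0, 1, 0, 0]})
y_train = ['A', 'B', 'B', 'B', 'C', 'F', 'D', 'G']

X_test = pd.DataFrame({'Age': [63, 46, 28, 33],
                       'Sex': [1, 0, 1, 0]})
y_test = ['C', 'C', 'B', 'B']

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)                 # predictions for the 4 test rows
acc = accuracy_score(y_test, y_pred)         # fraction of test rows predicted correctly
print(acc)
```

On a dataset this tiny the accuracy itself is not meaningful; the point is the workflow: fit on (X_train, y_train), predict on X_test, compare against y_test.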

Answered By: caner