Why is cross_val_score not producing consistent results?

Question

When this code executes the results are not consistent.
Where is the randomness coming from?

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

seed = 42
iris = datasets.load_iris()
X = iris.data
y = iris.target

pipeline = Pipeline([('std', StandardScaler()), 
                     ('pca', PCA(n_components = 4)), 
                     ('Decision_tree', DecisionTreeClassifier())], 
                    verbose = False)

kfold = KFold(n_splits = 10, random_state = seed, shuffle = True)
results = cross_val_score(pipeline, X, y, cv = kfold)
print(results.mean())


0.9466666666666667
0.9266666666666665
0.9466666666666667
0.9400000000000001
0.9266666666666665

Asked By: nicomp

||

Source

Answer 1

DecisionTreeClassifier does not use all columns, but by default the sqrt of the number of columns for each split. You assigned the seed to KFold, but not to DecisionTreeClassifier. So different columns will be selected each run. PCA also accepts a random state.

See DecisionTreeClassifier and PCA

Answered By: Jan

Why is cross_val_score not producing consistent results?

Question:

Answers: