Why does adding duplicated features improve Logistic Regression accuracy?

Question:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

for i in range(5):
    X_redundant = np.c_[X, X[:, :i]]  # append the first i columns again as duplicates
    print(X_redundant.shape)
    clf = LogisticRegression(random_state=0, max_iter=1000).fit(X_redundant, y)
    print(clf.score(X_redundant, y))

Output

(150, 4)
0.9733333333333334
(150, 5)
0.98
(150, 6)
0.98
(150, 7)
0.9866666666666667
(150, 8)
0.9866666666666667

Question: Why does the score (accuracy by default) increase as more redundant features are added for Logistic Regression?

I expected the score to remain the same, by analogy with LinearRegression's behaviour.

With LinearRegression, the score (R² by default) does not change as more columns are added, because LinearRegression distributes the coefficient evenly between each pair of redundant columns:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

X, y = load_iris(return_X_y=True)
X, y = X[:, :-1], X[:, -1]

for i in range(4):
    X_redundant = np.c_[X, X[:, :i]]  # append the first i columns again as duplicates
    print(X_redundant.shape)
    clf = LinearRegression().fit(X_redundant, y)
    print(clf.score(X_redundant, y))
    print(clf.coef_)

Output

(150, 3)
0.9378502736046809
[-0.20726607  0.22282854  0.52408311]
(150, 4)
0.9378502736046809
[-0.10363304  0.22282854  0.52408311 -0.10363304]
(150, 5)
0.9378502736046809
[-0.10363304  0.11141427  0.52408311 -0.10363304  0.11141427]
(150, 6)
0.9378502736046809
[-0.10363304  0.11141427  0.26204156 -0.10363304  0.11141427  0.26204156]
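For reference, the even split itself comes from how the rank-deficient system is solved: with duplicated columns there are infinitely many least-squares solutions, and LinearRegression returns a minimum-norm one, which spreads the weight equally over identical columns. Below is a minimal numpy sketch with made-up data; the np.linalg.lstsq call is an assumption standing in for LinearRegression's dense solver path, not its exact internals.

import numpy as np

# Made-up one-feature regression problem, purely for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y = 3 * x[:, 0] + rng.normal(scale=0.1, size=100)

# One copy of the feature: the fitted coefficient is roughly 3.
w_single, *_ = np.linalg.lstsq(x, y, rcond=None)

# Two identical copies: the system is rank-deficient, and lstsq
# returns the minimum-norm solution, splitting the weight ~1.5 / ~1.5.
w_double, *_ = np.linalg.lstsq(np.c_[x, x], y, rcond=None)

print(w_single)   # ~[3.]
print(w_double)   # ~[1.5, 1.5]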
Asked By: Han Qi


Answers:

This is because LogisticRegression applies L2 regularization by default. Duplicating a feature lets the model split that feature's weight across the copies: the predictions are unchanged, but the penalty shrinks, since a weight w on one column contributes w² to the penalty while w/2 on each of two copies contributes only 2·(w/2)² = w²/2. The regularization therefore constrains the fit less, and the training accuracy can creep up. Set penalty="none" or penalty=None (depending on your version of sklearn) and you should see the behavior you expected.
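A minimal sketch of that check, reusing the setup from the question (penalty=None requires scikit-learn >= 1.2; older versions spell it penalty="none"):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

for i in range(5):
    X_redundant = np.c_[X, X[:, :i]]  # append the first i columns again
    # penalty=None disables the default L2 regularization
    clf = LogisticRegression(penalty=None, random_state=0,
                             max_iter=1000).fit(X_redundant, y)
    print(X_redundant.shape, clf.score(X_redundant, y))

With the penalty removed, the printed training score should stay constant as duplicated columns are added.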

Answered By: Ben Reiniger