Is there something similar to R's brglm to help deal with quasi-separation in Python using statsmodels Logit?
Question:
I am using Logit from statsmodels to create a regression model.
I get the error LinAlgError: Singular matrix. When I then remove one variable at a time from my dataset, I eventually get a different error: PerfectSeparationError: Perfect separation detected, results not available.
I suspect that the original error (LinAlgError) is related to perfect separation because I had the same problem in R and got around it using a brglm (bias reduced glm).
I have a boolean y variable and 23 numeric and boolean x variables.
I have already run a VIF function to remove any variables which have high multicollinearity scores (I started with 26 variables).
I have tried using firth_regression.py (https://gist.github.com/johnlees/3e06380965f367e4894ea20fbae2b90d) instead to account for perfect separation, but I got a MemoryError.
I have tried LogisticRegression from sklearn, but it does not give me p-values, which I need.
I even tried removing one variable at a time from my dataset. When I got down to 4 variables (from 23), I got PerfectSeparationError: Perfect separation detected, results not available.
Has anyone experienced this and how do you get around it?
Appreciate any advice!
X = df.loc[:, df.columns != 'VehicleMake']
y = df.iloc[:,0]
# Split data
X_train, X_test, y_train, y_test = skl.model_selection.train_test_split(X, y, test_size=0.3)
Code in question:
# Perform logistic regression and get p values
logit_model = sm.Logit(y_train, X_train.astype(float))
result = logit_model.fit()
This is the firth_regression I tried instead which got me a memory error:
# For the firth_regression
import sys
import warnings
import math
import statsmodels
from scipy import stats
import statsmodels.formula.api as smf

def firth_likelihood(beta, logit):
    return -(logit.loglike(beta) + 0.5*np.log(np.linalg.det(-logit.hessian(beta))))

step_limit=1000
convergence_limit=0.0001
logit_model = smf.Logit(y_train, X_train.astype(float))
start_vec = np.zeros(X.shape[1])
beta_iterations = []
beta_iterations.append(start_vec)
for i in range(0, step_limit):
    pi = logit_model.predict(beta_iterations[i])
    W = np.diagflat(np.multiply(pi, 1-pi))
    var_covar_mat = np.linalg.pinv(-logit_model.hessian(beta_iterations[i]))
    # build hat matrix
    rootW = np.sqrt(W)
    H = np.dot(np.transpose(X_train), np.transpose(rootW))
    H = np.matmul(var_covar_mat, H)
    H = np.matmul(np.dot(rootW, X), H)
    # penalised score
    U = np.matmul(np.transpose(X_train), y - pi + np.multiply(np.diagonal(H), 0.5 - pi))
    new_beta = beta_iterations[i] + np.matmul(var_covar_mat, U)
    # step halving
    j = 0
    while firth_likelihood(new_beta, logit_model) > firth_likelihood(beta_iterations[i], logit_model):
        new_beta = beta_iterations[i] + 0.5*(new_beta - beta_iterations[i])
        j = j + 1
        if (j > step_limit):
            sys.stderr.write('Firth regression failed\n')
            None
    beta_iterations.append(new_beta)
    if i > 0 and (np.linalg.norm(beta_iterations[i] - beta_iterations[i-1]) < convergence_limit):
        break
return_fit = None
if np.linalg.norm(beta_iterations[i] - beta_iterations[i-1]) >= convergence_limit:
    sys.stderr.write('Firth regression failed\n')
else:
    # Calculate stats
    fitll = -firth_likelihood(beta_iterations[-1], logit_model)
    intercept = beta_iterations[-1][0]
    beta = beta_iterations[-1][1:].tolist()
    bse = np.sqrt(np.diagonal(-logit_model.hessian(beta_iterations[-1])))
    return_fit = intercept, beta, bse, fitll
#print(return_fit)
Answers:
I fixed my problem by changing the default method in the logit regression to method='bfgs'.
result = logit_model.fit(method = 'bfgs')
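Changing the optimizer may let the fit finish, but it does not by itself show whether separation is present in the data. As a rough diagnostic, the sketch below flags any single column whose values for the two outcome classes do not overlap (complete separation) or meet only at one boundary value (quasi-complete separation, typical of a boolean predictor where one level only ever has one outcome). The function name screen_single_column_separation and the toy data are made up for illustration. It assumes both classes are present and that the model has an intercept, and it cannot detect separation caused by a combination of columns.

```python
import numpy as np

def screen_single_column_separation(X, y, names):
    """Hypothetical helper: flag columns that separate y on their own."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).astype(bool)
    flagged = {}
    for j, name in enumerate(names):
        col = X[:, j]
        if col.min() == col.max():
            continue  # constant column (e.g. the intercept) carries no information
        x0, x1 = col[~y], col[y]  # values observed for y == 0 and y == 1
        if x0.max() < x1.min() or x1.max() < x0.min():
            flagged[name] = "complete"        # the two ranges do not overlap at all
        elif x0.max() == x1.min() or x1.max() == x0.min():
            flagged[name] = "quasi-complete"  # the ranges touch at one boundary value
    return flagged

# Toy data (not from the question): "flag" is 1 for every y == 1 row,
# but also 1 for one y == 0 row, so it quasi-separates the outcome.
X_demo = np.array([
    [1, 0, 2.0],
    [1, 0, 3.5],
    [1, 1, 3.0],
    [1, 1, 1.0],
    [1, 1, 4.0],
    [1, 1, 2.5],
])
y_demo = np.array([0, 0, 0, 1, 1, 1])
result = screen_single_column_separation(X_demo, y_demo, ["const", "flag", "dose"])
print(result)
```

With a DataFrame, something like screen_single_column_separation(X_train.astype(float).values, y_train.values, X_train.columns) should work the same way. Any column it reports is a candidate for dropping, recoding, or handling with a penalised method.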
A few years late for this question, but I'm working on a Python implementation of Firth logistic regression using the procedure detailed in the R logistf package and Heinze and Schemper, 2002. There are a few implementation differences compared to the gist you linked that make it much more memory efficient, and p-values are calculated using penalized likelihood ratio tests. Confidence intervals are also calculated.
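To illustrate where the memory goes in the gist: np.diagflat builds an n-by-n weight matrix and H is an n-by-n hat matrix, although only the diagonal of H is used in the penalised score. In float64, 100,000 rows would need roughly 80 GB for each of those matrices. The following is a minimal sketch of one way to get the same diagonal from p-by-p intermediates only. It is not necessarily how firthlogist does it internally, and hat_diagonal is a name chosen for this example.

```python
import numpy as np

def hat_diagonal(X, pi):
    """Diagonal of W^(1/2) X (X'WX)^-1 X' W^(1/2) without forming an n-by-n matrix."""
    w = pi * (1.0 - pi)                        # logistic weights, shape (n,)
    info = X.T @ (X * w[:, None])              # Fisher information X'WX, shape (p, p)
    cov = np.linalg.pinv(info)                 # pseudo-inverse, as in the gist
    return w * np.sum((X @ cov) * X, axis=1)   # h_i = w_i * x_i' cov x_i

# Small check against the dense construction used in the gist (toy data, fixed seed).
rng = np.random.default_rng(0)
X_demo = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
pi_demo = 1.0 / (1.0 + np.exp(-X_demo @ np.array([0.2, -1.0, 0.5])))

W = np.diagflat(pi_demo * (1.0 - pi_demo))     # n-by-n, only affordable for small n
rootW = np.sqrt(W)
H_dense = rootW @ X_demo @ np.linalg.pinv(X_demo.T @ W @ X_demo) @ X_demo.T @ rootW

h = hat_diagonal(X_demo, pi_demo)
print(np.allclose(h, np.diagonal(H_dense)))
```

In the update step, the penalised score can then be written as X.T @ (y - pi + h * (0.5 - pi)), which corresponds to the U line in the question's code.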
Obviously I don't have your data, so let's use the sex2 dataset included with the logistf R package.
>>> from firthlogist import FirthLogisticRegression, load_sex2
>>> fl = FirthLogisticRegression()
>>> X, y, feature_names = load_sex2()
>>> fl.fit(X, y)
FirthLogisticRegression()
>>> fl.summary(xname=feature_names)
            coef       std err    [0.025     0.975]    p-value
---------  ----------  --------  ---------  ---------  -----------
age        -1.10598    0.42366   -1.97379   -0.307427  0.00611139
oc         -0.0688167  0.443793  -0.941436   0.789202  0.826365
vic         2.26887    0.548416   1.27304    3.43543   1.67219e-06
vicl       -2.11141    0.543082  -3.26086   -1.11774   1.23618e-05
vis        -0.788317   0.417368  -1.60809    0.0151846 0.0534899
dia         3.09601    1.67501    0.774568   8.03028   0.00484687
Intercept   0.120254   0.485542  -0.818559   1.07315   0.766584
Log-Likelihood: -132.5394
Newton-Raphson iterations: 8
Compare results with brglm:
> library(brglm)
Loading required package: profileModel
'brglm' will gradually be superseded by the 'brglm2' R package (https://cran.r-project.org/package=brglm2), which provides utilities for mean and median bias reduction for all GLMs.
Methods for the detection of separation and infinite estimates in binomial-response models are provided by the 'detectseparation' R package (https://cran.r-project.org/package=detectseparation).
> fit <- brglm(case~age+oc+vic+vicl+vis+dia, data=logistf::sex2)
> summary(fit)
Call:
brglm(formula = case ~ age + oc + vic + vicl + vis + dia, data = logistf::sex2)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.12025    0.48554   0.248 0.804390
age         -1.10598    0.42366  -2.611 0.009040 **
oc          -0.06882    0.44379  -0.155 0.876770
vic          2.26887    0.54842   4.137 3.52e-05 ***
vicl        -2.11141    0.54308  -3.888 0.000101 ***
vis         -0.78832    0.41737  -1.889 0.058921 .
dia          3.09601    1.67501   1.848 0.064551 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 304.61 on 238 degrees of freedom
Residual deviance: 276.91 on 232 degrees of freedom
Penalized deviance: 265.0788
AIC: 290.91
The coefficients and standard errors match. The p-values differ (most visibly for dia) because firthlogist calculates them by penalized likelihood ratio tests. I think brglm uses Wald tests.
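The Pr(>|z|) column in the brglm output can be reproduced by hand from the estimate and standard error, which supports the guess about Wald tests. A small sketch using only the standard library (wald_p_value is a name made up for this example; the numbers are copied from the output above):

```python
import math

def wald_p_value(coef, std_err):
    """Two-sided p-value for z = coef / std_err under a standard normal."""
    z = coef / std_err
    # 2 * (1 - Phi(|z|)) written with the complementary error function
    return math.erfc(abs(z) / math.sqrt(2.0))

# (estimate, standard error) pairs taken from the tables above
estimates = {
    "age": (-1.10598, 0.42366),
    "dia": (3.09601, 1.67501),
}
for name, (coef, se) in estimates.items():
    print(name, round(coef / se, 3), wald_p_value(coef, se))
```

For dia this gives z of about 1.848 and a p-value close to the 0.064551 reported by brglm, whereas the penalized likelihood ratio test in the firthlogist summary reports 0.00484687 for the same coefficient.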