How to save each iterating Statsmodel as a file to be used later?
Question:
I have the following table generated:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
# Generate 'random' data
np.random.seed(0)
X = 2.5 * np.random.randn(10) + 1.5
res = 0.5 * np.random.randn(10)
y = 2 + 0.3 * X + res
Name = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
# Create pandas dataframe to store our X and y values
df = pd.DataFrame(
    {'Name': Name,
     'X': X,
     'y': y})
# Show the dataframe
df
Resulting in the following table:
Name | X | y |
---|---|---|
A | 5.910131 | 3.845061 |
B | 2.500393 | 3.477255 |
C | 3.946845 | 3.564572 |
D | 7.102233 | 4.191507 |
E | 6.168895 | 4.072600 |
F | -0.943195 | 1.883879 |
G | 3.875221 | 3.909606 |
H | 1.121607 | 2.233903 |
I | 1.241953 | 2.529120 |
J | 2.526496 | 2.330901 |
The following code iterates over the rows, excludes one row at a time, and builds a set of regression plots:
import statsmodels.formula.api as smf
import warnings
warnings.filterwarnings('ignore')
# Initialise and fit linear regression model using `statsmodels`
for row_index, row in df.iterrows():
    # dataframe with all rows except for one
    df_reduced = df[~(df.index == row_index)]
    model = smf.ols('X ~ y', data=df_reduced)
    model = model.fit()
    intercept, slope = model.params
    print(model.summary())
    y1 = intercept + slope * df_reduced.y.min()
    y2 = intercept + slope * df_reduced.y.max()
    plt.plot([df_reduced.y.min(), df_reduced.y.max()], [y1, y2], label=row.Name, color='red')
    plt.scatter(df_reduced.y, df_reduced.X)
    plt.legend()
    plt.savefig(f"All except {row.Name} analogue.pdf")
    plt.show()
The question is: how can I save each of the generated models as a file that can be used later? In the present example, there should be at least 9 regression models being generated. I would like to have each of them as a file that can also be identified by name.
The second question is: how can I add a space between each of the model summaries and plots in the visual output generated by matplotlib?
Answers:
You just need to add this inside your loop, after the model has been fitted: model.save(f"model_{row_index}.pkl")
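As a rough sketch of how that call could sit in the question's loop: the object returned by `.fit()` has a `save` method, and `OLSResults.load` reads the file back. The output folder `out_dir`, the `saved_paths` dictionary and the file-name pattern `model_without_{Name}.pkl` are assumptions made for illustration, not part of the original answer.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.regression.linear_model import OLSResults

# Same data as in the question
np.random.seed(0)
X = 2.5 * np.random.randn(10) + 1.5
res = 0.5 * np.random.randn(10)
y = 2 + 0.3 * X + res
df = pd.DataFrame({'Name': list('ABCDEFGHIJ'), 'X': X, 'y': y})

# Hypothetical output folder; a temporary directory keeps the sketch self-contained
out_dir = Path(tempfile.mkdtemp())

saved_paths = {}
for row_index, row in df.iterrows():
    df_reduced = df.drop(index=row_index)  # all rows except the current one
    results = smf.ols('X ~ y', data=df_reduced).fit()
    # Hypothetical naming scheme: one file per excluded row
    path = out_dir / f"model_without_{row.Name}.pkl"
    results.save(str(path))
    saved_paths[row.Name] = path

# Later: load one model back by the name of the excluded row
loaded = OLSResults.load(str(saved_paths['A']))
print(loaded.params)
```

Using the `Name` column in the file name, rather than the numeric index, makes it easier to tell later which row each model left out.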
Storing the trained models:
Assuming you have some naming procedure available for each model file mf, you can store a model using pickle.
import statsmodels.api as sm
import pickle
# Train your model (y and X are the arrays from the question; no constant is added)
model = sm.OLS(y, X).fit()
# Save the model to a file
with open('model.pickle', 'wb') as f:
    pickle.dump(model, f)
# Load the model from the file
with open('model.pickle', 'rb') as f:
    loaded_model = pickle.load(f)
print(loaded_model.summary())
gives the following output.
OLS Regression Results
=======================================================================================
Dep. Variable: y R-squared (uncentered): 0.525
Model: OLS Adj. R-squared (uncentered): 0.472
Method: Least Squares F-statistic: 9.931
Date: Mon, 03 Apr 2023 Prob (F-statistic): 0.0117
Time: 12:42:57 Log-Likelihood: -20.560
No. Observations: 10 AIC: 43.12
Df Residuals: 9 BIC: 43.42
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 0.8743 0.277 3.151 0.012 0.247 1.502
==============================================================================
Omnibus: 1.291 Durbin-Watson: 0.989
Prob(Omnibus): 0.524 Jarque-Bera (JB): 0.937
Skew: 0.637 Prob(JB): 0.626
Kurtosis: 2.209 Cond. No. 1.00
==============================================================================
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Note that I build the model with statsmodels.api instead of the formula API you use, to keep the example simple. You should be able to store and load your model in the same way, however.
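Applied to the leave-one-out loop from the question, the pickle approach might look like the sketch below. It keeps all fitted models in one dictionary keyed by the excluded row's name and writes that dictionary to a single file. The dictionary layout and the file name `leave_one_out_models.pickle` are assumptions for illustration.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Same data as in the question
np.random.seed(0)
X = 2.5 * np.random.randn(10) + 1.5
y = 2 + 0.3 * X + 0.5 * np.random.randn(10)
df = pd.DataFrame({'Name': list('ABCDEFGHIJ'), 'X': X, 'y': y})

# One fitted model per excluded row, keyed by the excluded row's name
models = {
    row.Name: smf.ols('X ~ y', data=df.drop(index=idx)).fit()
    for idx, row in df.iterrows()
}

# Hypothetical file name; the temporary directory is only there for the sketch
path = Path(tempfile.mkdtemp()) / 'leave_one_out_models.pickle'
with open(path, 'wb') as f:
    pickle.dump(models, f)

# Later: read everything back and pick a model by name
with open(path, 'rb') as f:
    loaded_models = pickle.load(f)

print(loaded_models['C'].params)
```

One file per model is easier to share piecemeal, while a single dictionary keeps the names and models together. Which one fits better depends on how the models are used later.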
I am not entirely sure I understand your questions regarding spacing of the outputs and plots correctly.
Spacing the summaries:
Just add empty print() statements, maybe?
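A minimal sketch of that idea, with a labelled header added so that each summary can be told apart. The helper name `join_with_spacing` and the header text are made up for illustration, and the placeholder strings stand in for `model.summary().as_text()`.

```python
def join_with_spacing(named_texts, blank_lines=2):
    """Join (name, text) pairs with a header per block and blank lines in between."""
    blocks = [f"=== All except {name} ===\n{text}" for name, text in named_texts]
    # One newline ends the previous block, the rest are the blank lines
    separator = "\n" * (blank_lines + 1)
    return separator.join(blocks)


# Placeholder strings stand in for model.summary().as_text()
demo = join_with_spacing([("A", "summary for A"), ("B", "summary for B")])
print(demo)
```

Inside the loop from the question, the same effect can be had by calling `print()` a couple of times after `print(model.summary())`.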
Spacing the plots:
You are generating entirely new plots every time, hence I do not understand the question. Feel free to give additional info and I will get back to you.
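If the aim is to see all fits in one figure with some room between the panels, which is only one possible reading of the question, a sketch could use `plt.subplots` together with `fig.subplots_adjust`. The 5 x 2 grid, the spacing values and the output file name are assumptions. `np.polyfit` stands in for the statsmodels fit to keep the sketch short, and the `Agg` backend is selected only so that it runs without a display.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Same data as in the question
np.random.seed(0)
X = 2.5 * np.random.randn(10) + 1.5
y = 2 + 0.3 * X + 0.5 * np.random.randn(10)
df = pd.DataFrame({'Name': list('ABCDEFGHIJ'), 'X': X, 'y': y})

# One panel per excluded row, arranged in an assumed 5 x 2 grid
fig, axes = plt.subplots(5, 2, figsize=(8, 16))
for ax, (row_index, row) in zip(axes.flat, df.iterrows()):
    df_reduced = df.drop(index=row_index)
    # np.polyfit stands in for smf.ols('X ~ y', ...); it returns [slope, intercept]
    slope, intercept = np.polyfit(df_reduced.y, df_reduced.X, deg=1)
    y_range = np.array([df_reduced.y.min(), df_reduced.y.max()])
    ax.scatter(df_reduced.y, df_reduced.X)
    ax.plot(y_range, intercept + slope * y_range, color='red')
    ax.set_title(f"All except {row.Name}")

# hspace/wspace set the gap between panels as a fraction of the average axes size
fig.subplots_adjust(hspace=0.6, wspace=0.3)

# Hypothetical output file in a temporary directory
out_path = Path(tempfile.mkdtemp()) / "all_leave_one_out_fits.pdf"
fig.savefig(out_path)
```

Larger `hspace` and `wspace` values widen the gaps between panels.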