Using statsmodel estimations with scikit-learn cross validation, is it possible?
Indeed, you cannot use cross_val_score
directly on statsmodels
objects, because of different interface: in statsmodels
- training data is passed directly into the constructor
- a separate object contains the result of model estimation
However, you can write a simple wrapper to make statsmodels
objects look like sklearn
estimators:
import statsmodels.api as smfrom sklearn.base import BaseEstimator, RegressorMixinclass SMWrapper(BaseEstimator, RegressorMixin): """ A universal sklearn-style wrapper for statsmodels regressors """ def __init__(self, model_class, fit_intercept=True): self.model_class = model_class self.fit_intercept = fit_intercept def fit(self, X, y): if self.fit_intercept: X = sm.add_constant(X) self.model_ = self.model_class(y, X) self.results_ = self.model_.fit() return self def predict(self, X): if self.fit_intercept: X = sm.add_constant(X) return self.results_.predict(X)
This class contains correct fit
and predict
methods, and can be used with sklearn
, e.g. cross-validated or included into a pipeline. Like here:
from sklearn.datasets import make_regressionfrom sklearn.model_selection import cross_val_scorefrom sklearn.linear_model import LinearRegressionX, y = make_regression(random_state=1, n_samples=300, noise=100)print(cross_val_score(SMWrapper(sm.OLS), X, y, scoring='r2'))print(cross_val_score(LinearRegression(), X, y, scoring='r2'))
You can see that the output of two models is identical, because they are both OLS models, cross-validated in the same way.
[0.28592315 0.37367557 0.47972639][0.28592315 0.37367557 0.47972639]
Following the suggestion of David (which gave me an error, complaining about missing function get_parameters
) and the scikit learn documentation, I created the following wrapper for a linear regression. It has the same interface of sklearn.linear_model.LinearRegression
but in addition has also the function summary()
, which gives the info about p-values, R2 and other statistics, as in statsmodels.OLS
.
import statsmodels.api as smfrom sklearn.base import BaseEstimator, RegressorMixinimport pandas as pdimport numpy as npfrom sklearn.utils.multiclass import check_classification_targetsfrom sklearn.utils.validation import check_X_y, check_is_fitted, check_arrayfrom sklearn.utils.multiclass import unique_labelsfrom sklearn.utils.estimator_checks import check_estimatorclass MyLinearRegression(BaseEstimator, RegressorMixin): def __init__(self, fit_intercept=True): self.fit_intercept = fit_intercept """ Parameters ------------ column_names: list It is an optional value, such that this class knows what is the name of the feature to associate to each column of X. This is useful if you use the method summary(), so that it can show the feature name for each coefficient """ def fit(self, X, y, column_names=() ): if self.fit_intercept: X = sm.add_constant(X) # Check that X and y have correct shape X, y = check_X_y(X, y) self.X_ = X self.y_ = y if len(column_names) != 0: cols = column_names.copy() cols = list(cols) X = pd.DataFrame(X) cols = column_names.copy() cols.insert(0,'intercept') print('X ', X) X.columns = cols self.model_ = sm.OLS(y, X) self.results_ = self.model_.fit() return self def predict(self, X): # Check is fit had been called check_is_fitted(self, 'model_') # Input validation X = check_array(X) if self.fit_intercept: X = sm.add_constant(X) return self.results_.predict(X) def get_params(self, deep = False): return {'fit_intercept':self.fit_intercept} def summary(self): print(self.results_.summary() )
Example of use:
cols = ['feature1','feature2']X_train = df_train[cols].valuesX_test = df_test[cols].valuesy_train = df_train['label']y_test = df_test['label']model = MyLinearRegression()model.fit(X_train, y_train)model.summary()model.predict(X_test)
If you want to show the names of the columns, you can call
model.fit(X_train, y_train, column_names=cols)
To use it in cross_validation:
from sklearn.model_selection import cross_val_scorescores = cross_val_score(MyLinearRegression(), X_train, y_train, cv=10, scoring='neg_mean_squared_error')scores
For reference purpose, if you use the statsmodels
formula API and/or use the fit_regularized
method, you can modify @David Dale's wrapper class in this way.
import pandas as pdfrom sklearn.base import BaseEstimator, RegressorMixinfrom statsmodels.formula.api import glm as glm_sm# This is an example wrapper for statsmodels GLMclass SMWrapper(BaseEstimator, RegressorMixin): def __init__(self, family, formula, alpha, L1_wt): self.family = family self.formula = formula self.alpha = alpha self.L1_wt = L1_wt self.model = None self.result = None def fit(self, X, y): data = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1) data.columns = X.columns.tolist() + ['y'] self.model = glm_sm(self.formula, data, family=self.family) self.result = self.model.fit_regularized(alpha=self.alpha, L1_wt=self.L1_wt, refit=True) return self.result def predict(self, X): return self.result.predict(X)