Using statsmodel estimations with scikit-learn cross validation, is it possible?

Indeed, you cannot use cross_val_score directly on statsmodels objects, because they have a different interface (see the sketch after this list): in statsmodels

  • training data is passed directly into the constructor
  • a separate object contains the result of model estimation
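A minimal sketch of the two calling conventions on a toy dataset (the values in the comments are approximate):

    import numpy as np
    import statsmodels.api as sm
    from sklearn.linear_model import LinearRegression

    X = np.arange(10, dtype=float).reshape(-1, 1)
    y = 2 * X.ravel() + 1

    # statsmodels: the data goes into the constructor,
    # and fit() returns a separate results object
    results = sm.OLS(y, sm.add_constant(X)).fit()
    print(results.params)           # approx. [1. 2.]

    # sklearn: the estimator is constructed without data,
    # and fit(X, y) stores the fitted state on the estimator itself
    lr = LinearRegression().fit(X, y)
    print(lr.intercept_, lr.coef_)  # approx. 1.0 [2.]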

However, you can write a simple wrapper to make statsmodels objects look like sklearn estimators:

    import statsmodels.api as sm
    from sklearn.base import BaseEstimator, RegressorMixin

    class SMWrapper(BaseEstimator, RegressorMixin):
        """ A universal sklearn-style wrapper for statsmodels regressors """
        def __init__(self, model_class, fit_intercept=True):
            self.model_class = model_class
            self.fit_intercept = fit_intercept

        def fit(self, X, y):
            if self.fit_intercept:
                X = sm.add_constant(X)
            # statsmodels takes (endog, exog), i.e. (y, X): note the reversed order
            self.model_ = self.model_class(y, X)
            self.results_ = self.model_.fit()
            return self

        def predict(self, X):
            if self.fit_intercept:
                X = sm.add_constant(X)
            return self.results_.predict(X)

This class implements the fit and predict methods expected by sklearn, so it can be cross-validated or included in a pipeline, like here:

    from sklearn.datasets import make_regression
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LinearRegression

    X, y = make_regression(random_state=1, n_samples=300, noise=100)

    print(cross_val_score(SMWrapper(sm.OLS), X, y, scoring='r2'))
    print(cross_val_score(LinearRegression(), X, y, scoring='r2'))

You can see that the output of the two models is identical, because both are OLS models, cross-validated in the same way.

    [0.28592315 0.37367557 0.47972639]
    [0.28592315 0.37367557 0.47972639]
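Because the wrapper exposes the standard estimator interface, it also drops into a pipeline. A minimal sketch reusing SMWrapper and the data above (the StandardScaler step is just an illustrative choice):

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # scale the features, then fit a statsmodels OLS through the wrapper
    pipe = make_pipeline(StandardScaler(), SMWrapper(sm.OLS))
    print(cross_val_score(pipe, X, y, scoring='r2'))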


Following David's suggestion (which gave me an error, complaining about the missing function get_parameters) and the scikit-learn documentation, I created the following wrapper for a linear regression. It has the same interface as sklearn.linear_model.LinearRegression, but in addition it also has a summary() method, which reports p-values, R2, and other statistics, as in statsmodels.OLS.

    import statsmodels.api as sm
    import pandas as pd
    from sklearn.base import BaseEstimator, RegressorMixin
    from sklearn.utils.validation import check_X_y, check_is_fitted, check_array

    class MyLinearRegression(BaseEstimator, RegressorMixin):
        def __init__(self, fit_intercept=True):
            self.fit_intercept = fit_intercept

        def fit(self, X, y, column_names=()):
            """
            Parameters
            ----------
            column_names: list
                Optional feature names to associate with the columns of X.
                Useful with the method summary(), so that it can show the
                feature name for each coefficient.
            """
            if self.fit_intercept:
                X = sm.add_constant(X)
            # Check that X and y have correct shape
            X, y = check_X_y(X, y)
            self.X_ = X
            self.y_ = y
            if len(column_names) != 0:
                X = pd.DataFrame(X)
                cols = list(column_names)
                if self.fit_intercept:
                    cols.insert(0, 'intercept')
                X.columns = cols
            self.model_ = sm.OLS(y, X)
            self.results_ = self.model_.fit()
            return self

        def predict(self, X):
            # Check that fit has been called
            check_is_fitted(self, 'model_')
            # Input validation
            X = check_array(X)
            if self.fit_intercept:
                X = sm.add_constant(X)
            return self.results_.predict(X)

        def get_params(self, deep=False):
            return {'fit_intercept': self.fit_intercept}

        def summary(self):
            print(self.results_.summary())

Example of use:

    cols = ['feature1', 'feature2']

    X_train = df_train[cols].values
    X_test = df_test[cols].values
    y_train = df_train['label']
    y_test = df_test['label']

    model = MyLinearRegression()
    model.fit(X_train, y_train)
    model.summary()
    model.predict(X_test)

If you want to show the names of the columns, you can call

    model.fit(X_train, y_train, column_names=cols)

To use it in cross-validation:

    from sklearn.model_selection import cross_val_score

    scores = cross_val_score(MyLinearRegression(), X_train, y_train, cv=10, scoring='neg_mean_squared_error')
    scores


For reference, if you use the statsmodels formula API and/or the fit_regularized method, you can modify @David Dale's wrapper class in this way.

    import pandas as pd
    from sklearn.base import BaseEstimator, RegressorMixin
    from statsmodels.formula.api import glm as glm_sm

    # This is an example wrapper for statsmodels GLM
    class SMWrapper(BaseEstimator, RegressorMixin):
        def __init__(self, family, formula, alpha, L1_wt):
            self.family = family
            self.formula = formula
            self.alpha = alpha
            self.L1_wt = L1_wt
            self.model = None
            self.result = None

        def fit(self, X, y):
            data = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
            data.columns = X.columns.tolist() + ['y']
            self.model = glm_sm(self.formula, data, family=self.family)
            self.result = self.model.fit_regularized(alpha=self.alpha, L1_wt=self.L1_wt, refit=True)
            return self  # return the estimator itself, as sklearn expects

        def predict(self, X):
            return self.result.predict(X)
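A hypothetical usage sketch (the column names, formula, and penalty values are invented for illustration; note that fit expects X to be a DataFrame, since it reads X.columns):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.normal(size=(100, 2)), columns=['x1', 'x2'])
    y = pd.Series(X['x1'] - 2 * X['x2'] + rng.normal(size=100))

    # elastic-net regularized Gaussian GLM, specified via a formula
    wrapper = SMWrapper(family=sm.families.Gaussian(),
                        formula='y ~ x1 + x2',
                        alpha=0.1, L1_wt=1.0)
    wrapper.fit(X, y)
    print(wrapper.predict(X)[:5])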