Using statsmodel estimations with scikit-learn cross validation, is it possible?

Indeed, you cannot use cross_val_score directly on statsmodels objects, because they have a different interface (see the sketch after this list): in statsmodels

  • training data is passed directly into the constructor
  • a separate object contains the result of model estimation
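A minimal sketch of the two calling conventions on a toy dataset (the values in the comments are approximate):

    import numpy as np
    import statsmodels.api as sm
    from sklearn.linear_model import LinearRegression

    X = np.arange(10, dtype=float).reshape(-1, 1)
    y = 2 * X.ravel() + 1

    # statsmodels: the data goes into the constructor,
    # and fit() returns a separate results object
    results = sm.OLS(y, sm.add_constant(X)).fit()
    print(results.params)           # approx. [1. 2.]

    # sklearn: the estimator is constructed without data,
    # and fit(X, y) stores the fitted state on the estimator itself
    lr = LinearRegression().fit(X, y)
    print(lr.intercept_, lr.coef_)  # approx. 1.0 [2.]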

However, you can write a simple wrapper to make statsmodels objects look like sklearn estimators:

    import statsmodels.api as sm
    from sklearn.base import BaseEstimator, RegressorMixin

    class SMWrapper(BaseEstimator, RegressorMixin):
        """ A universal sklearn-style wrapper for statsmodels regressors """
        def __init__(self, model_class, fit_intercept=True):
            self.model_class = model_class
            self.fit_intercept = fit_intercept

        def fit(self, X, y):
            if self.fit_intercept:
                X = sm.add_constant(X)
            # statsmodels takes (endog, exog), i.e. (y, X): note the reversed order
            self.model_ = self.model_class(y, X)
            self.results_ = self.model_.fit()
            return self

        def predict(self, X):
            if self.fit_intercept:
                X = sm.add_constant(X)
            return self.results_.predict(X)

This class implements the fit and predict methods expected by sklearn, so it can be cross-validated or included in a pipeline, like here:

    from sklearn.datasets import make_regression
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LinearRegression

    X, y = make_regression(random_state=1, n_samples=300, noise=100)

    print(cross_val_score(SMWrapper(sm.OLS), X, y, scoring='r2'))
    print(cross_val_score(LinearRegression(), X, y, scoring='r2'))

You can see that the output of the two models is identical, because both are OLS models, cross-validated in the same way.

    [0.28592315 0.37367557 0.47972639]
    [0.28592315 0.37367557 0.47972639]
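Because the wrapper exposes the standard estimator interface, it also drops into a pipeline. A minimal sketch reusing SMWrapper and the data above (the StandardScaler step is just an illustrative choice):

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # scale the features, then fit a statsmodels OLS through the wrapper
    pipe = make_pipeline(StandardScaler(), SMWrapper(sm.OLS))
    print(cross_val_score(pipe, X, y, scoring='r2'))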


Following David's suggestion (which gave me an error, complaining about the missing function get_parameters) and the scikit-learn documentation, I created the following wrapper for a linear regression. It has the same interface as sklearn.linear_model.LinearRegression, but in addition it also has a summary() method, which reports p-values, R2, and other statistics, as in statsmodels.OLS.

    import statsmodels.api as sm
    import pandas as pd
    from sklearn.base import BaseEstimator, RegressorMixin
    from sklearn.utils.validation import check_X_y, check_is_fitted, check_array

    class MyLinearRegression(BaseEstimator, RegressorMixin):
        def __init__(self, fit_intercept=True):
            self.fit_intercept = fit_intercept

        def fit(self, X, y, column_names=()):
            """
            Parameters
            ----------
            column_names: list
                Optional feature names to associate with the columns of X.
                Useful with the method summary(), so that it can show the
                feature name for each coefficient.
            """
            if self.fit_intercept:
                X = sm.add_constant(X)
            # Check that X and y have correct shape
            X, y = check_X_y(X, y)
            self.X_ = X
            self.y_ = y
            if len(column_names) != 0:
                X = pd.DataFrame(X)
                cols = list(column_names)
                if self.fit_intercept:
                    cols.insert(0, 'intercept')
                X.columns = cols
            self.model_ = sm.OLS(y, X)
            self.results_ = self.model_.fit()
            return self

        def predict(self, X):
            # Check that fit has been called
            check_is_fitted(self, 'model_')
            # Input validation
            X = check_array(X)
            if self.fit_intercept:
                X = sm.add_constant(X)
            return self.results_.predict(X)

        def get_params(self, deep=False):
            return {'fit_intercept': self.fit_intercept}

        def summary(self):
            print(self.results_.summary())

Example of use:

    cols = ['feature1', 'feature2']

    X_train = df_train[cols].values
    X_test = df_test[cols].values
    y_train = df_train['label']
    y_test = df_test['label']

    model = MyLinearRegression()
    model.fit(X_train, y_train)
    model.summary()
    model.predict(X_test)

If you want to show the names of the columns, you can call

    model.fit(X_train, y_train, column_names=cols)

To use it in cross-validation:

    from sklearn.model_selection import cross_val_score

    scores = cross_val_score(MyLinearRegression(), X_train, y_train, cv=10, scoring='neg_mean_squared_error')
    scores


For reference, if you use the statsmodels formula API and/or the fit_regularized method, you can modify @David Dale's wrapper class in this way.

    import pandas as pd
    from sklearn.base import BaseEstimator, RegressorMixin
    from statsmodels.formula.api import glm as glm_sm

    # This is an example wrapper for statsmodels GLM
    class SMWrapper(BaseEstimator, RegressorMixin):
        def __init__(self, family, formula, alpha, L1_wt):
            self.family = family
            self.formula = formula
            self.alpha = alpha
            self.L1_wt = L1_wt
            self.model = None
            self.result = None

        def fit(self, X, y):
            data = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
            data.columns = X.columns.tolist() + ['y']
            self.model = glm_sm(self.formula, data, family=self.family)
            self.result = self.model.fit_regularized(alpha=self.alpha, L1_wt=self.L1_wt, refit=True)
            return self  # return the estimator itself, as sklearn expects

        def predict(self, X):
            return self.result.predict(X)
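A hypothetical usage sketch (the column names, formula, and penalty values are invented for illustration; note that fit expects X to be a DataFrame, since it reads X.columns):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.normal(size=(100, 2)), columns=['x1', 'x2'])
    y = pd.Series(X['x1'] - 2 * X['x2'] + rng.normal(size=100))

    # elastic-net regularized Gaussian GLM, specified via a formula
    wrapper = SMWrapper(family=sm.families.Gaussian(),
                        formula='y ~ x1 + x2',
                        alpha=0.1, L1_wt=1.0)
    wrapper.fit(X, y)
    print(wrapper.predict(X)[:5])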