OLS Regression: Scikit vs. Statsmodels? [closed]


It sounds like you are not feeding the same matrix of regressors X to both procedures (but see below). Here's an example to show you which options you need to use for sklearn and statsmodels to produce identical results.

    import numpy as np
    import statsmodels.api as sm
    from sklearn.linear_model import LinearRegression

    # Generate artificial data (2 regressors + constant)
    nobs = 100
    X = np.random.random((nobs, 2))
    X = sm.add_constant(X)
    beta = [1, .1, .5]
    e = np.random.random(nobs)
    y = np.dot(X, beta) + e

    # Fit regression model
    sm.OLS(y, X).fit().params
    # >> array([ 1.4507724 ,  0.08612654,  0.60129898])

    LinearRegression(fit_intercept=False).fit(X, y).coef_
    # >> array([ 1.4507724 ,  0.08612654,  0.60129898])

As a commenter suggested, even if you are giving both programs the same X, X may not have full column rank, and statsmodels and sklearn could be taking different actions under the hood to make the OLS computation go through (e.g. dropping different columns).
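A quick way to check for this is to compare the rank of X with its number of columns. The sketch below uses a made-up design matrix whose last column is an exact linear combination of two others; it also shows why the coefficients are then not unique:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.random(50)
x2 = rng.random(50)

# Hypothetical design matrix: the last column equals x1 + x2
X = np.column_stack([np.ones(50), x1, x2, x1 + x2])

rank = np.linalg.matrix_rank(X)
full_rank = rank == X.shape[1]
print(rank, X.shape[1], full_rank)

# Two different coefficient vectors that produce identical fitted values
beta_a = np.array([1.0, 0.1, 0.5, 0.0])
beta_b = np.array([1.0, -0.9, -0.5, 1.0])
print(np.allclose(X @ beta_a, X @ beta_b))
```

When the rank is smaller than the number of columns, any library has to pick one of infinitely many solutions, so differing coefficients do not by themselves mean that either result is wrong.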

I recommend you use pandas and patsy to take care of this:

    import pandas as pd
    from patsy import dmatrices

    dat = pd.read_csv('wow.csv')
    y, X = dmatrices('levels ~ week + character + guild', data=dat)

Or, alternatively, the statsmodels formula interface:

    import pandas as pd
    import statsmodels.formula.api as smf

    dat = pd.read_csv('wow.csv')
    mod = smf.ols('levels ~ week + character + guild', data=dat).fit()
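To see what the formula does with a categorical column, here is a small sketch on an in-memory DataFrame. The data is a hypothetical stand-in for wow.csv (the real file is not shown in the thread, and I left out `character` to keep it short). The design matrix built by the formula is then handed to sklearn unchanged:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for wow.csv
rng = np.random.default_rng(0)
n = 60
dat = pd.DataFrame({
    "week": rng.integers(1, 10, size=n),
    "guild": rng.choice(["a", "b", "c"], size=n),
})
dat["levels"] = 2.0 + 0.5 * dat["week"] + rng.normal(size=n)

mod = smf.ols("levels ~ week + guild", data=dat).fit()

# Reuse the design matrix the formula built; it already contains an Intercept column
X = mod.model.exog
y = mod.model.endog
sk_coef = LinearRegression(fit_intercept=False).fit(X, y).coef_

print(mod.params)
print(np.allclose(mod.params.to_numpy(), sk_coef))
```

The formula adds the intercept and dummy-codes `guild` with one level left out as the reference, which is what keeps X at full column rank.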

Edit: This example might be useful: http://statsmodels.sourceforge.net/devel/example_formulas.html


If you use statsmodels, I would highly recommend using the statsmodels formula interface instead. You will get the same old result from OLS using the statsmodels formula interface as you would from sklearn.linear_model.LinearRegression, or R, or SAS, or Excel.

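    import statsmodels.formula.api as smf

    # df is assumed to be a pandas DataFrame with columns 'y' and 'x'
    smod = smf.ols(formula='y ~ x', data=df)
    result = smod.fit()
    print(result.summary())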

When in doubt, please

  1. try reading the source code
  2. try a different language for benchmark, or
  3. try OLS from scratch, which is basic linear algebra.
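For the third option, OLS from scratch amounts to solving the normal equations (X'X) b = X'y. A minimal sketch with numpy only, on simulated data (the seed and variable names are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
nobs = 100

# Constant + 2 regressors
X = np.column_stack([np.ones(nobs), rng.random((nobs, 2))])
beta = np.array([1.0, 0.1, 0.5])
y = X @ beta + rng.random(nobs)

# Normal equations: solve (X'X) b = X'y
b_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver, which avoids forming X'X explicitly
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b_normal)
print(np.allclose(b_normal, b_lstsq))
```

Because the noise here is uniform on [0, 1] and so has mean 0.5, the estimated constant lands near 1.5 rather than 1. Any library that disagrees with this on the same X and y is worth a closer look.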


I just wanted to add a note about what sklearn does under the hood, because it is often assumed that, coming from the data-mining/machine-learning realm, it fits linear regression with a steepest-descent (gradient) algorithm, a numerical method that is sensitive to initial conditions. That is not the case for sklearn.linear_model.LinearRegression: it solves the ordinary least squares problem directly with a least-squares solver (scipy.linalg.lstsq), and statsmodels' OLS likewise uses a direct solver (a pseudoinverse by default). Gradient-based fitting lives in separate sklearn estimators such as SGDRegressor. So when the two libraries disagree, the cause is usually the inputs or the defaults (sklearn fits an intercept by default, while statsmodels does not add a constant), not the algorithm.
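A quick way to test whether LinearRegression behaves like a closed-form OLS solver is to compare it with the normal-equations solution on the same data, and to fit it twice to see whether the result depends on any starting point. A minimal sketch, with simulated data as an assumption:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((100, 2))
y = 1 + X @ np.array([0.1, 0.5]) + rng.random(100)

# Closed-form OLS on [1, X]
Xc = np.column_stack([np.ones(len(X)), X])
b_closed = np.linalg.solve(Xc.T @ Xc, Xc.T @ y)

# Fit the sklearn model twice on identical data
fit1 = LinearRegression().fit(X, y)
fit2 = LinearRegression().fit(X, y)
b_sk = np.r_[fit1.intercept_, fit1.coef_]

print(np.allclose(b_sk, b_closed))
print(np.allclose(fit1.coef_, fit2.coef_))
```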