OLS Regression: Scikit vs. Statsmodels? [closed]

python scikit-learn linear-regression statsmodels

It sounds like you are not feeding the same matrix of regressors X to both procedures (but see below). Here's an example to show you which options you need to use for sklearn and statsmodels to produce identical results.

    import numpy as np
    import statsmodels.api as sm
    from sklearn.linear_model import LinearRegression

    # Generate artificial data (2 regressors + constant)
    nobs = 100
    X = np.random.random((nobs, 2))
    X = sm.add_constant(X)
    beta = [1, .1, .5]
    e = np.random.random(nobs)
    y = np.dot(X, beta) + e

    # Fit regression model
    sm.OLS(y, X).fit().params
    >> array([ 1.4507724 ,  0.08612654,  0.60129898])

    LinearRegression(fit_intercept=False).fit(X, y).coef_
    >> array([ 1.4507724 ,  0.08612654,  0.60129898])

As a commenter suggested, even if you are giving both programs the same X, X may not have full column rank, and sm/sk could be taking (different) actions under the hood to make the OLS computation go through (e.g. dropping different columns).
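To see whether rank deficiency is the culprit, a quick check with plain numpy can help. The sketch below is only an illustration under assumed data: the third column is deliberately built as the sum of the first two, and the "drop a column" strategy is a stand-in for whatever a given library might do internally, not a description of what sm or sk actually do.

```python
import numpy as np

rng = np.random.default_rng(0)
nobs = 100

# Assumed toy data: the third column is an exact linear combination
# of the first two, so X has 3 columns but only rank 2.
x1 = rng.random(nobs)
x2 = rng.random(nobs)
X = np.column_stack([x1, x2, x1 + x2])
y = 0.1 * x1 + 0.5 * x2 + rng.random(nobs)

rank = np.linalg.matrix_rank(X)
full_rank = rank == X.shape[1]
print(f"columns: {X.shape[1]}, rank: {rank}, full column rank: {full_rank}")

# Two equally valid least-squares solutions for the same rank-deficient X:
# (a) the minimum-norm solution returned by lstsq
beta_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)
# (b) drop the redundant column, fit, and pad its coefficient with zero
beta_drop, *_ = np.linalg.lstsq(X[:, :2], y, rcond=None)
beta_drop = np.append(beta_drop, 0.0)

# The fitted values agree, but the coefficient vectors do not.
same_fit = np.allclose(X @ beta_min_norm, X @ beta_drop)
same_coefs = np.allclose(beta_min_norm, beta_drop)
print("same fitted values:", same_fit)
print("same coefficients:", same_coefs)
```

If the rank is smaller than the number of columns, differing coefficients between two packages are not a bug in either one: the coefficients are simply not uniquely determined.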

I recommend you use pandas and patsy to take care of this:

    import pandas as pd
    from patsy import dmatrices

    dat = pd.read_csv('wow.csv')
    y, X = dmatrices('levels ~ week + character + guild', data=dat)

Or, alternatively, the statsmodels formula interface:

    import pandas as pd
    import statsmodels.formula.api as smf

    dat = pd.read_csv('wow.csv')
    mod = smf.ols('levels ~ week + character + guild', data=dat).fit()

Edit: This example might be useful: http://statsmodels.sourceforge.net/devel/example_formulas.html


If you use statsmodels, I would highly recommend using the statsmodels formula interface instead. You will get the same old result from OLS using the statsmodels formula interface as you would from sklearn.linear_model.LinearRegression, or R, or SAS, or Excel.

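    import statsmodels.formula.api as smf

    # df is assumed to be a pandas DataFrame with columns 'y' and 'x'
    smod = smf.ols(formula='y ~ x', data=df)
    result = smod.fit()
    print(result.summary())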

When in doubt, please

try reading the source code
try a different language for benchmark, or
try OLS from scratch, which is basic linear algebra.
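For the last suggestion, OLS from scratch can be sketched in a few lines of numpy. This is a minimal illustration on assumed synthetic data (the true coefficients and noise level are made up), solving the normal equations directly and cross-checking against np.linalg.lstsq.

```python
import numpy as np

rng = np.random.default_rng(42)
nobs = 100

# Assumed synthetic data: a constant column plus 2 random regressors
X = np.column_stack([np.ones(nobs), rng.random((nobs, 2))])
beta_true = np.array([1.0, 0.1, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=nobs)

# Normal equations: solve (X'X) b = X'y for b
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with numpy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# At the OLS solution the residuals are orthogonal to every column of X
residuals = y - X @ beta_normal

print("normal equations:", beta_normal)
print("lstsq:           ", beta_lstsq)
```

Either vector can then be compared against the params from statsmodels or the coef_ from sklearn fitted on the same X and y. Solving the normal equations is fine for a well-conditioned X like this one; for ill-conditioned problems lstsq is the safer choice.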


I just wanted to add a note on what sklearn does under the hood. LinearRegression does not use a steepest-descent gradient algorithm: it solves the ordinary least squares problem directly with a linear-algebra routine (scipy.linalg.lstsq), so for the same full-rank X and the same intercept setting it should agree with statsmodels' OLS up to floating-point error. Gradient-based fitting lives in other sklearn estimators such as SGDRegressor; that is an iterative numerical method that is sensitive to step size, initial conditions, and stopping criteria, so with it one should expect small differences from the closed-form OLS solution. If LinearRegression and statsmodels disagree, look at the inputs (constant column, rank deficiency) rather than at the solver.
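To make the closed-form versus iterative distinction concrete, here is a small numpy sketch on assumed synthetic data. It compares the direct least-squares solution with a hand-written gradient descent on the mean squared error. The step size and iteration count are choices made for this illustration, not what any library uses.

```python
import numpy as np

rng = np.random.default_rng(1)
nobs = 200

# Assumed synthetic data: constant + 2 regressors, full column rank
X = np.column_stack([np.ones(nobs), rng.random((nobs, 2))])
y = X @ np.array([1.0, 0.1, 0.5]) + rng.normal(scale=0.1, size=nobs)

# Direct (closed-form style) least-squares solution
beta_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient descent on the mean squared error.
# The Hessian is (2/n) X'X; using 1 / (its largest eigenvalue) as the
# step size guarantees convergence for this convex problem.
hessian = 2.0 * (X.T @ X) / nobs
step = 1.0 / np.linalg.eigvalsh(hessian).max()

beta_gd = np.zeros(X.shape[1])
for _ in range(5000):
    grad = (2.0 / nobs) * (X.T @ (X @ beta_gd - y))
    beta_gd -= step * grad

max_diff = np.abs(beta_gd - beta_closed).max()
print("closed form:     ", beta_closed)
print("gradient descent:", beta_gd)
print("max abs difference:", max_diff)
```

With enough iterations on a well-conditioned problem the two agree to many decimal places. Visible differences appear when the iterative method is stopped early or the step size is poorly chosen.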

