Run an OLS regression with Pandas Data Frame

python pandas scikit-learn regression statsmodels

I think you can do almost exactly what you thought would be ideal using the statsmodels package, which was one of pandas' optional dependencies before pandas 0.20.0 (it was used for a few things in pandas.stats).

    >>> import pandas as pd
    >>> import statsmodels.formula.api as sm
    >>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
    >>> result = sm.ols(formula="A ~ B + C", data=df).fit()
    >>> print(result.params)
    Intercept    14.952480
    B             0.401182
    C             0.000352
    dtype: float64
    >>> print(result.summary())
                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                      A   R-squared:                       0.579
    Model:                            OLS   Adj. R-squared:                  0.158
    Method:                 Least Squares   F-statistic:                     1.375
    Date:                Thu, 14 Nov 2013   Prob (F-statistic):              0.421
    Time:                        20:04:30   Log-Likelihood:                -18.178
    No. Observations:                   5   AIC:                             42.36
    Df Residuals:                       2   BIC:                             41.19
    Df Model:                           2
    ==============================================================================
                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
    ------------------------------------------------------------------------------
    Intercept     14.9525     17.764      0.842      0.489       -61.481    91.386
    B              0.4012      0.650      0.617      0.600        -2.394     3.197
    C              0.0004      0.001      0.650      0.583        -0.002     0.003
    ==============================================================================
    Omnibus:                          nan   Durbin-Watson:                   1.061
    Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.498
    Skew:                          -0.123   Prob(JB):                        0.780
    Kurtosis:                       1.474   Cond. No.                     5.21e+04
    ==============================================================================

    Warnings:
    [1] The condition number is large, 5.21e+04. This might indicate that there are
    strong multicollinearity or other numerical problems.
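If you only need a few numbers rather than the whole printed summary, the fitted result also exposes them as attributes. Here is a minimal sketch on the same data. It imports the formula API as smf (the more common alias; the snippet above uses sm), and new_rows is made-up input used only to illustrate predict.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Same toy data as in the answer above.
df = pd.DataFrame({
    "A": [10, 20, 30, 40, 50],
    "B": [20, 30, 10, 40, 50],
    "C": [32, 234, 23, 23, 42523],
})

result = smf.ols(formula="A ~ B + C", data=df).fit()

# Individual pieces of the fit, instead of the full text summary.
intercept = result.params["Intercept"]  # estimated intercept
slope_b = result.params["B"]            # estimated coefficient on B
r_squared = result.rsquared             # coefficient of determination
p_values = result.pvalues               # Series indexed by term name
conf_int = result.conf_int()            # DataFrame of lower/upper bounds (95% by default)

# Predict A for new rows; new_rows is hypothetical illustrative input.
new_rows = pd.DataFrame({"B": [25, 35], "C": [100, 200]})
predicted = result.predict(new_rows)

print(intercept, slope_b, r_squared)
```

Because params, pvalues and the prediction come back as pandas objects, they can be joined back onto the original data frame without any conversion.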


Note: pandas.stats was removed in pandas 0.20.0, so the following only works with older versions.

With those older versions, it's possible to do this with pandas.stats.ols:

    >>> import pandas as pd
    >>> from pandas.stats.api import ols
    >>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
    >>> res = ols(y=df['A'], x=df[['B','C']])
    >>> res
    -------------------------Summary of Regression Analysis-------------------------

    Formula: Y ~ <B> + <C> + <intercept>

    Number of Observations:         5
    Number of Degrees of Freedom:   3

    R-squared:         0.5789
    Adj R-squared:     0.1577

    Rmse:             14.5108

    F-stat (2, 2):     1.3746, p-value:     0.4211

    Degrees of Freedom: model 2, resid 2

    -----------------------Summary of Estimated Coefficients------------------------
          Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
    --------------------------------------------------------------------------------
                 B     0.4012     0.6497       0.62     0.5999    -0.8723     1.6746
                 C     0.0004     0.0005       0.65     0.5826    -0.0007     0.0014
         intercept    14.9525    17.7643       0.84     0.4886   -19.8655    49.7705
    ---------------------------------End of Summary---------------------------------

Note that you need to have the statsmodels package installed; it is used internally by the pandas.stats.ols function.


I don't know if this is new in sklearn or pandas, but I'm able to pass the data frame directly to sklearn without converting it to a NumPy array or any other data type.

    from sklearn import linear_model
    reg = linear_model.LinearRegression()
    reg.fit(df[['B', 'C']], df['A'])
    >>> reg.coef_
    array([  4.01182386e-01,   3.51587361e-04])
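Note that coef_ holds only the slopes; the intercept is stored separately in intercept_. A small self-contained sketch on the same data, which also pairs each coefficient with its column name (the coefs dict is just an illustrative convenience, not something sklearn produces):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same toy data as in the answers above.
df = pd.DataFrame({
    "A": [10, 20, 30, 40, 50],
    "B": [20, 30, 10, 40, 50],
    "C": [32, 234, 23, 23, 42523],
})

features = ["B", "C"]

# fit_intercept=True is the default, so an intercept is estimated.
reg = LinearRegression()
reg.fit(df[features], df["A"])

# Pair each slope with the column it belongs to.
coefs = dict(zip(features, reg.coef_))
intercept = reg.intercept_

# R-squared on the training data.
r_squared = reg.score(df[features], df["A"])

print(intercept, coefs, r_squared)
```

sklearn gives you the point estimates and R-squared but no standard errors or p-values, so statsmodels is the better fit if you need the full regression table.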
