Run an OLS regression with Pandas Data Frame


I think you can do almost exactly what you thought would be ideal using the statsmodels package. statsmodels was one of pandas' optional dependencies before pandas 0.20.0 (it was used for a few things in pandas.stats).

    >>> import pandas as pd
    >>> import statsmodels.formula.api as sm
    >>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
    >>> result = sm.ols(formula="A ~ B + C", data=df).fit()
    >>> print(result.params)
    Intercept    14.952480
    B             0.401182
    C             0.000352
    dtype: float64
    >>> print(result.summary())
                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                      A   R-squared:                       0.579
    Model:                            OLS   Adj. R-squared:                  0.158
    Method:                 Least Squares   F-statistic:                     1.375
    Date:                Thu, 14 Nov 2013   Prob (F-statistic):              0.421
    Time:                        20:04:30   Log-Likelihood:                -18.178
    No. Observations:                   5   AIC:                             42.36
    Df Residuals:                       2   BIC:                             41.19
    Df Model:                           2
    ==============================================================================
                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
    ------------------------------------------------------------------------------
    Intercept     14.9525     17.764      0.842      0.489       -61.481    91.386
    B              0.4012      0.650      0.617      0.600        -2.394     3.197
    C              0.0004      0.001      0.650      0.583        -0.002     0.003
    ==============================================================================
    Omnibus:                          nan   Durbin-Watson:                   1.061
    Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.498
    Skew:                          -0.123   Prob(JB):                        0.780
    Kurtosis:                       1.474   Cond. No.                     5.21e+04
    ==============================================================================

    Warnings:
    [1] The condition number is large, 5.21e+04. This might indicate that there are
    strong multicollinearity or other numerical problems.
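If you need individual numbers rather than the printed summary, the fitted result exposes them as attributes. Here is a minimal sketch, assuming a reasonably recent statsmodels and using the conventional `smf` alias for the same module that is imported as `sm` above. The `new_rows` frame is made up purely for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

result = smf.ols(formula="A ~ B + C", data=df).fit()

# Coefficients come back as a pandas Series indexed by term name.
slope_b = result.params["B"]
intercept = result.params["Intercept"]

# Goodness of fit and per-term p-values.
r_squared = result.rsquared
p_values = result.pvalues

# Confidence intervals: a DataFrame with one row per term (95% by default).
conf_int = result.conf_int()

# Predictions for new data; column names must match those used in the formula.
new_rows = pd.DataFrame({"B": [25, 35], "C": [100, 200]})  # hypothetical inputs
predicted = result.predict(new_rows)
```

Because `params` and `pvalues` are ordinary Series, you can filter, sort, or join them back onto other frames like any other pandas object.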


Note: pandas.stats was removed in pandas 0.20.0, so the following only works with older versions.


It's possible to do this with pandas.stats.ols:

    >>> import pandas as pd
    >>> from pandas.stats.api import ols
    >>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
    >>> res = ols(y=df['A'], x=df[['B','C']])
    >>> res
    -------------------------Summary of Regression Analysis-------------------------

    Formula: Y ~ <B> + <C> + <intercept>

    Number of Observations:         5
    Number of Degrees of Freedom:   3

    R-squared:         0.5789
    Adj R-squared:     0.1577

    Rmse:             14.5108

    F-stat (2, 2):     1.3746, p-value:     0.4211

    Degrees of Freedom: model 2, resid 2

    -----------------------Summary of Estimated Coefficients------------------------
          Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
    --------------------------------------------------------------------------------
                 B     0.4012     0.6497       0.62     0.5999    -0.8723     1.6746
                 C     0.0004     0.0005       0.65     0.5826    -0.0007     0.0014
         intercept    14.9525    17.7643       0.84     0.4886   -19.8655    49.7705
    ---------------------------------End of Summary---------------------------------

Note that you need to have the statsmodels package installed; it is used internally by the pandas.stats.ols function.
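Since pandas.stats.ols is no longer available on current pandas, one dependency-light way to get the same coefficients is NumPy's least-squares solver. This is not what pandas.stats did internally, just a rough sketch of an alternative, assuming pandas 0.24 or later (for `to_numpy`). The names `X`, `y`, and `coefs` are my own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

# Design matrix: a column of ones for the intercept, then the regressors.
X = np.column_stack([np.ones(len(df)),
                     df[["B", "C"]].to_numpy(dtype=float)])
y = df["A"].to_numpy(dtype=float)

# Solve min ||X @ beta - y||^2; rcond=None selects NumPy's current default cutoff.
beta, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

# Label the coefficients so they read like the summary table above.
coefs = pd.Series(beta, index=["intercept", "B", "C"])
print(coefs)
```

This only gives you the point estimates. For standard errors, t-statistics, and p-values it is much easier to use statsmodels directly.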


I don't know whether this is new in sklearn or pandas, but I'm able to pass the DataFrame directly to sklearn without converting it to a NumPy array or any other data type.

    from sklearn import linear_model
    reg = linear_model.LinearRegression()
    reg.fit(df[['B', 'C']], df['A'])
    >>> reg.coef_
    array([  4.01182386e-01,   3.51587361e-04])
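Note that `coef_` holds only the slopes; sklearn stores the intercept separately. Here is a self-contained sketch of the same fit that also pulls out the intercept and R-squared and predicts from a new DataFrame, assuming a recent scikit-learn. The `new_rows` frame is hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

X = df[["B", "C"]]   # 2-D: a DataFrame of regressors
y = df["A"]          # 1-D: the response as a Series

reg = LinearRegression()  # fit_intercept=True by default
reg.fit(X, y)

# Pair each slope with its column name; the intercept lives in its own attribute.
slopes = pd.Series(reg.coef_, index=X.columns)
intercept = reg.intercept_

# score() returns the coefficient of determination (R-squared) on the given data.
r_squared = reg.score(X, y)

# Predicting from a DataFrame with the same column names as the training data.
new_rows = pd.DataFrame({"B": [25, 35], "C": [100, 200]})  # hypothetical inputs
predicted = reg.predict(new_rows)
```

sklearn does not report standard errors or p-values, so it suits prediction better than inference.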