Run an OLS regression with Pandas Data Frame
I think you can almost do exactly what you thought would be ideal, using the statsmodels package, which was one of pandas' optional dependencies before pandas version 0.20.0 (it was used for a few things in pandas.stats).
    >>> import pandas as pd
    >>> import statsmodels.formula.api as sm
    >>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
    >>> result = sm.ols(formula="A ~ B + C", data=df).fit()
    >>> print(result.params)
    Intercept    14.952480
    B             0.401182
    C             0.000352
    dtype: float64
    >>> print(result.summary())
                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                      A   R-squared:                       0.579
    Model:                            OLS   Adj. R-squared:                  0.158
    Method:                 Least Squares   F-statistic:                     1.375
    Date:                Thu, 14 Nov 2013   Prob (F-statistic):              0.421
    Time:                        20:04:30   Log-Likelihood:                -18.178
    No. Observations:                   5   AIC:                             42.36
    Df Residuals:                       2   BIC:                             41.19
    Df Model:                           2
    ==============================================================================
                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
    ------------------------------------------------------------------------------
    Intercept     14.9525     17.764      0.842      0.489       -61.481    91.386
    B              0.4012      0.650      0.617      0.600        -2.394     3.197
    C              0.0004      0.001      0.650      0.583        -0.002     0.003
    ==============================================================================
    Omnibus:                          nan   Durbin-Watson:                   1.061
    Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.498
    Skew:                          -0.123   Prob(JB):                        0.780
    Kurtosis:                       1.474   Cond. No.                     5.21e+04
    ==============================================================================

    Warnings:
    [1] The condition number is large, 5.21e+04. This might indicate that there are
    strong multicollinearity or other numerical problems.
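If you only need a few numbers rather than the whole summary table, the fitted result object exposes them as attributes. Here is a minimal sketch on the same data; the `smf` alias and the `new_rows` frame are my own illustrative choices, not part of the original answer.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Same toy data as in the session above.
df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

result = smf.ols(formula="A ~ B + C", data=df).fit()

# Individual statistics, without printing the full summary.
r_squared = result.rsquared   # coefficient of determination
p_values = result.pvalues     # Series indexed by Intercept, B, C

# Predictions for new observations (hypothetical values). With the
# formula interface, the new frame only needs the predictor columns.
new_rows = pd.DataFrame({"B": [25, 35], "C": [100, 200]})
predictions = result.predict(new_rows)

print(r_squared)
print(p_values)
print(predictions)
```

Because the model was built from a formula, `predict` adds the intercept term itself, so you do not have to supply a constant column.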
Note: pandas.stats was removed in pandas 0.20.0, so the following only works with earlier versions.

It's possible to do this with pandas.stats.ols:
    >>> import pandas as pd
    >>> from pandas.stats.api import ols
    >>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
    >>> res = ols(y=df['A'], x=df[['B','C']])
    >>> res
    -------------------------Summary of Regression Analysis-------------------------

    Formula: Y ~ <B> + <C> + <intercept>

    Number of Observations:         5
    Number of Degrees of Freedom:   3

    R-squared:         0.5789
    Adj R-squared:     0.1577

    Rmse:             14.5108

    F-stat (2, 2):     1.3746, p-value:     0.4211

    Degrees of Freedom: model 2, resid 2

    -----------------------Summary of Estimated Coefficients------------------------
          Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
    --------------------------------------------------------------------------------
                 B     0.4012     0.6497       0.62     0.5999    -0.8723     1.6746
                 C     0.0004     0.0005       0.65     0.5826    -0.0007     0.0014
         intercept    14.9525    17.7643       0.84     0.4886   -19.8655    49.7705
    ---------------------------------End of Summary---------------------------------
Note that you need to have the statsmodels package installed; it is used internally by the pandas.stats.ols function.
I don't know if this is new in sklearn or pandas, but I'm able to pass the data frame directly to sklearn without converting it to a numpy array or any other data type.
    from sklearn import linear_model
    reg = linear_model.LinearRegression()
    reg.fit(df[['B', 'C']], df['A'])
    >>> reg.coef_
    array([ 4.01182386e-01, 3.51587361e-04])
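Note that `coef_` holds only the slopes; sklearn stores the intercept separately in `intercept_`. A small self-contained sketch of pulling both out and predicting from another data frame follows. The `new_rows` values and the `coefs` Series are hypothetical, added here only for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

features = ["B", "C"]
reg = LinearRegression().fit(df[features], df["A"])

# Label the slopes with their column names for readability.
coefs = pd.Series(reg.coef_, index=features)
intercept = reg.intercept_

# Predict for new data; the columns must match those used in fit().
new_rows = pd.DataFrame({"B": [25], "C": [100]})
pred = reg.predict(new_rows)

print(coefs)
print(intercept)
print(pred)
```

Unlike statsmodels, sklearn does not report standard errors or p-values, so this route fits best when you only need the coefficients or predictions.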