Find p-value (significance) in scikit-learn LinearRegression
This is kind of overkill, but let's give it a go. First, let's use statsmodels to find out what the p-values should be:
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats

diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

# Fit an OLS model with statsmodels (add_constant appends the intercept column)
X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())
and we get:
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.518
Model:                            OLS   Adj. R-squared:                  0.507
Method:                 Least Squares   F-statistic:                     46.27
Date:                Wed, 08 Mar 2017   Prob (F-statistic):           3.83e-62
Time:                        10:08:24   Log-Likelihood:                -2386.0
No. Observations:                 442   AIC:                             4794.
Df Residuals:                     431   BIC:                             4839.
Df Model:                          10
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        152.1335      2.576     59.061      0.000     147.071     157.196
x1           -10.0122     59.749     -0.168      0.867    -127.448     107.424
x2          -239.8191     61.222     -3.917      0.000    -360.151    -119.488
x3           519.8398     66.534      7.813      0.000     389.069     650.610
x4           324.3904     65.422      4.958      0.000     195.805     452.976
x5          -792.1842    416.684     -1.901      0.058   -1611.169      26.801
x6           476.7458    339.035      1.406      0.160    -189.621    1143.113
x7           101.0446    212.533      0.475      0.635    -316.685     518.774
x8           177.0642    161.476      1.097      0.273    -140.313     494.442
x9           751.2793    171.902      4.370      0.000     413.409    1089.150
x10           67.6254     65.984      1.025      0.306     -62.065     197.316
==============================================================================
Omnibus:                        1.506   Durbin-Watson:                   2.029
Prob(Omnibus):                  0.471   Jarque-Bera (JB):                1.404
Skew:                           0.017   Prob(JB):                        0.496
Kurtosis:                       2.726   Cond. No.                         227.
==============================================================================
OK, let's reproduce this. It is kind of overkill, as we are almost redoing the full linear regression analysis with matrix algebra. But what the heck.
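The only ingredients we need are the textbook OLS formulas for the coefficient covariance matrix and the resulting t-tests:

$$\widehat{\mathrm{Var}}(\hat\beta) = \hat\sigma^2 (X^\top X)^{-1}, \qquad \hat\sigma^2 = \frac{\sum_i (y_i - \hat y_i)^2}{n - p},$$

so each t-statistic is $t_j = \hat\beta_j / \mathrm{se}(\hat\beta_j)$ with $\mathrm{se}(\hat\beta_j) = \sqrt{\hat\sigma^2\,[(X^\top X)^{-1}]_{jj}}$, and the two-sided p-value is $2\,(1 - F_{t,\,n-p}(|t_j|))$. Here $n = 442$ and $p = 11$ (ten features plus the constant), which matches the 431 residual degrees of freedom in the summary above.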
lm = LinearRegression()
lm.fit(X, y)
params = np.append(lm.intercept_, lm.coef_)
predictions = lm.predict(X)

# Design matrix with an explicit constant column for the intercept
newX = pd.DataFrame({"Constant": np.ones(len(X))}).join(pd.DataFrame(X))
# Residual variance estimate: RSS / (n - p)
MSE = (sum((y - predictions) ** 2)) / (len(newX) - len(newX.columns))

# Note: if you don't want to use a DataFrame, replace the two lines above with
# newX = np.append(np.ones((len(X), 1)), X, axis=1)
# MSE = (sum((y - predictions) ** 2)) / (len(newX) - len(newX[0]))

# Coefficient variances are the diagonal of MSE * (X'X)^-1
var_b = MSE * (np.linalg.inv(np.dot(newX.T, newX)).diagonal())
sd_b = np.sqrt(var_b)
ts_b = params / sd_b

# Two-sided p-values from a t distribution with n - p degrees of freedom
p_values = [2 * (1 - stats.t.cdf(np.abs(i), (len(newX) - len(newX.columns)))) for i in ts_b]

sd_b = np.round(sd_b, 3)
ts_b = np.round(ts_b, 3)
p_values = np.round(p_values, 3)
params = np.round(params, 4)

myDF3 = pd.DataFrame()
myDF3["Coefficients"], myDF3["Standard Errors"], myDF3["t values"], myDF3["Probabilities"] = [params, sd_b, ts_b, p_values]
print(myDF3)
And this gives us:
    Coefficients  Standard Errors  t values  Probabilities
0       152.1335            2.576    59.061          0.000
1       -10.0122           59.749    -0.168          0.867
2      -239.8191           61.222    -3.917          0.000
3       519.8398           66.534     7.813          0.000
4       324.3904           65.422     4.958          0.000
5      -792.1842          416.684    -1.901          0.058
6       476.7458          339.035     1.406          0.160
7       101.0446          212.533     0.475          0.635
8       177.0642          161.476     1.097          0.273
9       751.2793          171.902     4.370          0.000
10       67.6254           65.984     1.025          0.306
So we can reproduce the values from statsmodels.
scikit-learn's LinearRegression doesn't calculate this information, but you can easily extend the class to do it:
from sklearn import linear_model
from scipy import stats
import numpy as np


class LinearRegression(linear_model.LinearRegression):
    """
    LinearRegression class after sklearn's, but calculates t-statistics
    and p-values for the model coefficients (betas) as well.
    Additional attributes available after .fit() are `t` and `p`, both of
    shape (n_targets, n_features).
    This class sets fit_intercept to False by default, since usually the
    constant column is already included in X.
    """

    def __init__(self, *args, **kwargs):
        kwargs.setdefault("fit_intercept", False)
        super().__init__(*args, **kwargs)

    def fit(self, X, y, sample_weight=None):
        super().fit(X, y, sample_weight=sample_weight)
        X = np.asarray(X)

        # Residual variance per target: RSS / (n - p)
        sse = np.atleast_1d(
            np.sum((self.predict(X) - y) ** 2, axis=0)
            / float(X.shape[0] - X.shape[1])
        )
        # Standard errors from the diagonal of sse * (X'X)^-1, one row per target
        se = np.array([
            np.sqrt(np.diagonal(sse[i] * np.linalg.inv(np.dot(X.T, X))))
            for i in range(sse.shape[0])
        ])

        self.t = self.coef_ / se
        self.p = 2 * (1 - stats.t.cdf(np.abs(self.t), X.shape[0] - X.shape[1]))
        return self
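A minimal usage sketch, reusing X and y from the diabetes example above (the constant column is added by hand, since fit_intercept defaults to False in this class):

X2 = np.append(np.ones((len(X), 1)), X, axis=1)  # prepend the intercept column
clf = LinearRegression()
clf.fit(X2, y)
print(clf.t)  # t-statistics, shape (n_targets, n_features)
print(clf.p)  # two-sided p-values; the first entry belongs to the constant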
Stolen from here.
You should take a look at statsmodels for this kind of statistical analysis in Python.
EDIT: This is not the right way to do it; see the comments below. Others might make the same mistake, which is why I think it is good to keep this answer.
You could use sklearn.feature_selection.f_regression.
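Note that f_regression performs a separate univariate F-test of each feature against y, so its p-values are not the same as the per-coefficient p-values of the joint model above (likely what the EDIT refers to). A minimal sketch on the same diabetes data:

from sklearn.feature_selection import f_regression

# F-statistic and p-value of a univariate regression of y on each single feature
F, p = f_regression(X, y)
print(p)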