
Find p-value (significance) in scikit-learn LinearRegression


This is kind of overkill, but let's give it a go. First, let's use statsmodels to find out what the p-values should be:

import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats

diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())

and we get

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.518
Model:                            OLS   Adj. R-squared:                  0.507
Method:                 Least Squares   F-statistic:                     46.27
Date:                Wed, 08 Mar 2017   Prob (F-statistic):           3.83e-62
Time:                        10:08:24   Log-Likelihood:                -2386.0
No. Observations:                 442   AIC:                             4794.
Df Residuals:                     431   BIC:                             4839.
Df Model:                          10
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        152.1335      2.576     59.061      0.000     147.071     157.196
x1           -10.0122     59.749     -0.168      0.867    -127.448     107.424
x2          -239.8191     61.222     -3.917      0.000    -360.151    -119.488
x3           519.8398     66.534      7.813      0.000     389.069     650.610
x4           324.3904     65.422      4.958      0.000     195.805     452.976
x5          -792.1842    416.684     -1.901      0.058   -1611.169      26.801
x6           476.7458    339.035      1.406      0.160    -189.621    1143.113
x7           101.0446    212.533      0.475      0.635    -316.685     518.774
x8           177.0642    161.476      1.097      0.273    -140.313     494.442
x9           751.2793    171.902      4.370      0.000     413.409    1089.150
x10           67.6254     65.984      1.025      0.306     -62.065     197.316
==============================================================================
Omnibus:                        1.506   Durbin-Watson:                   2.029
Prob(Omnibus):                  0.471   Jarque-Bera (JB):                1.404
Skew:                           0.017   Prob(JB):                        0.496
Kurtosis:                       2.726   Cond. No.                         227.
==============================================================================

OK, let's reproduce this. It is kind of overkill, since we are essentially redoing the whole linear regression analysis with matrix algebra, but what the heck.
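
For reference, these are the standard OLS formulas the code below implements (n is the number of observations, p the number of columns of the design matrix, constant included):

\widehat{\mathrm{Var}}(\hat\beta) = \mathrm{MSE}\,(X^\top X)^{-1},
\qquad
\mathrm{MSE} = \frac{\sum_i (y_i - \hat y_i)^2}{n - p},

t_j = \frac{\hat\beta_j}{\sqrt{[\widehat{\mathrm{Var}}(\hat\beta)]_{jj}}},
\qquad
p_j = 2\bigl(1 - F_{t_{n-p}}(|t_j|)\bigr),

where F_{t_{n-p}} is the CDF of the t distribution with n - p degrees of freedom.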

lm = LinearRegression()
lm.fit(X, y)
params = np.append(lm.intercept_, lm.coef_)
predictions = lm.predict(X)

newX = pd.DataFrame({"Constant": np.ones(len(X))}).join(pd.DataFrame(X))
MSE = (sum((y - predictions) ** 2)) / (len(newX) - len(newX.columns))

# Note: if you don't want to use a DataFrame, replace the two lines above with
# newX = np.append(np.ones((len(X), 1)), X, axis=1)
# MSE = (sum((y - predictions) ** 2)) / (len(newX) - len(newX[0]))

var_b = MSE * (np.linalg.inv(np.dot(newX.T, newX)).diagonal())
sd_b = np.sqrt(var_b)
ts_b = params / sd_b

# two-sided p-values from the t distribution with n - p degrees of freedom
p_values = [2 * (1 - stats.t.cdf(np.abs(i), len(newX) - len(newX.columns))) for i in ts_b]

sd_b = np.round(sd_b, 3)
ts_b = np.round(ts_b, 3)
p_values = np.round(p_values, 3)
params = np.round(params, 4)

myDF3 = pd.DataFrame()
myDF3["Coefficients"], myDF3["Standard Errors"], myDF3["t values"], myDF3["Probabilities"] = [params, sd_b, ts_b, p_values]
print(myDF3)

And this gives us:

    Coefficients  Standard Errors  t values  Probabilities
0       152.1335            2.576    59.061          0.000
1       -10.0122           59.749    -0.168          0.867
2      -239.8191           61.222    -3.917          0.000
3       519.8398           66.534     7.813          0.000
4       324.3904           65.422     4.958          0.000
5      -792.1842          416.684    -1.901          0.058
6       476.7458          339.035     1.406          0.160
7       101.0446          212.533     0.475          0.635
8       177.0642          161.476     1.097          0.273
9       751.2793          171.902     4.370          0.000
10       67.6254           65.984     1.025          0.306

So we can reproduce the values from statsmodels.
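
As a quick sanity check (a sketch assuming est2 and p_values from the snippets above are still in scope), you can compare against statsmodels directly:

# statsmodels exposes the p-values on the fitted results object as `pvalues`;
# our values were rounded to 3 decimals, so round the reference the same way
print(np.allclose(p_values, np.round(est2.pvalues, 3)))  # should print True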


scikit-learn's LinearRegression doesn't calculate this information, but you can easily extend the class to do it:

from sklearn import linear_model
from scipy import stats
import numpy as np


class LinearRegression(linear_model.LinearRegression):
    """
    LinearRegression class after sklearn's, but calculate t-statistics
    and p-values for model coefficients (betas).
    Additional attributes available after .fit()
    are `t` and `p`, which are of the shape (y.shape[1], X.shape[1]),
    i.e. (n_targets, n_features).
    This class sets the intercept to 0 by default, since usually we include it
    in X.
    """

    def __init__(self, *args, **kwargs):
        if "fit_intercept" not in kwargs:
            kwargs["fit_intercept"] = False
        super().__init__(*args, **kwargs)

    def fit(self, X, y):
        self = super().fit(X, y)

        # residual variance estimate: SSE / (n_samples - n_features)
        sse = np.sum((self.predict(X) - y) ** 2, axis=0) / float(X.shape[0] - X.shape[1])
        sse = np.atleast_1d(sse)  # so a 1-D y works as well as a 2-D y

        # standard errors from the diagonal of sse * (X'X)^-1, one row per target
        se = np.array([
            np.sqrt(np.diagonal(sse[i] * np.linalg.inv(np.dot(X.T, X))))
            for i in range(sse.shape[0])
        ])

        self.t = self.coef_ / se
        self.p = 2 * (1 - stats.t.cdf(np.abs(self.t), y.shape[0] - X.shape[1]))
        return self

Stolen from here.
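
A minimal usage sketch (the data loading mirrors the earlier answers; since fit_intercept defaults to False in this class, the constant column is added to X explicitly):

import numpy as np
from sklearn import datasets

diabetes = datasets.load_diabetes()
# prepend a column of ones so the intercept is estimated as a regular coefficient
X = np.column_stack([np.ones(len(diabetes.data)), diabetes.data])
model = LinearRegression().fit(X, diabetes.target)
print(model.p)  # p-values; the first column corresponds to the constant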

You should take a look at statsmodels for this kind of statistical analysis in Python.


EDIT: This is not the right way to do it; see the comments below. Others might make the same mistake, which is why I think it is worth keeping this answer.

You could use sklearn.feature_selection.f_regression; a minimal sketch is shown below.

See the scikit-learn documentation for sklearn.feature_selection.f_regression.
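
For what it's worth, here is what that call looks like on the diabetes data used above. Note that f_regression tests each feature against y in a separate univariate regression, so its p-values will not match the multiple-regression table earlier, which is presumably the mistake the edit refers to:

from sklearn import datasets
from sklearn.feature_selection import f_regression

diabetes = datasets.load_diabetes()
# returns one F-statistic and one p-value per feature, each computed
# from a univariate regression of y on that single feature
F, pvals = f_regression(diabetes.data, diabetes.target)
print(pvals)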