
Screening (multi)collinearity in a regression model


The kappa() function can help. Here is a simulated example:

> set.seed(42)
> x1 <- rnorm(100)
> x2 <- rnorm(100)
> x3 <- x1 + 2*x2 + rnorm(100)*0.0001    # so x3 approx. a linear comb. of x1 + x2
> mm12 <- model.matrix(~ x1 + x2)        # normal model, two indep. regressors
> mm123 <- model.matrix(~ x1 + x2 + x3)  # bad model with near collinearity
> kappa(mm12)                            # a 'low' kappa is good
[1] 1.166029
> kappa(mm123)                           # a 'high' kappa indicates trouble
[1] 121530.7

We can go further by making the third regressor more and more collinear:

> x4 <- x1 + 2*x2 + rnorm(100)*0.000001  # even more collinear
> mm124 <- model.matrix(~ x1 + x2 + x4)
> kappa(mm124)
[1] 13955982
> x5 <- x1 + 2*x2                        # now x5 is an exact linear comb. of x1, x2
> mm125 <- model.matrix(~ x1 + x2 + x5)
> kappa(mm125)
[1] 1.067568e+16

This uses approximations by default; see help(kappa) for details.
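
If you want the exact value rather than the approximation, here is a minimal sketch using kappa()'s exact argument, or equivalently the singular values of the model matrix directly (output not shown):

> kappa(mm123, exact = TRUE)   # exact condition number
> sv <- svd(mm123)$d           # singular values of the model matrix
> max(sv) / min(sv)            # same quantity, computed by hand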


Just to add to what Dirk said about the condition number method: a common rule of thumb is that values of CN > 30 indicate severe collinearity. Other methods include:

1) the determinant of the correlation matrix, which ranges from 0 (perfect collinearity) to 1 (no collinearity). The determinant of the covariance matrix is not confined to [0, 1], but a value near zero signals (near) collinearity just the same:

# using Dirk's example
> det(cov(mm12[,-1]))
[1] 0.8856818
> det(cov(mm123[,-1]))
[1] 8.916092e-09
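
To get the bounded 0-to-1 scale mentioned above, use the correlation matrix instead. A sketch on the same objects (output not shown):

> det(cor(mm12[,-1]))    # near 1: essentially no collinearity
> det(cor(mm123[,-1]))   # near 0: near-perfect collinearity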

2) Using the fact that the determinant of a matrix equals the product of its eigenvalues, the presence of one or more very small eigenvalues indicates collinearity:

> eigen(cov(mm12[,-1]))$values
[1] 1.0876357 0.8143184
> eigen(cov(mm123[,-1]))$values
[1] 5.388022e+00 9.862794e-01 1.677819e-09
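
This ties back to the condition number: the exact condition number of X is the square root of the ratio of the largest to the smallest eigenvalue of X'X. A sketch, using the collinear model matrix from above:

> ev <- eigen(crossprod(mm123))$values  # eigenvalues of X'X
> sqrt(max(ev) / min(ev))               # matches kappa(mm123, exact = TRUE)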

3) The value of the Variance Inflation Factor (VIF). The VIF for predictor i is 1/(1 - R_i^2), where R_i^2 is the R^2 from a regression of predictor i on the remaining predictors. Collinearity is present when the VIF for at least one independent variable is large; a common rule of thumb is that VIF > 10 is cause for concern. An R implementation is available as vif() in the car package (see below; a by-hand sketch follows). I would also like to comment that the use of R^2 for determining collinearity should go hand in hand with visual examination of the scatterplots, because a single outlier can "cause" collinearity where it doesn't exist, or can hide collinearity where it does.
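
For illustration, here is a by-hand VIF for x3 from Dirk's example (a sketch; output not shown):

> r2 <- summary(lm(x3 ~ x1 + x2))$r.squared  # R^2 of x3 on the other predictors
> 1 / (1 - r2)                               # VIF for x3; far above 10 here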


You might like Vito Ricci's reference card "R Functions For Regression Analysis": http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf

It succinctly lists many useful regression-related functions in R, including diagnostic functions. In particular, it lists the vif function from the car package, which can assess multicollinearity: http://en.wikipedia.org/wiki/Variance_inflation_factor
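
A minimal usage sketch of car::vif on Dirk's simulated data; the response y here is hypothetical, invented only so that a model can be fit:

> library(car)
> y <- rnorm(100)               # hypothetical response, for illustration only
> vif(lm(y ~ x1 + x2 + x3))     # expect very large VIFs for all three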

Consideration of multicollinearity often goes hand in hand with issues of assessing variable importance. If this applies to you, perhaps check out the relaimpo package: http://prof.beuth-hochschule.de/groemping/relaimpo/
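
A minimal sketch of what that looks like, assuming the relaimpo package is installed and reusing the hypothetical y from the vif example above:

> library(relaimpo)
> fit <- lm(y ~ x1 + x2)            # well-behaved regressors only
> calc.relimp(fit, type = "lmg")    # decomposes R^2 into per-regressor shares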