Stepwise regression using p-values to drop variables with nonsignificant p-values Stepwise regression using p-values to drop variables with nonsignificant p-values r r

Stepwise regression using p-values to drop variables with nonsignificant p-values


Show your boss the following :

set.seed(100)x1 <- runif(100,0,1)x2 <- as.factor(sample(letters[1:3],100,replace=T))y <- x1+x1*(x2=="a")+2*(x2=="b")+rnorm(100)summary(lm(y~x1*x2))

Which gives :

            Estimate Std. Error t value Pr(>|t|)    (Intercept)  -0.1525     0.3066  -0.498  0.61995    x1            1.8693     0.6045   3.092  0.00261 ** x2b           2.5149     0.4334   5.802 8.77e-08 ***x2c           0.3089     0.4475   0.690  0.49180    x1:x2b       -1.1239     0.8022  -1.401  0.16451    x1:x2c       -1.0497     0.7873  -1.333  0.18566    ---Signif. codes:  0***0.001**0.01*0.05 ‘.’ 0.1 ‘ ’ 1 

Now, based on the p-values you would exclude which one? x2 is most significant and most non-significant at the same time.


Edit : To clarify : This exaxmple is not the best, as indicated in the comments. The procedure in Stata and SPSS is AFAIK also not based on the p-values of the T-test on the coefficients, but on the F-test after removal of one of the variables.

I have a function that does exactly that. This is a selection on "the p-value", but not of the T-test on the coefficients or on the anova results. Well, feel free to use it if it looks useful to you.

###################################### Automated model selection# Author      : Joris Meys# version     : 0.2# date        : 12/01/09######################################CHANGE LOG# 0.2   : check for empty scopevar vector###################################### Function has.interaction checks whether x is part of a term in terms# terms is a vector with names of terms from a modelhas.interaction <- function(x,terms){    out <- sapply(terms,function(i){        sum(1-(strsplit(x,":")[[1]] %in% strsplit(i,":")[[1]]))==0    })    return(sum(out)>0)}# Function Model.select# model is the lm object of the full model# keep is a list of model terms to keep in the model at all times# sig gives the significance for removal of a variable. Can be 0.1 too (see SPSS)# verbose=T gives the F-tests, dropped var and resulting model after model.select <- function(model,keep,sig=0.05,verbose=F){      counter=1      # check input      if(!is(model,"lm")) stop(paste(deparse(substitute(model)),"is not an lm object\n"))      # calculate scope for drop1 function      terms <- attr(model$terms,"term.labels")      if(missing(keep)){ # set scopevars to all terms          scopevars <- terms      } else{            # select the scopevars if keep is used          index <- match(keep,terms)          # check if all is specified correctly          if(sum(is.na(index))>0){              novar <- keep[is.na(index)]              warning(paste(                  c(novar,"cannot be found in the model",                  "\nThese terms are ignored in the model selection."),                  collapse=" "))              index <- as.vector(na.omit(index))          }          scopevars <- terms[-index]      }      # Backward model selection :       while(T){          # extract the test statistics from drop.          test <- drop1(model, scope=scopevars,test="F")          if(verbose){              cat("-------------STEP ",counter,"-------------\n",              "The drop statistics : \n")              print(test)          }          pval <- test[,dim(test)[2]]          names(pval) <- rownames(test)          pval <- sort(pval,decreasing=T)          if(sum(is.na(pval))>0) stop(paste("Model",              deparse(substitute(model)),"is invalid. Check if all coefficients are estimated."))          # check if all significant          if(pval[1]<sig) break # stops the loop if all remaining vars are sign.          # select var to drop          i=1          while(T){              dropvar <- names(pval)[i]              check.terms <- terms[-match(dropvar,terms)]              x <- has.interaction(dropvar,check.terms)              if(x){i=i+1;next} else {break}                        } # end while(T) drop var          if(pval[i]<sig) break # stops the loop if var to remove is significant          if(verbose){             cat("\n--------\nTerm dropped in step",counter,":",dropvar,"\n--------\n\n")                        }          #update terms, scopevars and model          scopevars <- scopevars[-match(dropvar,scopevars)]          terms <- terms[-match(dropvar,terms)]          formul <- as.formula(paste(".~.-",dropvar))          model <- update(model,formul)          if(length(scopevars)==0) {              warning("All variables are thrown out of the model.\n",              "No model could be specified.")              return()          }          counter=counter+1      } # end while(T) main loop      return(model)}


Why not try using the step() function specifying your testing method?

For example, for backward elimination, you type only a command:

step(FullModel, direction = "backward", test = "F")

and for stepwise selection, simply:

step(FullModel, direction = "both", test = "F")

This can display both the AIC values as well as the F and P values.


Here is an example. Start with the most complicated model: this includes interactions between all three explanatory variables.

model1 <-lm (ozone~temp*wind*rad)summary(model1)Coefficients:Estimate Std.Error t value Pr(>t)(Intercept) 5.683e+02 2.073e+02 2.741 0.00725 **temp          -1.076e+01 4.303e+00 -2.501 0.01401 *wind          -3.237e+01 1.173e+01 -2.760 0.00687 **rad           -3.117e-01 5.585e-01 -0.558 0.57799temp:wind      2.377e-01 1.367e-01 1.739 0.08519   temp:rad       8.402e-03 7.512e-03 1.119 0.26602wind:rad       2.054e-02 4.892e-02 0.420 0.47552temp:wind:rad -4.324e-04 6.595e-04 -0.656 0.51358

The three-way interaction is clearly not significant. This is how you remove it, to begin the process of model simplification:

model2 <- update(model1,~. - temp:wind:rad)summary(model2)

Depending on the results, you can continue simplifying your model:

model3 <- update(model2,~. - temp:rad)summary(model3)...

Alternatively you can use the automatic model simplification function step, to seehow well it does:

model_step <- step(model1)