How does glmnet's standardize argument handle dummy variables?
In short, yes: the dummy variables are standardized too, and there's a reason for that. The glmnet function takes a matrix as input for its x parameter, not a data frame, so it cannot distinguish factor columns the way it could if the argument were a data.frame. If you look at the R function, glmnet recodes the standardize argument internally as
isd = as.integer(standardize)
This converts the R logical to a 0 or 1 integer, which is then passed to the internal Fortran routines (elnet, lognet, et al.).
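To see why a matrix input loses the factor distinction, here is a small sketch in base R. The data frame df and its columns are made up for illustration. model.matrix expands the factor into 0/1 columns, and the result is a plain numeric matrix, which is all glmnet ever receives.

```r
# Hypothetical toy data: one continuous column and one factor.
df <- data.frame(
  age   = c(23, 35, 41, 58, 62, 29),
  group = factor(c("a", "b", "c", "a", "b", "c"))
)

# Expand the factor into dummy columns; drop the intercept column,
# since glmnet fits its own intercept.
X <- model.matrix(~ age + group, data = df)[, -1]

is.factor(df$group)  # TRUE: the data frame knows this column is categorical
is.numeric(X)        # TRUE: the matrix is just numbers
colnames(X)          # "age" "groupb" "groupc"

# The standardize flag is passed on as a plain integer.
standardize <- TRUE
isd <- as.integer(standardize)
isd                  # 1
```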
If you go even further by examining the FORTRAN code (fixed width - old school!), you'll see the following block:
          subroutine standard1 (no,ni,x,y,w,isd,intr,ju,xm,xs,ym,ys,xv,jerr)    989
          real x(no,ni),y(no),w(no),xm(ni),xs(ni),xv(ni)                        989
          integer ju(ni)                                                        990
          real, dimension (:), allocatable :: v
          allocate(v(1:no),stat=jerr)                                           993
          if(jerr.ne.0) return                                                  994
          w=w/sum(w)                                                            994
          v=sqrt(w)                                                             995
          if(intr .ne. 0)goto 10651                                             995
          ym=0.0                                                                995
          y=v*y                                                                 996
          ys=sqrt(dot_product(y,y)-dot_product(v,y)**2)                         996
          y=y/ys                                                                997
    10660 do 10661 j=1,ni                                                       997
          if(ju(j).eq.0)goto 10661                                              997
          xm(j)=0.0                                                             997
          x(:,j)=v*x(:,j)                                                       998
          xv(j)=dot_product(x(:,j),x(:,j))                                      999
          if(isd .eq. 0)goto 10681                                              999
          xbq=dot_product(v,x(:,j))**2                                          999
          vc=xv(j)-xbq                                                          1000
          xs(j)=sqrt(vc)                                                        1000
          x(:,j)=x(:,j)/xs(j)                                                   1000
          xv(j)=1.0+xbq/vc                                                      1001
          goto 10691                                                            1002
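Take a look at the lines marked 1000: they compute the (weighted) standard deviation xs(j) of column j and divide the column by it, which is the scaling step of the standardization formula applied to the x matrix.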
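As a rough R transcription of the lines marked 998 to 1000 (a sketch, not glmnet's own R code; the helper name scale_like_fortran is made up), each column is multiplied by sqrt(w) and then divided by the square root of sum(w*x^2) - sum(w*x)^2. With equal weights that is the standard deviation computed with divisor n rather than n - 1.

```r
# Mirrors the excerpted branch of standard1 for a single column x
# with observation weights w (hypothetical helper, for illustration only).
scale_like_fortran <- function(x, w = rep(1, length(x))) {
  w   <- w / sum(w)          # w=w/sum(w)
  v   <- sqrt(w)             # v=sqrt(w)
  xw  <- v * x               # x(:,j)=v*x(:,j)
  xv  <- sum(xw * xw)        # weighted mean of x^2
  xbq <- sum(v * xw)^2       # squared weighted mean of x
  xs  <- sqrt(xv - xbq)      # weighted standard deviation
  list(x = xw / xs, xs = xs)
}

dummy <- c(0, 0, 0, 1, 1, 0, 0, 1, 0, 0)  # a 0/1 dummy column
res <- scale_like_fortran(dummy)

# With equal weights, xs is the "divide by n" standard deviation.
n <- length(dummy)
all.equal(res$xs, sqrt(mean(dummy^2) - mean(dummy)^2))  # TRUE
all.equal(res$xs, sd(dummy) * sqrt((n - 1) / n))        # TRUE
```

Note that R's sd() and scale() use the n - 1 divisor, so their output differs from this by a factor of sqrt((n - 1) / n).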
Now, statistically speaking, one generally does not standardize categorical variables, so that the estimated coefficients stay interpretable. However, as pointed out by Tibshirani here, "The lasso method requires initial standardization of the regressors, so that the penalization scheme is fair to all regressors. For categorical regressors, one codes the regressor with dummy variables and then standardizes the dummy variables" - so while this introduces an arbitrary scaling between continuous and categorical variables, it is done so that all regressors are penalized equally.
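One way to see what this scaling does to a dummy: a 0/1 column with a share p of ones has standard deviation sqrt(p * (1 - p)) (using divisor n), so rare levels are stretched more than common ones. A small base R sketch with made-up columns:

```r
# Hypothetical dummy columns with different shares of ones.
common <- rep(c(1, 0), times = c(500, 500))  # p = 0.5
rare   <- rep(c(1, 0), times = c(20, 980))   # p = 0.02

# Standard deviation with divisor n, as in the Fortran excerpt.
sd_n <- function(x) sqrt(mean(x^2) - mean(x)^2)

p <- c(common = mean(common), rare = mean(rare))
s <- c(common = sd_n(common), rare = sd_n(rare))

all.equal(s, sqrt(p * (1 - p)))  # TRUE

# After dividing by s, a 0 -> 1 step in the dummy becomes a step of 1 / s.
1 / s  # common: 2, rare: about 7.14
```

So a coefficient on the standardized rare dummy corresponds to a coefficient roughly 7 times larger on the original 0/1 scale.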
glmnet doesn't know anything about dummy variables, because it doesn't have a formula interface (and hence doesn't touch model.frame and model.matrix). If you want them to be treated specially, you'll have to do it yourself.
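A minimal sketch of doing it yourself, assuming you want the continuous columns scaled and the dummies left as 0/1: build the matrix with model.matrix, scale only the chosen columns, and pass standardize = FALSE. The data and column names are made up, and whether leaving the dummies unscaled is a good idea depends on your problem.

```r
set.seed(1)

# Hypothetical toy data.
n  <- 200
df <- data.frame(
  age    = rnorm(n, mean = 40, sd = 10),
  income = rnorm(n, mean = 50000, sd = 15000),
  group  = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
y <- 0.05 * df$age + (df$group == "b") + rnorm(n)

# Dummy-code the factor; drop the intercept column.
X <- model.matrix(~ age + income + group, data = df)[, -1]

# Scale only the continuous columns (scale() uses the n - 1 divisor);
# leave the dummies as 0/1.
continuous <- c("age", "income")
X[, continuous] <- scale(X[, continuous])

# Fit without glmnet's own standardization (only if glmnet is installed).
if (requireNamespace("glmnet", quietly = TRUE)) {
  fit <- glmnet::glmnet(X, y, standardize = FALSE)
  print(coef(fit, s = 0.1))
}
```

With standardize = FALSE the penalty is applied to the columns exactly as supplied, so any scaling choice made here feeds directly into how strongly each coefficient is shrunk.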