predict.lm() with an unknown factor level in test data predict.lm() with an unknown factor level in test data r r

predict.lm() with an unknown factor level in test data


You have to remove the extra levels before any calculation, like:

> id <- which(!(foo.new$predictor %in% levels(foo$predictor)))> foo.new$predictor[id] <- NA> predict(model,newdata=foo.new)         1          2          3          4 -0.1676941 -0.6454521  0.4524391         NA 

This is a more general way of doing it, it will set all levels that do not occur in the original data to NA. As Hadley mentioned in the comments, they could have chosen to include this in the predict() function, but they didn't

Why you have to do that becomes obvious if you look at the calculation itself. Internally, the predictions are calculated as :

model.matrix(~predictor,data=foo) %*% coef(model)        [,1]1 -0.16769412 -0.64545213  0.4524391

At the bottom you have both model matrices. You see that the one for foo.new has an extra column, so you can't use the matrix calculation any more. If you would use the new dataset to model, you would also get a different model, being one with an extra dummy variable for the extra level.

> model.matrix(~predictor,data=foo)  (Intercept) predictorB predictorC1           1          0          02           1          1          03           1          0          1attr(,"assign")[1] 0 1 1attr(,"contrasts")attr(,"contrasts")$predictor[1] "contr.treatment"> model.matrix(~predictor,data=foo.new)  (Intercept) predictorB predictorC predictorD1           1          0          0          02           1          1          0          03           1          0          1          04           1          0          0          1attr(,"assign")[1] 0 1 1 1attr(,"contrasts")attr(,"contrasts")$predictor[1] "contr.treatment"

You can't just delete the last column from the model matrix either, because even if you do that, both other levels are still influenced. The code for level A would be (0,0). For B this is (1,0), for C this (0,1) ... and for D it is again (0,0)! So your model would assume that A and D are the same level if it would naively drop the last dummy variable.

On a more theoretical part: It is possible to build a model without having all the levels. Now, as I tried to explain before, that model is only valid for the levels you used when building the model. If you come across new levels, you have to build a new model to include the extra information. If you don't do that, the only thing you can do is delete the extra levels from the dataset. But then you basically lose all information that was contained in it, so it's generally not considered good practice.


If you want to deal with the missing levels in your data after creating your lm model but before calling predict (given we don't know exactly what levels might be missing beforehand) here is function I've built to set all levels not in the model to NA - the prediction will also then give NA and you can then use an alternative method to predict these values.

object will be your lm output from lm(...,data=trainData)

data will be the data frame you want to create predictions for

missingLevelsToNA<-function(object,data){  #Obtain factor predictors in the model and their levels ------------------  factors<-(gsub("[-^0-9]|as.factor|\\(|\\)", "",names(unlist(object$xlevels))))  factorLevels<-unname(unlist(object$xlevels))  modelFactors<-as.data.frame(cbind(factors,factorLevels))  #Select column names in your data that are factor predictors in your model -----  predictors<-names(data[names(data) %in% factors])  #For each factor predictor in your data if the level is not in the model set the value to NA --------------  for (i in 1:length(predictors)){    found<-data[,predictors[i]] %in% modelFactors[modelFactors$factors==predictors[i],]$factorLevels    if (any(!found)) data[!found,predictors[i]]<-NA  }  data}


Tidied and extended the function by MorgenBall. It is also implemented in sperrorest now.

Additional features

  • drops unused factor levels rather than just setting the missing values to NA.
  • issues a message to the user that factor levels have been dropped
  • checks for existence of factor variables in test_data and returns original data.frame if non are present
  • works not only for lm, glm and but also for glmmPQL

Note: The function shown here may change (improve) over time.

#' @title remove_missing_levels#' @description Accounts for missing factor levels present only in test data#' but not in train data by setting values to NA#'#' @import magrittr#' @importFrom gdata unmatrix#' @importFrom stringr str_split#'#' @param fit fitted model on training data#'#' @param test_data data to make predictions for#'#' @return data.frame with matching factor levels to fitted model#'#' @keywords internal#'#' @exportremove_missing_levels <- function(fit, test_data) {  # https://stackoverflow.com/a/39495480/4185785  # drop empty factor levels in test data  test_data %>%    droplevels() %>%    as.data.frame() -> test_data  # 'fit' object structure of 'lm' and 'glmmPQL' is different so we need to  # account for it  if (any(class(fit) == "glmmPQL")) {    # Obtain factor predictors in the model and their levels    factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",                     names(unlist(fit$contrasts))))    # do nothing if no factors are present    if (length(factors) == 0) {      return(test_data)    }    map(fit$contrasts, function(x) names(unmatrix(x))) %>%      unlist() -> factor_levels    factor_levels %>% str_split(":", simplify = TRUE) %>%      extract(, 1) -> factor_levels    model_factors <- as.data.frame(cbind(factors, factor_levels))  } else {    # Obtain factor predictors in the model and their levels    factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",                     names(unlist(fit$xlevels))))    # do nothing if no factors are present    if (length(factors) == 0) {      return(test_data)    }    factor_levels <- unname(unlist(fit$xlevels))    model_factors <- as.data.frame(cbind(factors, factor_levels))  }  # Select column names in test data that are factor predictors in  # trained model  predictors <- names(test_data[names(test_data) %in% factors])  # For each factor predictor in your data, if the level is not in the model,  # set the value to NA  for (i in 1:length(predictors)) {    found <- test_data[, predictors[i]] %in% model_factors[      model_factors$factors == predictors[i], ]$factor_levels    if (any(!found)) {      # track which variable      var <- predictors[i]      # set to NA      test_data[!found, predictors[i]] <- NA      # drop empty factor levels in test data      test_data %>%        droplevels() -> test_data      # issue warning to console      message(sprintf(paste0("Setting missing levels in '%s', only present",                             " in test data but missing in train data,",                             " to 'NA'."),                      var))    }  }  return(test_data)}

We can apply this function to the example in the question as follows:

predict(model,newdata=remove_missing_levels (fit=model, test_data=foo.new))

While trying to improve this function, I came across the fact that SL learning methods like lm, glm etc. need the same levels in train & test while ML learning methods (svm, randomForest) fail if the levels are removed. These methods need all levels in train & test.

A general solution is quite hard to achieve since every fitted model has a different way of storing their factor level component (fit$xlevels for lm and fit$contrasts for glmmPQL). At least it seems to be consistent across lm related models.