How do I handle multiple kinds of missingness in R?


I know what you are looking for, and it is not implemented in R. I don't know of a package that implements it, but it is not too difficult to code yourself.

A workable approach is to attach a data frame containing the codes as an attribute. To avoid duplicating the whole data frame, I'd store only the indices of the recoded cells in that attribute rather than a complete copy.

For example:

    NACode <- function(x, code){
        # Replace every coded value with NA, column by column
        Df <- sapply(x, function(i){
            i[i %in% code] <- NA
            i
        })
        # Linear (column-major) positions of the NA cells
        id <- which(is.na(Df))
        # Convert to row/column indices; subtracting 1 first keeps
        # cells in the last row from getting row 0 and the wrong column
        rowid <- (id - 1L) %% nrow(x) + 1L
        colid <- (id - 1L) %/% nrow(x) + 1
        # Remember where each code was and what it was
        NAdf <- data.frame(
            id, rowid, colid,
            value = as.matrix(x)[id]
        )
        Df <- as.data.frame(Df)
        attr(Df, "NAcode") <- NAdf
        Df
    }

This lets you do:

    > Df <- data.frame(A = 1:10, B = c(1:5, -1, -2, -3, 9, 10))
    > code <- list("Missing" = -1, "Not Answered" = -2, "Don't know" = -3)
    > DfwithNA <- NACode(Df, code)
    > str(DfwithNA)
    'data.frame':   10 obs. of  2 variables:
     $ A: num  1 2 3 4 5 6 7 8 9 10
     $ B: num  1 2 3 4 5 NA NA NA 9 10
     - attr(*, "NAcode")='data.frame':      3 obs. of  4 variables:
      ..$ id   : int  16 17 18
      ..$ rowid: int  6 7 8
      ..$ colid: num  2 2 2
      ..$ value: num  -1 -2 -3

The function can also be adjusted to add an extra attribute that gives the label for each value; see also this question. You can transform back with:

    ChangeNAToCode <- function(x, code){
        NAval <- attr(x, "NAcode")
        # Restore only the cells whose original value is one of the requested codes
        for(i in which(NAval$value %in% code))
            x[NAval$rowid[i], NAval$colid[i]] <- NAval$value[i]
        x
    }

    > Dfback <- ChangeNAToCode(DfwithNA, c(-2, -3))
    > str(Dfback)
    'data.frame':   10 obs. of  2 variables:
     $ A: num  1 2 3 4 5 6 7 8 9 10
     $ B: num  1 2 3 4 5 NA -2 -3 9 10
     - attr(*, "NAcode")='data.frame':      3 obs. of  4 variables:
      ..$ id   : int  16 17 18
      ..$ rowid: int  6 7 8
      ..$ colid: num  2 2 2
      ..$ value: num  -1 -2 -3

This lets you restore only the codes you want, if that is ever necessary. The function can be adapted to restore all codes when no argument is given. Similar functions can be written to extract data based on the code; I guess you can figure that one out yourself.
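As a rough sketch of such an extraction function, something like the following could list the recoded cells for a given code and attach the label from the code list. `WhichCode` and its arguments are hypothetical names, not part of any package; the only assumption about the data is an "NAcode" attribute with the `rowid`, `colid` and `value` columns shown above, which the demo rebuilds by hand so that the block runs on its own.

```r
# Hypothetical helper: list the cells that were recoded to NA.
# Assumes x carries an "NAcode" attribute with columns rowid, colid, value.
WhichCode <- function(x, code = NULL, labels = NULL) {
  NAval <- attr(x, "NAcode")
  # With no code given, return every recoded cell
  if (!is.null(code)) {
    NAval <- NAval[NAval$value %in% unlist(code), , drop = FALSE]
  }
  # labels is a named list such as list("Missing" = -1)
  if (!is.null(labels)) {
    NAval$label <- names(labels)[match(NAval$value, unlist(labels))]
  }
  NAval$column <- names(x)[NAval$colid]
  NAval
}

# Stand-alone demo: rebuild by hand the object from the example above
DfwithNA <- data.frame(A = 1:10, B = c(1:5, NA, NA, NA, 9, 10))
attr(DfwithNA, "NAcode") <- data.frame(
  id = 16:18, rowid = 6:8, colid = c(2, 2, 2), value = c(-1, -2, -3)
)
code <- list("Missing" = -1, "Not Answered" = -2, "Don't know" = -3)

WhichCode(DfwithNA, labels = code)             # all recoded cells, with labels
WhichCode(DfwithNA, code = -2, labels = code)  # only the "Not Answered" cells
```

Returning the bookkeeping rows rather than the data itself keeps the helper small; the `rowid` column can then be used to subset the data frame.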

In short: using attributes and indices might be a nice way of doing it.


The most obvious way seems to be to use two vectors:

  • Vector 1: a data vector, where all missing values are represented using NA. For example, c(2, 50, NA, NA)
  • Vector 2: a factor indicating the type of data. For example, factor(c(1, 1, -1, -7)), where level 1 indicates a correctly answered question.

This structure gives you a great deal of flexibility: all the standard na.rm arguments still work with your data vector, and you can express more complex concepts with the factor.
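A minimal sketch of this layout, using the two example vectors from the list above. The column names and the level labels are assumptions made for illustration; the answer does not say what -1 and -7 stand for.

```r
# Value + status stored side by side in one data frame.
# The labels for -1 and -7 are assumed, not taken from the answer.
survey <- data.frame(
  value  = c(2, 50, NA, NA),
  status = factor(c(1, 1, -1, -7),
                  levels = c(1, -1, -7),
                  labels = c("Answered", "Missing", "Not asked"))
)

# Standard NA handling still works on the data vector
mean(survey$value, na.rm = TRUE)  # 26

# The factor keeps the different kinds of missingness apart
table(survey$status)

# Subset on the factor, e.g. drop rows where the question was never asked
asked <- subset(survey, status != "Not asked")
```

Keeping both vectors in one data frame means a single subset call filters them together.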

Update following questions from @gsk3

  1. "Data storage will dramatically increase." The data storage will double. However, if doubling the size causes a real problem, it may be worth thinking about other strategies.
  2. "Programs don't automatically deal with it." That's a strange comment. Some functions handle NAs sensibly by default. However, you want to treat the NAs differently, which implies that you will have to do something bespoke. If you just want to analyse data where the NAs are "Question not asked", then use a data frame subset.
  3. "Now you have to manipulate two vectors together every time you want to conceptually manipulate a variable." I envisaged a data frame holding the two vectors. I would subset the data frame based on the second vector.
  4. "There's no standard implementation, so my solution might differ from someone else's." True. However, if an off-the-shelf package doesn't meet your needs, then (almost) by definition you want to do something different.

I should state that I have never analysed survey data (although I have analysed large biological data sets). My answers above appear quite defensive, but that's not my intention. I think your question is a good one, and I'm interested in other responses.


This is more than just a "technical" issue. You should have a thorough statistical background in missing value analysis and imputation. One solution requires playing with R and ggobi. You can assign extremely negative values to the several types of NA (put NAs into the margin) and do some diagnostics "manually". You should bear in mind that there are three types of missingness:

  • MCAR - missing completely at random, where P(missing|observed,unobserved) = P(missing)
  • MAR - missing at random, where P(missing|observed,unobserved) = P(missing|observed)
  • MNAR - missing not at random (or non-ignorable), where P(missing|observed,unobserved) depends on the unobserved values themselves and so cannot be reduced to a function of the observed data alone.
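The three mechanisms can be illustrated with a small simulation. This is only a sketch on made-up data: the variables `x` and `y`, the sample size and the missingness probabilities are all assumptions chosen for illustration.

```r
set.seed(1)
n <- 10000
x <- rnorm(n)      # always observed
y <- x + rnorm(n)  # will receive missing values; its true mean is 0

# TRUE marks a value of y that goes missing
miss_mcar <- runif(n) < 0.5        # P(missing) is constant
miss_mar  <- runif(n) < plogis(x)  # P(missing) depends on the observed x only
miss_mnar <- runif(n) < plogis(y)  # P(missing) depends on the unobserved y itself

# Mean of the values of y that remain observed under each mechanism
obs_means <- c(MCAR = mean(y[!miss_mcar]),
               MAR  = mean(y[!miss_mar]),
               MNAR = mean(y[!miss_mnar]))
obs_means
```

In this setup the MCAR mean should land near the true mean of 0, while the MAR and MNAR means are pulled downward because larger values are more likely to be dropped. Under MAR the shift can in principle be corrected using `x`; under MNAR it cannot be corrected from the observed data alone.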

IMHO this question is more suitable for CrossValidated.

But here's a link from SO that you may find useful:

Handling missing/incomplete data in R--is there function to mask but not remove NAs?