How do I handle multiple kinds of missingness in R?

r data-structures stata survey missing-data

I know what you're looking for, and it is not implemented in R. I don't know of a package that implements it, but it's not too difficult to code yourself.

A workable way is to add a data frame containing the codes to the attributes. To avoid duplicating the whole data frame, I'd store only the indices of the coded cells in that attribute instead of reconstructing a complete data frame.

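For example:

    NACode <- function(x, code) {
        # Replace every coded value with NA, column by column
        Df <- sapply(x, function(i) {
            i[i %in% code] <- NA
            i
        })
        # Linear (column-major) positions of the NAs in the resulting matrix
        id <- which(is.na(Df))
        # Convert to row/column indices; subtracting 1 first keeps the
        # last row from wrapping to row 0 of the next column
        rowid <- (id - 1L) %% nrow(x) + 1L
        colid <- (id - 1L) %/% nrow(x) + 1
        NAdf <- data.frame(
            id, rowid, colid,
            value = as.matrix(x)[id]
        )
        Df <- as.data.frame(Df)
        attr(Df, "NAcode") <- NAdf
        Df
    }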

This lets you do:

    > Df <- data.frame(A = 1:10, B = c(1:5, -1, -2, -3, 9, 10))
    > code <- list("Missing" = -1, "Not Answered" = -2, "Don't know" = -3)
    > DfwithNA <- NACode(Df, code)
    > str(DfwithNA)
    'data.frame':   10 obs. of  2 variables:
     $ A: num  1 2 3 4 5 6 7 8 9 10
     $ B: num  1 2 3 4 5 NA NA NA 9 10
     - attr(*, "NAcode")='data.frame':      3 obs. of  4 variables:
      ..$ id   : int  16 17 18
      ..$ rowid: int  6 7 8
      ..$ colid: num  2 2 2
      ..$ value: num  -1 -2 -3

The function can also be adjusted to add an extra attribute that gives you the label for the different values; see also this question. You could transform back with:

    ChangeNAToCode <- function(x, code) {
        NAval <- attr(x, "NAcode")
        # Restore the original value only for the requested codes
        for (i in which(NAval$value %in% code))
            x[NAval$rowid[i], NAval$colid[i]] <- NAval$value[i]
        x
    }

    > Dfback <- ChangeNAToCode(DfwithNA, c(-2, -3))
    > str(Dfback)
    'data.frame':   10 obs. of  2 variables:
     $ A: num  1 2 3 4 5 6 7 8 9 10
     $ B: num  1 2 3 4 5 NA -2 -3 9 10
     - attr(*, "NAcode")='data.frame':      3 obs. of  4 variables:
      ..$ id   : int  16 17 18
      ..$ rowid: int  6 7 8
      ..$ colid: num  2 2 2
      ..$ value: num  -1 -2 -3
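
The label attribute mentioned earlier could be as simple as an extra column in the stored index table. This is only a sketch: the `label` column name is my own choice, and `NAdf` is rebuilt by hand here as a stand-in for the "NAcode" attribute that NACode() creates.

```r
# Named list of codes, as in the example above
code <- list("Missing" = -1, "Not Answered" = -2, "Don't know" = -3)

# Stand-in for attr(DfwithNA, "NAcode"), copied from the str() output
NAdf <- data.frame(id = 16:18, rowid = 6:8, colid = c(2, 2, 2),
                   value = c(-1, -2, -3))

# Look up each stored value in the code list and attach its name
# ("label" is a hypothetical column name, not part of the original function)
NAdf$label <- names(code)[match(NAdf$value, unlist(code))]
NAdf
```

Inside NACode() the same line could be run on `NAdf` just before it is attached with attr().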

This lets you change back only the codes you want, if that is ever necessary. The function can be adapted to return all codes when no argument is given. Similar functions can be written to extract data based on the code; I expect you can work that one out yourself.
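
One possible shape for such an extraction function is sketched below. The name `RowsWithCode` is hypothetical, and the example object is rebuilt by hand (mirroring the str() output above) so the snippet runs on its own. The default argument also shows one way to fall back to all stored codes when none is given.

```r
# Rebuild the example object by hand; the attribute layout mirrors
# the str() output shown above
DfwithNA <- data.frame(A = as.numeric(1:10),
                       B = c(1:5, NA, NA, NA, 9, 10))
attr(DfwithNA, "NAcode") <- data.frame(
    id    = 16:18,
    rowid = 6:8,
    colid = c(2, 2, 2),
    value = c(-1, -2, -3)
)

# Hypothetical helper: return the rows in which some cell originally
# held one of the given codes (all stored codes by default)
RowsWithCode <- function(x, code = attr(x, "NAcode")$value) {
    NAval <- attr(x, "NAcode")
    hit <- NAval$value %in% code
    x[sort(unique(NAval$rowid[hit])), , drop = FALSE]
}

res <- RowsWithCode(DfwithNA, c(-2, -3))   # rows 7 and 8
all_coded <- RowsWithCode(DfwithNA)        # rows 6, 7 and 8
```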

In short: using attributes and indices might be a nice way of doing it.

The most obvious way seems to be to use two vectors:

Vector 1: a data vector, where all missing values are represented using NA. For example, c(2, 50, NA, NA)
Vector 2: a factor indicating the type of data. For example, factor(c(1, 1, -1, -7)), where level 1 indicates a correctly answered question.

Having this structure would give you a great deal of flexibility, since all the standard na.rm arguments still work with your data vector, but you can use more complex concepts with the factor vector.
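
A minimal sketch of the two-vector idea, kept together in a data frame. The level labels ("answered", "refused", "not asked") and the names `value`, `status`, `survey` are invented for illustration; only the codes 1, -1 and -7 come from the example above.

```r
# Data vector: every kind of missingness is a plain NA
value <- c(2, 50, NA, NA)

# Companion factor describing each entry; the labels are assumptions
status <- factor(c(1, 1, -1, -7),
                 levels = c(1, -1, -7),
                 labels = c("answered", "refused", "not asked"))

survey <- data.frame(value, status)

# Standard NA handling still works on the data vector
avg <- mean(survey$value, na.rm = TRUE)

# The factor carries the extra detail: counts per kind of entry
counts <- table(survey$status)

# Keep only the rows where the question was actually asked
asked <- survey[survey$status != "not asked", ]
```

Because both vectors live in one data frame, subsetting on `status` keeps the two in step.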

Update following questions from @gsk3

"Data storage will dramatically increase": The data storage will double. However, if doubling the size causes a real problem, it may be worth thinking about other strategies.
"Programs don't automatically deal with it": That's a strange comment. Some functions handle NAs in a sensible way by default. However, you want to treat the NAs differently, which implies that you will have to do something bespoke. If you just want to analyse data where the NAs are "Question not asked", then use a data frame subset.
"Now you have to manipulate two vectors together every time you want to conceptually manipulate a variable": I suppose I envisaged a data frame of the two vectors. I would subset the data frame based on the second vector.
"There's no standard implementation, so my solution might differ from someone else's": True. However, if an off-the-shelf package doesn't meet your needs, then (almost) by definition you want to do something different.

I should state that I have never analysed survey data (although I have analysed large biological data sets). My answers above appear quite defensive, but that's not my intention. I think your question is a good one, and I'm interested in other responses.

This is more than just a "technical" issue. You should have a thorough statistical background in missing value analysis and imputation. One solution requires playing with R and ggobi: you can assign extremely negative values to the several types of NA (putting the NAs into the margin) and do some diagnostics "manually". You should bear in mind that there are three types of missingness:

MCAR - missing completely at random, where P(missing|observed,unobserved) = P(missing)
MAR - missing at random, where P(missing|observed,unobserved) = P(missing|observed)
MNAR - missing not at random (or non-ignorable), where P(missing|observed,unobserved) still depends on the unobserved values and cannot be reduced to either form above.
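
To make the three definitions concrete, here is a small simulation sketch. Everything in it (the variable names, the seed, the 0.3 rate and the logistic missingness probabilities) is an arbitrary assumption chosen for illustration, not something taken from a real survey.

```r
set.seed(42)                  # arbitrary seed
n <- 10000
x <- rnorm(n)                 # always observed
y <- x + rnorm(n)             # the variable that will have missing values

mcar <- runif(n) < 0.3        # MCAR: P(missing) is a constant
mar  <- runif(n) < plogis(x)  # MAR: P(missing) depends only on observed x
mnar <- runif(n) < plogis(y)  # MNAR: P(missing) depends on y itself

y_mcar <- replace(y, mcar, NA)
y_mar  <- replace(y, mar,  NA)
y_mnar <- replace(y, mnar, NA)

# Mean of the observed part of y under each mechanism,
# next to the mean of the complete data
c(full = mean(y),
  MCAR = mean(y_mcar, na.rm = TRUE),
  MAR  = mean(y_mar,  na.rm = TRUE),
  MNAR = mean(y_mnar, na.rm = TRUE))

# Under MAR, the complete-case regression of y on x
# should still recover a slope close to 1
coef(lm(y_mar ~ x))
```

Under MCAR the observed mean should stay close to the full-data mean, while under MNAR it is pulled downward because large values of y are the ones most likely to go missing.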

IMHO this question is more suitable for CrossValidated.

But here's a link from SO that you may find useful:

Handling missing/incomplete data in R--is there function to mask but not remove NAs?
