Specify different types of missing values (NAs)


To my knowledge, base R doesn't have a built-in way to handle different NA types. (editor: It does: NA_integer_, NA_real_, NA_complex_, and NA_character_. See ?base::NA.)
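For reference, here is a quick sketch of what those typed NA constants look like. As far as I can tell they differ only in storage type; none of them records *why* a value is missing, which is why a package is still useful here.

```r
# Each NA constant has its own storage type
typed_nas <- list(NA, NA_integer_, NA_real_, NA_character_, NA_complex_)
na_types <- vapply(typed_nas, typeof, character(1))
print(na_types)

# ...but is.na() treats every one of them simply as "missing"
all_missing <- all(vapply(typed_nas, is.na, logical(1)))
print(all_missing)
```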

One option is to use a package which does so, for instance "memisc". It's a little bit of extra work, but it seems to do what you're looking for.

Here's an example:

First, your data. I've made a copy since we will be making some pretty significant changes to the dataset, and it's always nice to have a backup.

    set.seed(667)
    df <- data.frame(a = sample(c("Don't know/Not sure", "Unknown",
                                  "Refused", "Blue", "Red", "Green"),
                                20, replace = TRUE),
                     b = sample(c(1, 2, 3, 77, 88, 99), 10,
                                replace = TRUE),
                     f = round(rnorm(n = 10, mean = .90, sd = .08),
                               digits = 2),
                     g = sample(c("C", "M", "Y", "K"), 10,
                                replace = TRUE))
    df2 <- df

Let's factor variable "a":

    df2$a <- factor(df2$a,
                    levels = c("Blue", "Red", "Green",
                               "Don't know/Not sure",
                               "Refused", "Unknown"),
                    labels = c(1, 2, 3, 77, 88, 99))

Load the "memisc" library:

    library(memisc)

Now, convert variables "a" and "b" to items in "memisc":

    df2$a <- as.item(as.character(df2$a),
                     labels = structure(c(1, 2, 3, 77, 88, 99),
                                        names = c("Blue", "Red", "Green",
                                                  "Don't know/Not sure",
                                                  "Refused", "Unknown")),
                     missing.values = c(77, 88, 99))
    df2$b <- as.item(df2$b,
                     labels = c(1, 2, 3, 77, 88, 99),
                     missing.values = c(77, 88, 99))

By doing this, we have a new data type. Compare the following:

    as.factor(df2$a)
    #  [1] <NA>  <NA>  Red   Red   Green Green Red   Green <NA>  <NA>  Blue
    # [12] Green Blue  <NA>  <NA>  <NA>  Blue  Green <NA>  Red
    # Levels: Blue Red Green
    as.factor(include.missings(df2$a))
    #  [1] *Unknown             *Refused             Red
    #  [4] Red                  Green                Green
    #  [7] Red                  Green                *Unknown
    # [10] *Refused             Blue                 Green
    # [13] Blue                 *Don't know/Not sure *Unknown
    # [16] *Refused             Blue                 Green
    # [19] *Refused             Red
    # Levels: Blue Red Green *Don't know/Not sure *Refused *Unknown

We can use this information to create tables behaving the way you describe, while retaining all the original information.

    table(as.factor(include.missings(df2$a)), df2$g)
    #
    #                        C K M Y
    #   Blue                 0 0 1 2
    #   Red                  1 0 0 3
    #   Green                2 1 2 0
    #   *Don't know/Not sure 0 0 0 1
    #   *Refused             1 1 2 0
    #   *Unknown             0 0 3 0
    table(as.factor(df2$a), df2$g)
    #
    #         C K M Y
    #   Blue  0 0 1 2
    #   Red   1 0 0 3
    #   Green 2 1 2 0
    table(as.factor(df2$a), df2$g, useNA="always")
    #
    #         C K M Y <NA>
    #   Blue  0 0 1 2    0
    #   Red   1 0 0 3    0
    #   Green 2 1 2 0    0
    #   <NA>  1 1 5 1    0

The tables for the numeric column with missing data behave the same way.

    table(as.factor(include.missings(df2$b)), df2$g)
    #
    #       C K M Y
    #   1   0 0 0 0
    #   2   0 0 4 0
    #   3   0 2 0 2
    #   *77 0 0 2 2
    #   *88 2 0 0 0
    #   *99 2 0 2 2
    table(as.factor(df2$b), df2$g, useNA="always")
    #
    #        C K M Y <NA>
    #   1    0 0 0 0    0
    #   2    0 0 4 0    0
    #   3    0 2 0 2    0
    #   <NA> 4 0 4 4    0

As a bonus, you get the facility to generate nice codebooks:

    > codebook(df2$a)
    ========================================================================

       df2$a

    ------------------------------------------------------------------------

       Storage mode: character
       Measurement: nominal
       Missing values: 77, 88, 99

                Values and labels    N    Percent

        1   'Blue'                   3   25.0 15.0
        2   'Red'                    4   33.3 20.0
        3   'Green'                  5   41.7 25.0
       77 M 'Don't know/Not sure'    1         5.0
       88 M 'Refused'                4        20.0
       99 M 'Unknown'                3        15.0

However, I do also suggest you read the comment from @Maxim.K about what really constitutes missing values.


To retain the original values, you can create new columns in which you code the NA information, for example:

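    df <- transform(df, b.na = ifelse(b %in% c('77', '88', '99'), NA, b))
    # as.character() guards against `a` being a factor (the data.frame() default
    # before R 4.0), in which case ifelse() would return the integer level codes
    df <- transform(df, a.na = ifelse(a %in%
                             c("Don't know/Not sure", "Unknown", "Refused"),
                             NA, as.character(a)))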

Then you can do something like this:

    table(df$b.na, df$g)

        C K M Y
      2 0 0 4 0
      3 0 2 0 2
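If you also want to see how many values were recoded to NA, table() has a useNA argument. A minimal sketch on toy vectors (the data below are made up for illustration and are not taken from df):

```r
# Toy stand-ins for df$b and df$g (hypothetical values)
b <- c(2, 3, 77, 88, 2, 99)
g <- c("M", "K", "C", "Y", "M", "C")

# Recode the special codes to NA, as in the new-column approach
b_na <- ifelse(b %in% c(77, 88, 99), NA, b)

# useNA = "ifany" adds an <NA> row only when NAs are present
tab <- table(b_na, g, useNA = "ifany")
print(tab)
```

The original codes stay available in b, so nothing is lost by the recoding.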

Another option, which avoids creating new columns, is to use the exclude argument of table(). This drops the unwanted values from the table altogether, which is different from treating them as missing values:

    table(df$a, df$g,
          exclude = c('77', '88', '99', "Don't know/Not sure", "Unknown", "Refused"))

            C K M Y
      Blue  0 0 1 2
      Green 2 1 2 0
      Red   1 0 0 3

You can define some global constants (even though this is not recommended) to group your "missing values", and use them in the rest of your program. Something like this:

    B_MISSING <- c('77', '88', '99')
    A_MISSING <- c("Don't know/Not sure", "Unknown", "Refused")
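As a rough sketch of how such a constant might then be used, here it is applied to a toy vector (the vector b below is made up for illustration; in practice it would be df$b):

```r
# Same codes as the constant defined above
B_MISSING <- c('77', '88', '99')

# Toy vector standing in for df$b (hypothetical values)
b <- c(1, 77, 3, 99, 2, 88)

# %in% coerces the numbers to character before matching
is_coded_missing <- b %in% B_MISSING

# Blank out the coded values in a copy, leaving b itself untouched
b_clean <- replace(b, is_coded_missing, NA)

print(is_coded_missing)
print(b_clean)
```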


If you are willing to stick to numeric values then NA, Inf, -Inf, and NaN could be used for different missing values. You can then use is.finite to distinguish between them and normal values:

    x <- c(NA, Inf, -Inf, NaN, 1)
    is.finite(x)
    ## [1] FALSE FALSE FALSE FALSE  TRUE

is.infinite, is.nan and is.na are also useful here.
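A short sketch of how these predicates separate the special values. The one thing to watch is that is.na() is also TRUE for NaN, so picking out a plain NA needs two tests combined:

```r
x <- c(NA, Inf, -Inf, NaN, 1)

inf_flag <- is.infinite(x)  # TRUE only for Inf and -Inf
nan_flag <- is.nan(x)       # TRUE only for NaN
na_flag  <- is.na(x)        # TRUE for NA and also for NaN

# Plain NA, excluding NaN
plain_na <- is.na(x) & !is.nan(x)

print(rbind(inf_flag, nan_flag, na_flag, plain_na))
```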

We could have a special print function that displays them in a more meaningful way, or even create a special class, but even without that the above would divide the data into finite values and several kinds of non-finite values.
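A minimal sketch of what such a display helper might look like. The function name label_missing and the assignment of reasons to NA, Inf, -Inf and NaN are purely illustrative assumptions; any mapping would do as long as it is used consistently.

```r
# Hypothetical helper: turn special values into readable labels
label_missing <- function(x) {
  out <- as.character(x)
  out[is.na(x) & !is.nan(x)]  <- "Unknown"              # plain NA
  out[is.infinite(x) & x > 0] <- "Refused"              # Inf
  out[is.infinite(x) & x < 0] <- "Don't know/Not sure"  # -Inf
  out[is.nan(x)]              <- "Not applicable"       # NaN
  out
}

x <- c(NA, Inf, -Inf, NaN, 1)
labelled <- label_missing(x)
print(labelled)
```

One caveat with this encoding: na.rm = TRUE drops NA and NaN but not Inf or -Inf, so summaries such as mean() still need the infinite codes filtered out first, for example with x[is.finite(x)].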