Specify different types of missing values (NAs)
To my knowledge, base R doesn't have a built-in way to handle different NA types. (editor: It does: NA_integer_, NA_real_, NA_complex_, and NA_character_. See ?base::NA.)
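As a quick check of the editor's note, the sketch below (base R only) prints the storage type of each typed NA. Note that these constants differ only in storage type; they do not record why a value is missing, which appears to be what the question is after.

```r
# Each typed NA has its own storage type.
na_types <- c(
  logical   = typeof(NA),
  integer   = typeof(NA_integer_),
  double    = typeof(NA_real_),
  complex   = typeof(NA_complex_),
  character = typeof(NA_character_)
)
print(na_types)

# is.na() treats all of them alike: it reports missingness, not a reason.
all_missing <- all(sapply(
  list(NA, NA_integer_, NA_real_, NA_complex_, NA_character_),
  is.na
))
print(all_missing)
```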
One option is to use a package which does so, for instance "memisc". It's a little bit of extra work, but it seems to do what you're looking for.
Here's an example:
First, your data. I've made a copy since we will be making some pretty significant changes to the dataset, and it's always nice to have a backup.
    set.seed(667)
    df <- data.frame(
      a = sample(c("Don't know/Not sure", "Unknown", "Refused",
                   "Blue", "Red", "Green"), 20, replace = TRUE),
      b = sample(c(1, 2, 3, 77, 88, 99), 10, replace = TRUE),
      f = round(rnorm(n = 10, mean = .90, sd = .08), digits = 2),
      g = sample(c("C", "M", "Y", "K"), 10, replace = TRUE)
    )
    df2 <- df
Let's factor variable "a":
    df2$a <- factor(df2$a,
                    levels = c("Blue", "Red", "Green",
                               "Don't know/Not sure", "Refused", "Unknown"),
                    labels = c(1, 2, 3, 77, 88, 99))
Load the "memisc" library:
library(memisc)
Now, convert variables "a" and "b" to items in "memisc":
    df2$a <- as.item(as.character(df2$a),
                     labels = structure(c(1, 2, 3, 77, 88, 99),
                                        names = c("Blue", "Red", "Green",
                                                  "Don't know/Not sure",
                                                  "Refused", "Unknown")),
                     missing.values = c(77, 88, 99))
    df2$b <- as.item(df2$b,
                     labels = c(1, 2, 3, 77, 88, 99),
                     missing.values = c(77, 88, 99))
By doing this, we have a new data type. Compare the following:
    as.factor(df2$a)
    #  [1] <NA>  <NA>  Red   Red   Green Green Red   Green <NA>  <NA>  Blue
    # [12] Green Blue  <NA>  <NA>  <NA>  Blue  Green <NA>  Red
    # Levels: Blue Red Green

    as.factor(include.missings(df2$a))
    #  [1] *Unknown             *Refused             Red
    #  [4] Red                  Green                Green
    #  [7] Red                  Green                *Unknown
    # [10] *Refused             Blue                 Green
    # [13] Blue                 *Don't know/Not sure *Unknown
    # [16] *Refused             Blue                 Green
    # [19] *Refused             Red
    # Levels: Blue Red Green *Don't know/Not sure *Refused *Unknown
We can use this information to create tables behaving the way you describe, while retaining all the original information.
    table(as.factor(include.missings(df2$a)), df2$g)
    #
    #                        C K M Y
    #   Blue                 0 0 1 2
    #   Red                  1 0 0 3
    #   Green                2 1 2 0
    #   *Don't know/Not sure 0 0 0 1
    #   *Refused             1 1 2 0
    #   *Unknown             0 0 3 0

    table(as.factor(df2$a), df2$g)
    #
    #         C K M Y
    #   Blue  0 0 1 2
    #   Red   1 0 0 3
    #   Green 2 1 2 0

    table(as.factor(df2$a), df2$g, useNA="always")
    #
    #         C K M Y <NA>
    #   Blue  0 0 1 2    0
    #   Red   1 0 0 3    0
    #   Green 2 1 2 0    0
    #   <NA>  1 1 5 1    0
The tables for the numeric column with missing data behave the same way.
    table(as.factor(include.missings(df2$b)), df2$g)
    #
    #       C K M Y
    #   1   0 0 0 0
    #   2   0 0 4 0
    #   3   0 2 0 2
    #   *77 0 0 2 2
    #   *88 2 0 0 0
    #   *99 2 0 2 2

    table(as.factor(df2$b), df2$g, useNA="always")
    #
    #        C K M Y <NA>
    #   1    0 0 0 0    0
    #   2    0 0 4 0    0
    #   3    0 2 0 2    0
    #   <NA> 4 0 4 4    0
As a bonus, you get the facility to generate nice codebooks:
    > codebook(df2$a)
    ========================================================================

       df2$a

    ------------------------------------------------------------------------

       Storage mode: character
       Measurement: nominal
       Missing values: 77, 88, 99

       Values and labels              N    Percent

        1   'Blue'                    3   25.0  15.0
        2   'Red'                     4   33.3  20.0
        3   'Green'                   5   41.7  25.0
       77 M 'Don't know/Not sure'     1          5.0
       88 M 'Refused'                 4         20.0
       99 M 'Unknown'                 3         15.0
However, I do also suggest you read the comment from @Maxim.K about what really constitutes missing values.
To retain the original values, you can create new columns in which the missing-value codes are recoded as NA, for example:
    df <- transform(df, b.na = ifelse(b %in% c('77', '88', '99'), NA, b))
    # as.character() keeps the labels even if "a" was read in as a factor
    df <- transform(df, a.na = ifelse(a %in% c("Don't know/Not sure", "Unknown", "Refused"),
                                      NA, as.character(a)))
Then you can do something like this:
    table(df$b.na, df$g)

        C K M Y
      2 0 0 4 0
      3 0 2 0 2
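The benefit of keeping both columns can be sketched on a small toy vector (invented here for illustration; `toy` and `missing_codes` are not part of the data above): the recoded column drives the analysis, while the original column still tells you why each value is missing.

```r
# Toy data, made up for this sketch only.
toy <- data.frame(b = c(1, 77, 2, 99, 77, 3))
missing_codes <- c(77, 88, 99)

# Recoded copy: missing-value codes become NA, the original stays untouched.
toy$b.na <- ifelse(toy$b %in% missing_codes, NA, toy$b)

# Analysis uses the recoded column ...
mean_valid <- mean(toy$b.na, na.rm = TRUE)

# ... while the original column still records the reason for each NA.
reason_counts <- table(toy$b[is.na(toy$b.na)])

print(mean_valid)
print(reason_counts)
```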
Another option, which avoids creating new columns, is to use the exclude argument of table to drop the unwanted values from the result (this is different from treating them as missing values):
    table(df$a, df$g, exclude=c('77', '88', '99', "Don't know/Not sure", "Unknown", "Refused"))

            C K M Y
      Blue  0 0 1 2
      Green 2 1 2 0
      Red   1 0 0 3
You can define some global constants (even though this is not recommended) to group your "missing values", and use them in the rest of your program. Something like this:
    B_MISSING <- c('77', '88', '99')
    A_MISSING <- c("Don't know/Not sure", "Unknown", "Refused")
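As a rough sketch of how such constants might be used later on, the snippet below applies them to a small toy data frame (the `toy` data and the `any_missing` column are invented for illustration):

```r
# Toy data, made up for this sketch only.
toy <- data.frame(
  a = c("Blue", "Refused", "Red", "Unknown", "Blue"),
  b = c(1, 77, 2, 99, 3),
  stringsAsFactors = FALSE
)

B_MISSING <- c('77', '88', '99')
A_MISSING <- c("Don't know/Not sure", "Unknown", "Refused")

# Count only the substantive answers in "a".
valid_counts <- table(toy$a, exclude = A_MISSING)

# Flag rows where either variable carries a missing-value code.
# %in% compares via character here, so the numeric 77 matches '77'.
toy$any_missing <- toy$a %in% A_MISSING | toy$b %in% B_MISSING

print(valid_counts)
print(toy)
```

Keeping the codes in one place means a change to the coding scheme only has to be made once.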
If you are willing to stick to numeric values, then NA, Inf, -Inf, and NaN could be used for different missing values. You can then use is.finite to distinguish between them and normal values:
    x <- c(NA, Inf, -Inf, NaN, 1)
    is.finite(x)
    ## [1] FALSE FALSE FALSE FALSE  TRUE
is.infinite, is.nan and is.na are also useful here.
We could have a special print function that displays them in a more meaningful way, or even create a special class, but even without that, the above would divide the data into finite values and several distinct kinds of non-finite values.
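A minimal sketch of such a display helper is shown below. The function name `describe_missing` and the mapping from special values to reasons are assumptions made up for illustration; adjust them to your own codebook. Bear in mind that ordinary arithmetic can also produce these values (for example 1/0 gives Inf and 0/0 gives NaN), so this scheme is only safe for columns you do not compute on.

```r
# Hypothetical helper: the reason labels below are an assumed mapping,
# not a convention of R itself.
describe_missing <- function(x) {
  out <- as.character(x)
  out[is.nan(x)] <- "<refused>"
  out[is.na(x) & !is.nan(x)] <- "<unknown>"      # is.na() is also TRUE for NaN
  out[is.infinite(x) & x > 0] <- "<don't know>"
  out[is.infinite(x) & x < 0] <- "<not applicable>"
  out
}

x <- c(NA, Inf, -Inf, NaN, 1)
labelled <- describe_missing(x)
print(labelled)
```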