Can dplyr package be used for conditional mutating? Can dplyr package be used for conditional mutating? r r

Can dplyr package be used for conditional mutating?


Use ifelse

df %>%  mutate(g = ifelse(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4), 2,               ifelse(a == 0 | a == 1 | a == 4 | a == 3 |  c == 4, 3, NA)))

Added - if_else: Note that in dplyr 0.5 there is an if_else function defined so an alternative would be to replace ifelse with if_else; however, note that since if_else is stricter than ifelse (both legs of the condition must have the same type) so the NA in that case would have to be replaced with NA_real_ .

df %>%  mutate(g = if_else(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4), 2,               if_else(a == 0 | a == 1 | a == 4 | a == 3 |  c == 4, 3, NA_real_)))

Added - case_when Since this question was posted dplyr has added case_when so another alternative would be:

df %>% mutate(g = case_when(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4) ~ 2,                            a == 0 | a == 1 | a == 4 | a == 3 |  c == 4 ~ 3,                            TRUE ~ NA_real_))

Added - arithmetic/na_if If the values are numeric and the conditions (except for the default value of NA at the end) are mutually exclusive, as is the case in the question, then we can use an arithmetic expression such that each term is multiplied by the desired result using na_if at the end to replace 0 with NA.

df %>%  mutate(g = 2 * (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)) +             3 * (a == 0 | a == 1 | a == 4 | a == 3 |  c == 4),         g = na_if(g, 0))


Since you ask for other better ways to handle the problem, here's another way using data.table:

require(data.table) ## 1.9.2+setDT(df)df[a %in% c(0,1,3,4) | c == 4, g := 3L]df[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]

Note the order of conditional statements is reversed to get g correctly. There's no copy of g made, even during the second assignment - it's replaced in-place.

On larger data this would have better performance than using nested if-else, as it can evaluate both 'yes' and 'no' cases, and nesting can get harder to read/maintain IMHO.


Here's a benchmark on relatively bigger data:

# R version 3.1.0require(data.table) ## 1.9.2require(dplyr)DT <- setDT(lapply(1:6, function(x) sample(7, 1e7, TRUE)))setnames(DT, letters[1:6])# > dim(DT) # [1] 10000000        6DF <- as.data.frame(DT)DT_fun <- function(DT) {    DT[(a %in% c(0,1,3,4) | c == 4), g := 3L]    DT[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]}DPLYR_fun <- function(DF) {    mutate(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L,             ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))}BASE_fun <- function(DF) { # R v3.1.0    transform(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L,             ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))}system.time(ans1 <- DT_fun(DT))#   user  system elapsed #  2.659   0.420   3.107 system.time(ans2 <- DPLYR_fun(DF))#   user  system elapsed # 11.822   1.075  12.976 system.time(ans3 <- BASE_fun(DF))#   user  system elapsed # 11.676   1.530  13.319 identical(as.data.frame(ans1), as.data.frame(ans2))# [1] TRUEidentical(as.data.frame(ans1), as.data.frame(ans3))# [1] TRUE

Not sure if this is an alternative you'd asked for, but I hope it helps.


dplyr now has a function case_when that offers a vectorised if. The syntax is a little strange compared to mosaic:::derivedFactor as you cannot access variables in the standard dplyr way, and need to declare the mode of NA, but it is considerably faster than mosaic:::derivedFactor.

df %>%mutate(g = case_when(a %in% c(2,5,7) | (a==1 & b==4) ~ 2L,                      a %in% c(0,1,3,4) | c == 4 ~ 3L,                      TRUE~as.integer(NA)))

EDIT: If you're using dplyr::case_when() from before version 0.7.0 of the package, then you need to precede variable names with '.$' (e.g. write .$a == 1 inside case_when).

Benchmark:For the benchmark (reusing functions from Arun 's post) and reducing sample size:

require(data.table) require(mosaic) require(dplyr)require(microbenchmark)set.seed(42) # To recreate the dataframeDT <- setDT(lapply(1:6, function(x) sample(7, 10000, TRUE)))setnames(DT, letters[1:6])DF <- as.data.frame(DT)DPLYR_case_when <- function(DF) {  DF %>%  mutate(g = case_when(a %in% c(2,5,7) | (a==1 & b==4) ~ 2L,                        a %in% c(0,1,3,4) | c==4 ~ 3L,                        TRUE~as.integer(NA)))}DT_fun <- function(DT) {  DT[(a %in% c(0,1,3,4) | c == 4), g := 3L]  DT[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]}DPLYR_fun <- function(DF) {  mutate(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L,                     ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))}mosa_fun <- function(DF) {  mutate(DF, g = derivedFactor(    "2" = (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)),    "3" = (a == 0 | a == 1 | a == 4 | a == 3 |  c == 4),    .method = "first",    .default = NA  ))}perf_results <- microbenchmark(  dt_fun <- DT_fun(copy(DT)),  dplyr_ifelse <- DPLYR_fun(copy(DF)),  dplyr_case_when <- DPLYR_case_when(copy(DF)),  mosa <- mosa_fun(copy(DF)),  times = 100L)

This gives:

print(perf_results)Unit: milliseconds           expr        min         lq       mean     median         uq        max neval         dt_fun   1.391402    1.560751   1.658337   1.651201   1.716851   2.383801   100   dplyr_ifelse   1.172601    1.230351   1.331538   1.294851   1.390351   1.995701   100dplyr_case_when   1.648201    1.768002   1.860968   1.844101   1.958801   2.207001   100           mosa 255.591301  281.158350 291.391586 286.549802 292.101601 545.880702   100