Summarizing multiple columns with dplyr? [duplicate]


In dplyr (>= 1.0.0) you can use across(everything(), ...) inside summarise to apply a function to all non-grouping variables:

    library(dplyr)

    df %>%
      group_by(grp) %>%
      summarise(across(everything(), list(mean)))
    #> # A tibble: 3 x 5
    #>     grp     a     b     c     d
    #>   <int> <dbl> <dbl> <dbl> <dbl>
    #> 1     1  3.08  2.98  2.98  2.91
    #> 2     2  3.03  3.04  2.97  2.87
    #> 3     3  2.85  2.95  2.95  3.06
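If you need more than one summary per column, across() also accepts a named list of functions. The following is a minimal sketch, assuming dplyr >= 1.0.0; the small data frame is made up for illustration and is not the df used above. With a named list, the default output names combine the column name and the function name.

```r
library(dplyr)

set.seed(123)
# Hypothetical toy data (not the df from the answer above)
df <- data.frame(
  a = sample(1:5, 30, replace = TRUE),
  b = sample(1:5, 30, replace = TRUE),
  grp = rep(1:3, each = 10)
)

# A named list of functions yields one output column per column/function pair
res <- df %>%
  group_by(grp) %>%
  summarise(across(c(a, b), list(mean = mean, sd = sd)))

names(res)
```

Selecting columns explicitly with c(a, b) instead of everything() is handy when only some columns should be summarised.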

Alternatively, the purrrlyr package provides the same functionality:

    library(purrrlyr)

    df %>%
      slice_rows("grp") %>%
      dmap(mean)
    #> # A tibble: 3 x 5
    #>     grp     a     b     c     d
    #>   <int> <dbl> <dbl> <dbl> <dbl>
    #> 1     1  3.08  2.98  2.98  2.91
    #> 2     2  3.03  3.04  2.97  2.87
    #> 3     3  2.85  2.95  2.95  3.06

Also don't forget about data.table (use keyby instead of by to get the groups sorted):

    library(data.table)

    setDT(df)[, lapply(.SD, mean), keyby = grp]
    #>    grp        a        b        c        d
    #> 1:   1 3.079412 2.979412 2.979412 2.914706
    #> 2:   2 3.029126 3.038835 2.967638 2.873786
    #> 3:   3 2.854701 2.948718 2.951567 3.062678
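When the table also holds columns that should not be averaged, .SDcols restricts what .SD contains. A small sketch with made-up data (the table and the extra character column `label` are hypothetical, added only to show why the restriction matters):

```r
library(data.table)

# Hypothetical toy table with one non-numeric column
dt <- data.table(
  a = c(1, 2, 3, 4),
  b = c(10, 20, 30, 40),
  label = c("w", "x", "y", "z"),
  grp = c(2L, 2L, 1L, 1L)
)

# Only a and b are passed to lapply(); label is left out of .SD
res <- dt[, lapply(.SD, mean), keyby = grp, .SDcols = c("a", "b")]
res
```

Without .SDcols, mean() would also be applied to `label`, which returns NA with a warning.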

Let's try to compare performance.

    library(dplyr)
    library(purrrlyr)
    library(data.table)
    library(bench)

    set.seed(123)
    n <- 10000
    df <- data.frame(
      a = sample(1:5, n, replace = TRUE),
      b = sample(1:5, n, replace = TRUE),
      c = sample(1:5, n, replace = TRUE),
      d = sample(1:5, n, replace = TRUE),
      grp = sample(1:3, n, replace = TRUE)
    )
    dt <- setDT(df)

    mark(
      dplyr = df %>% group_by(grp) %>% summarise(across(everything(), list(mean))),
      purrrlyr = df %>% slice_rows("grp") %>% dmap(mean),
      data.table = dt[, lapply(.SD, mean), keyby = grp],
      check = FALSE
    )
    #> # A tibble: 3 x 6
    #>   expression      min   median `itr/sec` mem_alloc `gc/sec`
    #>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
    #> 1 dplyr        2.81ms   2.85ms      328.        NA     17.3
    #> 2 purrrlyr     7.96ms   8.04ms      123.        NA     24.5
    #> 3 data.table 596.33µs 707.91µs     1409.        NA     10.3


In dplyr 0.7.4 you can summarise with summarise_at, summarise_all and summarise_if. The columns and functions are set with the .vars and .funs arguments, as in the code below. The name on the left-hand side of each .funs entry becomes the suffix of the summarised columns. In dplyr 0.7.4, summarise_each (and mutate_each) is already deprecated, so a call like this one no longer works:

    options(scipen = 100, dplyr.width = Inf, dplyr.print_max = Inf)
    library(dplyr)
    packageVersion("dplyr")
    # [1] ‘0.7.4’

    set.seed(123)
    df <- data_frame(
      a = sample(1:5, 10, replace=T),
      b = sample(1:5, 10, replace=T),
      c = sample(1:5, 10, replace=T),
      d = sample(1:5, 10, replace=T),
      grp = as.character(sample(1:3, 10, replace=T)) # For convenience, specify character type
    )

    df %>% group_by(grp) %>%
      summarise_each(.vars = letters[1:4],
                     .funs = c(mean="mean"))
    # `summarise_each()` is deprecated.
    # Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
    # To map `funs` over a selection of variables, use `summarise_at()`
    # Error: Strings must match column names. Unknown columns: mean

Use one of the following calls instead. They all give the same result.

    # summarise_at
    df %>% group_by(grp) %>%
      summarise_at(.vars = letters[1:4],
                   .funs = c(mean="mean"))

    df %>% group_by(grp) %>%
      summarise_at(.vars = names(.)[1:4],
                   .funs = c(mean="mean"))

    df %>% group_by(grp) %>%
      summarise_at(.vars = vars(a,b,c,d),
                   .funs = c(mean="mean"))

    # summarise_all
    df %>% group_by(grp) %>%
      summarise_all(.funs = c(mean="mean"))

    # summarise_if
    df %>% group_by(grp) %>%
      summarise_if(.predicate = function(x) is.numeric(x),
                   .funs = funs(mean="mean"))

    # A tibble: 3 x 5
    # grp a_mean b_mean c_mean d_mean
    # <chr>  <dbl>  <dbl>  <dbl>  <dbl>
    # 1     1   2.80   3.00    3.6   3.00
    # 2     2   4.25   2.75    4.0   3.75
    # 3     3   3.00   5.00    1.0   2.00
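For readers on dplyr >= 1.0.0, where the scoped verbs are superseded by across(), the three calls above can be approximated as follows. This is a sketch under that version assumption, using a made-up tibble rather than the df above; the named list list(mean = mean) is what produces the _mean suffix here.

```r
library(dplyr)

# Hypothetical toy data mirroring the shape of the example above
df <- tibble(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(6, 5, 4, 3, 2, 1),
  grp = c("1", "1", "2", "2", "3", "3")
)

by_grp <- df %>% group_by(grp)

# Rough counterpart of summarise_at(): an explicit column selection
res_at <- by_grp %>% summarise(across(c(a, b), list(mean = mean)))

# Rough counterpart of summarise_all(): every non-grouping column
res_all <- by_grp %>% summarise(across(everything(), list(mean = mean)))

# Rough counterpart of summarise_if(): columns chosen by a predicate
res_if <- by_grp %>% summarise(across(where(is.numeric), list(mean = mean)))
```

On this toy data all three results have the same columns, because every non-grouping column is numeric.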

You can also have multiple functions.

    df %>% group_by(grp) %>%
      summarise_at(.vars = letters[1:2],
                   .funs = c(Mean="mean", Sd="sd"))

    # A tibble: 3 x 5
    # grp a_Mean b_Mean      a_Sd     b_Sd
    # <chr>  <dbl>  <dbl>     <dbl>    <dbl>
    # 1     1   2.80   3.00 1.4832397 1.870829
    # 2     2   4.25   2.75 0.9574271 1.258306
    # 3     3   3.00   5.00        NA       NA


You can simply pass more arguments to summarise:

    df %>% group_by(grp) %>% summarise(mean(a), mean(b), mean(c), mean(d))

    Source: local data frame [3 x 5]

      grp  mean(a)  mean(b)  mean(c) mean(d)
    1   1 2.500000 3.500000 2.000000     3.0
    2   2 3.800000 3.200000 3.200000     2.8
    3   3 3.666667 3.333333 2.333333     3.0
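As the output shows, unnamed expressions end up as column names such as mean(a), which are awkward to refer to later. Naming each argument avoids that. A short sketch with made-up data (the data frame and the names mean_a and mean_b are only illustrative):

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(10, 20, 30, 40),
  grp = c(1, 1, 2, 2)
)

# Each named argument becomes a column with that name
res <- df %>%
  group_by(grp) %>%
  summarise(mean_a = mean(a), mean_b = mean(b))
res
```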