Summarizing multiple columns with dplyr? [duplicate]
In dplyr (>= 1.0.0) you may use across(everything()) in summarise to apply a function to all non-grouping variables:
library(dplyr)

df %>%
  group_by(grp) %>%
  summarise(across(everything(), list(mean)))
#> # A tibble: 3 x 5
#>     grp     a     b     c     d
#>   <int> <dbl> <dbl> <dbl> <dbl>
#> 1     1  3.08  2.98  2.98  2.91
#> 2     2  3.03  3.04  2.97  2.87
#> 3     3  2.85  2.95  2.95  3.06
Alternatively, the purrrlyr package provides the same functionality:
library(purrrlyr)

df %>%
  slice_rows("grp") %>%
  dmap(mean)
#> # A tibble: 3 x 5
#>     grp     a     b     c     d
#>   <int> <dbl> <dbl> <dbl> <dbl>
#> 1     1  3.08  2.98  2.98  2.91
#> 2     2  3.03  3.04  2.97  2.87
#> 3     3  2.85  2.95  2.95  3.06
Also don't forget about data.table (use keyby to sort the groups):
library(data.table)

setDT(df)[, lapply(.SD, mean), keyby = grp]
#>    grp        a        b        c        d
#> 1:   1 3.079412 2.979412 2.979412 2.914706
#> 2:   2 3.029126 3.038835 2.967638 2.873786
#> 3:   3 2.854701 2.948718 2.951567 3.062678
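If only some of the columns should be averaged, data.table's .SDcols argument restricts what .SD contains. A minimal sketch, assuming a small made-up data frame (the data and the choice of columns a and b are illustrative, not taken from the answer above):

```r
library(data.table)

set.seed(123)
# Illustrative data, not the data used in the answer above
df <- data.frame(
  a = sample(1:5, 30, replace = TRUE),
  b = sample(1:5, 30, replace = TRUE),
  c = sample(1:5, 30, replace = TRUE),
  grp = sample(1:3, 30, replace = TRUE)
)

# .SDcols limits .SD to the listed columns, so only a and b are summarised
res <- setDT(df)[, lapply(.SD, mean), keyby = grp, .SDcols = c("a", "b")]
print(res)
```

The result keeps the grouping column first, followed by one mean per selected column.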
Let's try to compare performance.
library(dplyr)
library(purrrlyr)
library(data.table)
library(bench)

set.seed(123)
n <- 10000
df <- data.frame(
  a = sample(1:5, n, replace = TRUE),
  b = sample(1:5, n, replace = TRUE),
  c = sample(1:5, n, replace = TRUE),
  d = sample(1:5, n, replace = TRUE),
  grp = sample(1:3, n, replace = TRUE)
)
dt <- setDT(df)

mark(
  dplyr = df %>% group_by(grp) %>% summarise(across(everything(), list(mean))),
  purrrlyr = df %>% slice_rows("grp") %>% dmap(mean),
  data.table = dt[, lapply(.SD, mean), keyby = grp],
  check = FALSE
)
#> # A tibble: 3 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr        2.81ms   2.85ms      328.        NA     17.3
#> 2 purrrlyr     7.96ms   8.04ms      123.        NA     24.5
#> 3 data.table 596.33µs 707.91µs     1409.        NA     10.3
In dplyr 0.7.4 we can summarize with summarize_at, summarize_all and summarize_if. The columns and functions are set through the .vars and .funs arguments (optionally with the vars and funs helpers), as in the code below. The name given to each function in .funs becomes the suffix of the summarized columns. In dplyr 0.7.4, summarise_each (and mutate_each) is already deprecated, so these functions should no longer be used; the call below fails:
options(scipen = 100, dplyr.width = Inf, dplyr.print_max = Inf)

library(dplyr)
packageVersion("dplyr")
# [1] ‘0.7.4’

set.seed(123)
df <- data_frame(
  a = sample(1:5, 10, replace=T),
  b = sample(1:5, 10, replace=T),
  c = sample(1:5, 10, replace=T),
  d = sample(1:5, 10, replace=T),
  grp = as.character(sample(1:3, 10, replace=T)) # For convenience, specify character type
)

df %>% group_by(grp) %>%
  summarise_each(.vars = letters[1:4], .funs = c(mean="mean"))
# `summarise_each()` is deprecated.
# Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
# To map `funs` over a selection of variables, use `summarise_at()`
# Error: Strings must match column names. Unknown columns: mean
Use the following code instead. All of these calls give the same result.
# summarise_at
df %>% group_by(grp) %>%
  summarise_at(.vars = letters[1:4], .funs = c(mean="mean"))

df %>% group_by(grp) %>%
  summarise_at(.vars = names(.)[1:4], .funs = c(mean="mean"))

df %>% group_by(grp) %>%
  summarise_at(.vars = vars(a,b,c,d), .funs = c(mean="mean"))

# summarise_all
df %>% group_by(grp) %>%
  summarise_all(.funs = c(mean="mean"))

# summarise_if
df %>% group_by(grp) %>%
  summarise_if(.predicate = function(x) is.numeric(x), .funs = funs(mean="mean"))

# A tibble: 3 x 5
#   grp   a_mean b_mean c_mean d_mean
#   <chr>  <dbl>  <dbl>  <dbl>  <dbl>
# 1 1       2.80   3.00    3.6   3.00
# 2 2       4.25   2.75    4.0   3.75
# 3 3       3.00   5.00    1.0   2.00
You can also have multiple functions.
df %>% group_by(grp) %>%
  summarise_at(.vars = letters[1:2], .funs = c(Mean="mean", Sd="sd"))

# A tibble: 3 x 5
#   grp   a_Mean b_Mean      a_Sd     b_Sd
#   <chr>  <dbl>  <dbl>     <dbl>    <dbl>
# 1 1       2.80   3.00 1.4832397 1.870829
# 2 2       4.25   2.75 0.9574271 1.258306
# 3 3       3.00   5.00        NA       NA
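On dplyr 1.0.0 or later, where across() replaces the scoped verbs, a rough equivalent of the multi-function call might look like the sketch below. The data frame is rebuilt here (with tibble instead of data_frame) only to keep the block self-contained, and the explicit .names pattern is an illustrative choice to reproduce the a_Mean style of suffix; it is not part of the 0.7.4 answer.

```r
library(dplyr)

set.seed(123)
# Illustrative data; values will differ from the 0.7.4 output above
df <- tibble(
  a = sample(1:5, 10, replace = TRUE),
  b = sample(1:5, 10, replace = TRUE),
  grp = as.character(sample(1:3, 10, replace = TRUE))
)

# A named list of functions supplies the suffixes; .names controls the pattern
res <- df %>%
  group_by(grp) %>%
  summarise(across(c(a, b), list(Mean = mean, Sd = sd), .names = "{.col}_{.fn}"))

print(res)
```

Note that across() orders the output by column first (a_Mean, a_Sd, b_Mean, b_Sd), whereas summarise_at orders it by function first.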
You can simply pass more arguments to summarise:
df %>% group_by(grp) %>% summarise(mean(a), mean(b), mean(c), mean(d))
Source: local data frame [3 x 5]

  grp  mean(a)  mean(b)  mean(c) mean(d)
1   1 2.500000 3.500000 2.000000     3.0
2   2 3.800000 3.200000 3.200000     2.8
3   3 3.666667 3.333333 2.333333     3.0
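If the default column names such as mean(a) are awkward to work with, the arguments can be named. A small sketch, assuming made-up data and illustrative output names (mean_a, mean_b) that do not appear in the answer above:

```r
library(dplyr)

set.seed(123)
# Illustrative data, not the data behind the output above
df <- data.frame(
  a = sample(1:5, 15, replace = TRUE),
  b = sample(1:5, 15, replace = TRUE),
  grp = sample(1:3, 15, replace = TRUE)
)

# Each named argument becomes one output column with that name
res <- df %>%
  group_by(grp) %>%
  summarise(mean_a = mean(a), mean_b = mean(b))

print(res)
```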