How to sum a variable by group
Using aggregate
:
aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum) Category x1 First 302 Second 53 Third 34
In the example above, multiple dimensions can be specified in the list
. Multiple aggregated metrics of the same data type can be incorporated via cbind
:
aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...
(embedding @thelatemail comment), aggregate
has a formula interface too
aggregate(Frequency ~ Category, x, sum)
Or if you want to aggregate multiple columns, you could use the .
notation (works for one column too)
aggregate(. ~ Category, x, sum)
or tapply
:
tapply(x$Frequency, x$Category, FUN=sum) First Second Third 30 5 34
Using this data:
x <- data.frame(Category=factor(c("First", "First", "First", "Second", "Third", "Third", "Second")), Frequency=c(10,15,5,2,14,20,3))
You can also use the dplyr package for that purpose:
library(dplyr)x %>% group_by(Category) %>% summarise(Frequency = sum(Frequency))#Source: local data frame [3 x 2]## Category Frequency#1 First 30#2 Second 5#3 Third 34
Or, for multiple summary columns (works with one column too):
x %>% group_by(Category) %>% summarise(across(everything(), sum))
Here are some more examples of how to summarise data by group using dplyr functions using the built-in dataset mtcars
:
# several summary columns with arbitrary namesmtcars %>% group_by(cyl, gear) %>% # multiple group columns summarise(max_hp = max(hp), mean_mpg = mean(mpg)) # multiple summary columns# summarise all columns except grouping columns using "sum" mtcars %>% group_by(cyl) %>% summarise(across(everything(), sum))# summarise all columns except grouping columns using "sum" and "mean"mtcars %>% group_by(cyl) %>% summarise(across(everything(), list(mean = mean, sum = sum)))# multiple grouping columnsmtcars %>% group_by(cyl, gear) %>% summarise(across(everything(), list(mean = mean, sum = sum)))# summarise specific variables, not allmtcars %>% group_by(cyl, gear) %>% summarise(across(c(qsec, mpg, wt), list(mean = mean, sum = sum)))# summarise specific variables (numeric columns except grouping columns)mtcars %>% group_by(gear) %>% summarise(across(where(is.numeric), list(mean = mean, sum = sum)))
For more information, including the %>%
operator, see the introduction to dplyr.
The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost there is a faster alternative:
library(data.table)data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"), Frequency=c(10,15,5,2,14,20,3))data[, sum(Frequency), by = Category]# Category V1# 1: First 30# 2: Second 5# 3: Third 34system.time(data[, sum(Frequency), by = Category] )# user system elapsed # 0.008 0.001 0.009
Let's compare that to the same thing using data.frame and the above above:
data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"), Frequency=c(10,15,5,2,14,20,3))system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))# user system elapsed # 0.008 0.000 0.015
And if you want to keep the column this is the syntax:
data[,list(Frequency=sum(Frequency)),by=Category]# Category Frequency# 1: First 30# 2: Second 5# 3: Third 34
The difference will become more noticeable with larger datasets, as the code below demonstrates:
data = data.table(Category=rep(c("First", "Second", "Third"), 100000), Frequency=rnorm(100000))system.time( data[,sum(Frequency),by=Category] )# user system elapsed # 0.055 0.004 0.059 data = data.frame(Category=rep(c("First", "Second", "Third"), 100000), Frequency=rnorm(100000))system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )# user system elapsed # 0.287 0.010 0.296
For multiple aggregations, you can combine lapply
and .SD
as follows
data[, lapply(.SD, sum), by = Category]# Category Frequency# 1: First 30# 2: Second 5# 3: Third 34