How to sum a variable by group How to sum a variable by group r r

How to sum a variable by group


Using aggregate:

aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)  Category  x1    First 302   Second  53    Third 34

In the example above, multiple dimensions can be specified in the list. Multiple aggregated metrics of the same data type can be incorporated via cbind:

aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...

(embedding @thelatemail comment), aggregate has a formula interface too

aggregate(Frequency ~ Category, x, sum)

Or if you want to aggregate multiple columns, you could use the . notation (works for one column too)

aggregate(. ~ Category, x, sum)

or tapply:

tapply(x$Frequency, x$Category, FUN=sum) First Second  Third     30      5     34 

Using this data:

x <- data.frame(Category=factor(c("First", "First", "First", "Second",                                      "Third", "Third", "Second")),                     Frequency=c(10,15,5,2,14,20,3))


You can also use the dplyr package for that purpose:

library(dplyr)x %>%   group_by(Category) %>%   summarise(Frequency = sum(Frequency))#Source: local data frame [3 x 2]##  Category Frequency#1    First        30#2   Second         5#3    Third        34

Or, for multiple summary columns (works with one column too):

x %>%   group_by(Category) %>%   summarise(across(everything(), sum))

Here are some more examples of how to summarise data by group using dplyr functions using the built-in dataset mtcars:

# several summary columns with arbitrary namesmtcars %>%   group_by(cyl, gear) %>%                            # multiple group columns  summarise(max_hp = max(hp), mean_mpg = mean(mpg))  # multiple summary columns# summarise all columns except grouping columns using "sum" mtcars %>%   group_by(cyl) %>%   summarise(across(everything(), sum))# summarise all columns except grouping columns using "sum" and "mean"mtcars %>%   group_by(cyl) %>%   summarise(across(everything(), list(mean = mean, sum = sum)))# multiple grouping columnsmtcars %>%   group_by(cyl, gear) %>%   summarise(across(everything(), list(mean = mean, sum = sum)))# summarise specific variables, not allmtcars %>%   group_by(cyl, gear) %>%   summarise(across(c(qsec, mpg, wt), list(mean = mean, sum = sum)))# summarise specific variables (numeric columns except grouping columns)mtcars %>%   group_by(gear) %>%   summarise(across(where(is.numeric), list(mean = mean, sum = sum)))

For more information, including the %>% operator, see the introduction to dplyr.


The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost there is a faster alternative:

library(data.table)data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"),                   Frequency=c(10,15,5,2,14,20,3))data[, sum(Frequency), by = Category]#    Category V1# 1:    First 30# 2:   Second  5# 3:    Third 34system.time(data[, sum(Frequency), by = Category] )# user    system   elapsed # 0.008     0.001     0.009 

Let's compare that to the same thing using data.frame and the above above:

data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),                  Frequency=c(10,15,5,2,14,20,3))system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))# user    system   elapsed # 0.008     0.000     0.015 

And if you want to keep the column this is the syntax:

data[,list(Frequency=sum(Frequency)),by=Category]#    Category Frequency# 1:    First        30# 2:   Second         5# 3:    Third        34

The difference will become more noticeable with larger datasets, as the code below demonstrates:

data = data.table(Category=rep(c("First", "Second", "Third"), 100000),                  Frequency=rnorm(100000))system.time( data[,sum(Frequency),by=Category] )# user    system   elapsed # 0.055     0.004     0.059 data = data.frame(Category=rep(c("First", "Second", "Third"), 100000),                   Frequency=rnorm(100000))system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )# user    system   elapsed # 0.287     0.010     0.296 

For multiple aggregations, you can combine lapply and .SD as follows

data[, lapply(.SD, sum), by = Category]#    Category Frequency# 1:    First        30# 2:   Second         5# 3:    Third        34