How to sum a variable by group

Using aggregate:

aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)  Category  x1    First 302   Second  53    Third 34

In the example above, multiple dimensions can be specified in the list. Multiple aggregated metrics of the same data type can be incorporated via cbind:

aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...

(embedding @thelatemail comment), aggregate has a formula interface too

aggregate(Frequency ~ Category, x, sum)

Or if you want to aggregate multiple columns, you could use the . notation (works for one column too)

aggregate(. ~ Category, x, sum)

or tapply:

tapply(x$Frequency, x$Category, FUN=sum) First Second  Third     30      5     34

Using this data:

x <- data.frame(Category=factor(c("First", "First", "First", "Second",                                      "Third", "Third", "Second")),                     Frequency=c(10,15,5,2,14,20,3))

r dataframe aggregate r-faq

You can also use the dplyr package for that purpose:

library(dplyr)x %>%   group_by(Category) %>%   summarise(Frequency = sum(Frequency))#Source: local data frame [3 x 2]##  Category Frequency#1    First        30#2   Second         5#3    Third        34

Or, for multiple summary columns (works with one column too):

x %>%   group_by(Category) %>%   summarise(across(everything(), sum))

Here are some more examples of how to summarise data by group using dplyr functions using the built-in dataset mtcars:

# several summary columns with arbitrary namesmtcars %>%   group_by(cyl, gear) %>%                            # multiple group columns  summarise(max_hp = max(hp), mean_mpg = mean(mpg))  # multiple summary columns# summarise all columns except grouping columns using "sum" mtcars %>%   group_by(cyl) %>%   summarise(across(everything(), sum))# summarise all columns except grouping columns using "sum" and "mean"mtcars %>%   group_by(cyl) %>%   summarise(across(everything(), list(mean = mean, sum = sum)))# multiple grouping columnsmtcars %>%   group_by(cyl, gear) %>%   summarise(across(everything(), list(mean = mean, sum = sum)))# summarise specific variables, not allmtcars %>%   group_by(cyl, gear) %>%   summarise(across(c(qsec, mpg, wt), list(mean = mean, sum = sum)))# summarise specific variables (numeric columns except grouping columns)mtcars %>%   group_by(gear) %>%   summarise(across(where(is.numeric), list(mean = mean, sum = sum)))

For more information, including the %>% operator, see the introduction to dplyr.

r dataframe aggregate r-faq

The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost there is a faster alternative:

library(data.table)data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"),                   Frequency=c(10,15,5,2,14,20,3))data[, sum(Frequency), by = Category]#    Category V1# 1:    First 30# 2:   Second  5# 3:    Third 34system.time(data[, sum(Frequency), by = Category] )# user    system   elapsed # 0.008     0.001     0.009

Let's compare that to the same thing using data.frame and the above above:

data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),                  Frequency=c(10,15,5,2,14,20,3))system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))# user    system   elapsed # 0.008     0.000     0.015

And if you want to keep the column this is the syntax:

data[,list(Frequency=sum(Frequency)),by=Category]#    Category Frequency# 1:    First        30# 2:   Second         5# 3:    Third        34

The difference will become more noticeable with larger datasets, as the code below demonstrates:

data = data.table(Category=rep(c("First", "Second", "Third"), 100000),                  Frequency=rnorm(100000))system.time( data[,sum(Frequency),by=Category] )# user    system   elapsed # 0.055     0.004     0.059 data = data.frame(Category=rep(c("First", "Second", "Third"), 100000),                   Frequency=rnorm(100000))system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )# user    system   elapsed # 0.287     0.010     0.296

For multiple aggregations, you can combine lapply and .SD as follows

data[, lapply(.SD, sum), by = Category]#    Category Frequency# 1:    First        30# 2:   Second         5# 3:    Third        34

CodeHunter

How to sum a variable by group

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last