Aggregate / summarize multiple variables per group (e.g. sum, mean)
Yes, in your formula
, you can cbind
the numeric variables to be aggregated:
aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE) year month x1 x21 2000 1 7.862002 -7.4692982 2001 1 276.758209 474.3842523 2000 2 13.122369 -128.122613...23 2000 12 63.436507 449.79445424 2001 12 999.472226 922.726589
See ?aggregate
, the formula
argument and the examples.
Using the data.table
package, which is fast (useful for larger datasets)
https://github.com/Rdatatable/data.table/wiki
library(data.table)df2 <- setDT(df1)[, lapply(.SD, sum), by=.(year, month), .SDcols=c("x1","x2")]setDF(df2) # convert back to dataframe
Using the plyr package
require(plyr)df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))
Using summarize() from the Hmisc package (column headings are messy in my example though)
# need to detach plyr because plyr and Hmisc both have a summarize()detach(package:plyr)require(Hmisc)df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))
With the dplyr
package, you can use summarise_all
, summarise_at
or summarise_if
functions to aggregate multiple variables simultaneously. For the example dataset you can do this as follows:
library(dplyr)# summarising all non-grouping variablesdf2 <- df1 %>% group_by(year, month) %>% summarise_all(sum)# summarising a specific set of non-grouping variablesdf2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(x1, x2), sum)df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(-date), sum)# summarising a specific set of non-grouping variables using select_helpers# see ?select_helpers for more optionsdf2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(starts_with('x')), sum)df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(matches('.*[0-9]')), sum)# summarising a specific set of non-grouping variables based on condition (class)df2 <- df1 %>% group_by(year, month) %>% summarise_if(is.numeric, sum)
The result of the latter two options:
year month x1 x2 <dbl> <dbl> <dbl> <dbl>1 2000 1 -73.58134 -92.785952 2000 2 -57.81334 -152.369833 2000 3 122.68758 153.552434 2000 4 450.24980 285.563745 2000 5 678.37867 384.428886 2000 6 792.68696 530.286947 2000 7 908.58795 452.312228 2000 8 710.69928 719.352259 2000 9 725.06079 914.9368710 2000 10 770.60304 863.39337# ... with 14 more rows
Note: summarise_each
is deprecated in favor of summarise_all
, summarise_at
and summarise_if
.
As mentioned in my comment above, you can also use the recast
function from the reshape2
-package:
library(reshape2)recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
which will give you the same result.