Fast vectorized merge of list of data.frames by row
Try this:
    bind.ith.rows <- function(i) do.call(rbind, lapply(sample.list, "[", i, TRUE))
    nr <- nrow(sample.list[[1]])
    lapply(1:nr, bind.ith.rows)
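To see what this returns, here is a minimal sketch on a made-up three-element list. The names `toy.list`, `bind.ith.rows.toy` and `toy.result` are only for illustration and are not part of the question's data; the sketch assumes every data.frame has the same number of rows.

```r
# Tiny stand-in for sample.list: three data.frames with two rows each
# (toy.list and its values are invented for illustration)
toy.list <- list(
  data.frame(x = 1:2, y = c("a", "b")),
  data.frame(x = 3:4, y = c("c", "d")),
  data.frame(x = 5:6, y = c("e", "f"))
)

# Same idea as bind.ith.rows, written against toy.list:
# take row i (all columns) from every data.frame, then rbind the pieces
bind.ith.rows.toy <- function(i) do.call(rbind, lapply(toy.list, "[", i, TRUE))

toy.result <- lapply(seq_len(nrow(toy.list[[1]])), bind.ith.rows.toy)

toy.result[[1]]$x  # x values taken from row 1 of each data.frame: 1 3 5
toy.result[[2]]$x  # x values taken from row 2 of each data.frame: 2 4 6
```

The result is a list with one data.frame per row index, each having as many rows as there are elements in the input list.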
A couple of solutions that make this quicker using data.table.
EDIT: now with a larger dataset, showing data.table's awesomeness even more.
    # here are some sample data
    sample.list <- replicate(10000,
                             data.frame(x = sample(1:100, 10),
                                        y = sample(1:100, 10),
                                        capt = sample(0:1, 10, replace = TRUE)),
                             simplify = FALSE)
Gabor's fast solution:
    # Solution Gabor
    bind.ith.rows <- function(i) do.call(rbind, lapply(sample.list, "[", i, TRUE))
    nr <- nrow(sample.list[[1]])
    system.time(rowbound <- lapply(1:nr, bind.ith.rows))
    ##    user  system elapsed
    ##   25.87    0.01   25.92
The data.table function rbindlist makes this even quicker, even when working with data.frames:
    library(data.table)
    fastbind.ith.rows <- function(i) rbindlist(lapply(sample.list, "[", i, TRUE))
    system.time(fastbound <- lapply(1:nr, fastbind.ith.rows))
    ##    user  system elapsed
    ##   13.89    0.00   13.89
A data.table solution
Here is a solution that uses data.tables. It is a split solution on steroids: combine everything once, then pull out each row index by key.
    # data.table solution
    system.time({
        # change each element of sample.list to a data.table (and data.frame);
        # this is done instantaneously by reference
        invisible(lapply(sample.list, setattr, name = "class",
                         value = c("data.table", "data.frame")))
        # combine into a big data set
        bigdata <- rbindlist(sample.list)
        # add a row index column (by reference); the index is repeated once
        # per list element so that it matches the number of rows in bigdata
        index <- as.character(seq_len(nr))
        bigdata[, rowid := rep(index, times = length(sample.list))]
        # set the key for binary searches
        setkey(bigdata, rowid)
        # split on this key, one data.table per row index
        dt_list <- lapply(index, function(i, j, x) x[i = J(i)], x = bigdata)
        # if you want to drop the `rowid` column
        invisible(lapply(dt_list, function(x) set(x, j = "rowid", value = NULL)))
        # if you really don't want them to be data.tables run this line
        # invisible(lapply(dt_list, setattr, name = 'class', value = c('data.frame')))
    })
    ##    user  system elapsed
    ##    0.08    0.00    0.08
How awesome is data.table!
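The speed-up comes from binding once and then splitting, instead of binding once per row index. That idea can be sketched in base R without data.table. This is only an illustration on made-up data: `toy.list`, `nr.toy`, `big`, `rowid` and `split.result` are hypothetical names, the sketch assumes every data.frame has the same number of rows, and it makes no claim to match the timings above.

```r
# Tiny stand-in for sample.list (values invented for illustration)
toy.list <- list(
  data.frame(x = 1:2, capt = c(0L, 1L)),
  data.frame(x = 3:4, capt = c(1L, 1L)),
  data.frame(x = 5:6, capt = c(0L, 0L))
)
nr.toy <- nrow(toy.list[[1]])

# One rbind over the whole list instead of one rbind per row index
big <- do.call(rbind, toy.list)

# Row index within each original data.frame: 1, 2, 1, 2, 1, 2
rowid <- rep(seq_len(nr.toy), times = length(toy.list))

# Split the combined data into one data.frame per row index
split.result <- split(big, rowid)

split.result[["1"]]$x     # first rows of each data.frame: 1 3 5
split.result[["2"]]$capt  # capt from the second rows: 1 1 0
```

The data.table version above does the same thing, but replaces `split` with a keyed lookup on `rowid`.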
Caveat user with rbindlist
rbindlist is fast because it does not perform the checking that do.call(rbind, ...) does. For example, it assumes that any factor columns have the same levels as in the first element of the list.
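If you are unsure whether your factor columns line up, you can check before binding. Below is a minimal base-R sketch; `same.factor.levels` is a hypothetical helper written for this answer, not a data.table function, and `a` and `b` are invented examples.

```r
# Hypothetical helper (not part of data.table): TRUE when every data.frame
# in `dfs` has the same levels as the first one for each of the first
# data.frame's factor columns
same.factor.levels <- function(dfs) {
  first <- dfs[[1]]
  factor.cols <- names(first)[vapply(first, is.factor, logical(1))]
  all(vapply(dfs, function(d) {
    all(vapply(factor.cols, function(col) {
      identical(levels(d[[col]]), levels(first[[col]]))
    }, logical(1)))
  }, logical(1)))
}

a <- data.frame(f = factor(c("u", "v")))
b <- data.frame(f = factor(c("v", "w")))

same.factor.levels(list(a, a))  # TRUE: levels agree
same.factor.levels(list(a, b))  # FALSE: levels differ

# do.call(rbind, ...) reconciles the differing levels itself
levels(do.call(rbind, list(a, b))$f)  # "u" "v" "w"
```

When the check returns FALSE, do.call(rbind, ...) is the safer choice, or the levels can be harmonised first.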
Here's my attempt with plyr, but I like G. Grothendieck's approach:
    library(plyr)
    alply(do.call("cbind", sample.list), 1, .fun = matrix,
          ncol = ncol(sample.list[[1]]), byrow = TRUE,
          dimnames = list(1:length(sample.list), names(sample.list[[1]])))