Multicore and memory usage in R under Ubuntu
Have you tried data.table?
> system.time(ans1 <- do.call("cbind", lapply(subset(sampdata, select = c(a:z)), function(x) tapply(x, sampdata$groupid, sum))))
   user  system elapsed
906.157  13.965 928.645
> require(data.table)
> DT = as.data.table(sampdata)
> setkey(DT, groupid)
> system.time(ans2 <- DT[, lapply(.SD, sum), by = groupid])
   user  system elapsed
186.920   1.056 191.582    # 4.8 times faster
> # massage minor diffs in results...
> ans2$groupid = NULL
> ans2 = as.matrix(ans2)
> colnames(ans2) = letters
> rownames(ans1) = NULL
> identical(ans1, ans2)
[1] TRUE
Your example is very interesting. It is reasonably large (200MB), there are many groups (1/2 million), and each group is very small (2 rows). The 191s can probably be improved by quite a lot, but at least it's a start. [March 2011]
And now this idiom (i.e. lapply(.SD, ...)) has been improved a lot. With v1.8.2, on a faster computer than the test above, and with the latest version of R, here is the updated comparison:
sampdata <- data.frame(id = 1:1000000)
for (letter in letters) sampdata[, letter] <- rnorm(1000000)
sampdata$groupid = ceiling(sampdata$id/2)
dim(sampdata)
# [1] 1000000      28

system.time(ans1 <- do.call("cbind", lapply(subset(sampdata, select = c(a:z)),
    function(x) tapply(x, sampdata$groupid, sum))))
#    user  system elapsed
#  224.57    3.62  228.54

DT = as.data.table(sampdata)
setkey(DT, groupid)
system.time(ans2 <- DT[, lapply(.SD, sum), by = groupid])
#    user  system elapsed
#   11.23    0.01   11.24    # 20 times faster

# massage minor diffs in results...
ans2[, groupid := NULL]
ans2[, id := NULL]
ans2 = as.matrix(ans2)
rownames(ans1) = NULL
identical(ans1, ans2)
# [1] TRUE
sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] data.table_1.8.2 RODBC_1.3-6
Things I've tried with 64-bit R on Ubuntu, ranked in order of success:
1. Work with fewer cores, as you are doing.
2. Split the mclapply jobs into pieces, and save the partial results to a database using DBI with append=TRUE (a sketch of this follows the list).
3. Use the rm function, along with gc(), often.
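To make item 2 concrete, here is a minimal sketch of the chunked approach, using the sampdata/groupid setup from the examples above. The use of RSQLite, the table name partial_results, the chunk size, and the per-group colSums are my own illustrative assumptions, not the original code:

library(parallel)
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "partial_results.sqlite")

group.ids <- unique(sampdata$groupid)
chunks <- split(group.ids, ceiling(seq_along(group.ids) / 50000))  # ~50k groups per chunk (assumed size)

for (chunk in chunks) {
  sub <- sampdata[sampdata$groupid %in% chunk, ]                   # work on one chunk of groups at a time
  res <- mclapply(split(sub, sub$groupid),
                  function(d) colSums(d[, letters]),               # per-group column sums, as in the examples above
                  mc.cores = 2)                                    # item 1: fewer cores
  out <- data.frame(groupid = as.integer(names(res)), do.call(rbind, res))
  dbWriteTable(con, "partial_results", out, append = TRUE)         # item 2: append partial results
  rm(sub, res, out); gc()                                          # item 3: release memory before the next chunk
}

dbDisconnect(con)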
I have tried all of these, and mclapply still begins to create larger and larger processes as it runs, leading me to suspect each process is holding onto some sort of residual memory it really doesn't need.
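For what it's worth, a small diagnostic can show whether the children really do grow. In this sketch (the rss_mb helper and the read of /proc/self/status are my own illustration, Linux-only, with stand-in work), each worker reports its resident set size after finishing its task:

library(parallel)

rss_mb <- function() {
  # Resident set size of the current process in MB, read from /proc (Linux only)
  line <- grep("^VmRSS:", readLines("/proc/self/status"), value = TRUE)
  as.numeric(gsub("[^0-9]", "", line)) / 1024
}

mem.per.task <- mclapply(1:8, function(i) {
  x <- rnorm(1e6)   # stand-in for the real per-task work
  sum(x)
  rss_mb()          # how large is this child after its task?
}, mc.cores = 2)

unlist(mem.per.task)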
P.S. I was using data.table, and it seems each child process copies the data.table.