
Multicore and memory usage in R under Ubuntu


Have you tried data.table?

> system.time(ans1 <- do.call("cbind",
+     lapply(subset(sampdata, select = c(a:z)),
+            function(x) tapply(x, sampdata$groupid, sum))))
   user  system elapsed
906.157  13.965 928.645
> require(data.table)
> DT = as.data.table(sampdata)
> setkey(DT, groupid)
> system.time(ans2 <- DT[, lapply(.SD, sum), by = groupid])
   user  system elapsed
186.920   1.056 191.582                # 4.8 times faster
> # massage minor diffs in results...
> ans2$groupid = NULL
> ans2 = as.matrix(ans2)
> colnames(ans2) = letters
> rownames(ans1) = NULL
> identical(ans1, ans2)
[1] TRUE

Your example is very interesting. It is reasonably large (200MB), there are many groups (1/2 million), and each group is very small (2 rows). The 191s can probably be improved by quite a lot, but at least it's a start. [March 2011]


And now this idiom (i.e. lapply(.SD, ...)) has been improved a lot. With v1.8.2, on a faster computer than the test above, and with a current version of R, here is the updated comparison:

sampdata <- data.frame(id = 1:1000000)
for (letter in letters) sampdata[, letter] <- rnorm(1000000)
sampdata$groupid = ceiling(sampdata$id/2)
dim(sampdata)
# [1] 1000000      28

system.time(ans1 <- do.call("cbind",
  lapply(subset(sampdata, select = c(a:z)), function(x)
    tapply(x, sampdata$groupid, sum))))
#   user  system elapsed
# 224.57    3.62  228.54

DT = as.data.table(sampdata)
setkey(DT, groupid)
system.time(ans2 <- DT[, lapply(.SD, sum), by = groupid])
#   user  system elapsed
#  11.23    0.01   11.24                # 20 times faster

# massage minor diffs in results...
ans2[, groupid := NULL]
ans2[, id := NULL]
ans2 = as.matrix(ans2)
rownames(ans1) = NULL
identical(ans1, ans2)
# [1] TRUE
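As an aside, a minimal sketch of the same idiom using .SDcols (assuming a reasonably current data.table), which restricts .SD to the measure columns up front so the id/groupid columns never need massaging out of the result. The toy table here stands in for sampdata:

```r
library(data.table)

# Toy table standing in for sampdata: 3 groups of 2 rows each.
DT <- data.table(groupid = rep(1:3, each = 2), a = 1:6, b = 101:106)

# .SDcols limits .SD to the measure columns, so the grouping column
# is not summed and no cleanup is needed afterwards.
ans <- DT[, lapply(.SD, sum), by = groupid, .SDcols = c("a", "b")]
```

With the full 26-column sampdata the equivalent would be .SDcols = letters.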


sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252   LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252  LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base
other attached packages:
[1] data.table_1.8.2 RODBC_1.3-6


Things I've tried on Ubuntu 64 bit R, ranked in order of success:

  • Work with fewer cores, as you are doing.

  • Split the mclapply jobs into pieces, and save the partial results to a database using DBI with append=TRUE.

  • Use rm() on large objects, followed by gc(), often.
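The second bullet can be sketched roughly as follows. This is my own illustration, not the original code: the names process_in_chunks and write_part are hypothetical, and write_part stands in for a DBI call such as function(d) DBI::dbWriteTable(con, "results", d, append = TRUE):

```r
library(parallel)

# Process job ids in chunks, writing each chunk's results out before
# starting the next, so memory stays roughly flat and a crash loses
# at most one chunk. `write_part` is the persistence callback.
process_in_chunks <- function(ids, fun, write_part,
                              chunk_size = 1000L, cores = 2L) {
  chunks <- split(ids, ceiling(seq_along(ids) / chunk_size))
  for (ch in chunks) {
    part <- mclapply(ch, fun, mc.cores = cores)
    write_part(do.call(rbind, part))
    rm(part); gc()          # release the chunk before the next one
  }
}
```

Whether gc() here actually shrinks the child processes is exactly the open question in this thread; the chunking at least bounds how much any one mclapply call holds at once.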

I have tried all of these, yet mclapply's child processes still grow larger and larger as it runs, which leads me to suspect each process is holding onto residual memory it really doesn't need.

P.S. I was using data.table, and it seems each child process copies the data.table.
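One workaround worth sketching (my own assumption, not something established in this thread): on Linux, forked children share the parent's pages copy-on-write, so passing each worker only a vector of row indices and subsetting inside the worker materializes just that worker's slice rather than a full copy of the table. Shown here with a plain data.frame; the same shape applies to a data.table:

```r
library(parallel)

big <- data.frame(groupid = rep(1:1000, each = 2), x = rnorm(2000))

# Pass only the index vector to each worker; `big` itself is reached
# through the forked (copy-on-write) parent image, and each child
# allocates only its own subset.
idx_by_group <- split(seq_len(nrow(big)), big$groupid)
sums <- mclapply(idx_by_group,
                 function(idx) sum(big$x[idx]),
                 mc.cores = 1L)   # raise on Linux; 1 keeps this portable
```

If the children are copying the whole data.table regardless, this pattern at least keeps the per-worker allocations proportional to the group size.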