Multicore and memory usage in R under Ubuntu
Have you tried data.table?
> system.time(ans1 <- do.call("cbind", lapply(subset(sampdata, select = c(a:z)), function(x) tapply(x, sampdata$groupid, sum))))
   user  system elapsed
906.157  13.965 928.645
> require(data.table)
> DT = as.data.table(sampdata)
> setkey(DT, groupid)
> system.time(ans2 <- DT[, lapply(.SD, sum), by = groupid])
   user  system elapsed
186.920   1.056 191.582    # 4.8 times faster
> # massage minor diffs in results...
> ans2$groupid = NULL
> ans2 = as.matrix(ans2)
> colnames(ans2) = letters
> rownames(ans1) = NULL
> identical(ans1, ans2)
[1] TRUE
Your example is very interesting. It is reasonably large (200MB), there are many groups (1/2 million), and each group is very small (2 rows). The 191s can probably be improved by quite a lot, but at least it's a start. [March 2011]
And now this idiom (i.e. lapply(.SD, ...)) has been improved a lot. With v1.8.2, on a faster computer than the test above, and with the latest version of R, here is the updated comparison:
sampdata <- data.frame(id = 1:1000000)
for (letter in letters) sampdata[, letter] <- rnorm(1000000)
sampdata$groupid = ceiling(sampdata$id/2)
dim(sampdata)
# [1] 1000000      28

system.time(ans1 <- do.call("cbind", lapply(subset(sampdata, select = c(a:z)),
    function(x) tapply(x, sampdata$groupid, sum))))
#    user  system elapsed
#  224.57    3.62  228.54

DT = as.data.table(sampdata)
setkey(DT, groupid)
system.time(ans2 <- DT[, lapply(.SD, sum), by = groupid])
#    user  system elapsed
#   11.23    0.01   11.24    # 20 times faster

# massage minor diffs in results...
ans2[, groupid := NULL]
ans2[, id := NULL]
ans2 = as.matrix(ans2)
rownames(ans1) = NULL
identical(ans1, ans2)
# [1] TRUE
sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] data.table_1.8.2 RODBC_1.3-6
Things I've tried with 64-bit R on Ubuntu, ranked in order of success:
1. Work with fewer cores, as you are doing.
2. Split the mclapply jobs into pieces, and save the partial results to a database using DBI with append=TRUE (a sketch of this follows the list).
3. Use the rm function, along with gc(), often.
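To make item 2 concrete, here is a minimal sketch of the chunked approach, using the sampdata/groupid setup from the examples above. The use of RSQLite, the table name partial_results, the chunk size, and the per-group colSums are my own illustrative assumptions, not the original code:

library(parallel)
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "partial_results.sqlite")

group.ids <- unique(sampdata$groupid)
chunks <- split(group.ids, ceiling(seq_along(group.ids) / 50000))  # ~50k groups per chunk (assumed size)

for (chunk in chunks) {
  sub <- sampdata[sampdata$groupid %in% chunk, ]                   # work on one chunk of groups at a time
  res <- mclapply(split(sub, sub$groupid),
                  function(d) colSums(d[, letters]),               # per-group column sums, as in the examples above
                  mc.cores = 2)                                    # item 1: fewer cores
  out <- data.frame(groupid = as.integer(names(res)), do.call(rbind, res))
  dbWriteTable(con, "partial_results", out, append = TRUE)         # item 2: append partial results
  rm(sub, res, out); gc()                                          # item 3: release memory before the next chunk
}

dbDisconnect(con)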
I have tried all of these, and mclapply still begins to create larger and larger processes as it runs, leading me to suspect each process is holding onto some sort of residual memory it really doesn't need.
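For what it's worth, a small diagnostic can show whether the children really do grow. In this sketch (the rss_mb helper and the read of /proc/self/status are my own illustration, Linux-only, with stand-in work), each worker reports its resident set size after finishing its task:

library(parallel)

rss_mb <- function() {
  # Resident set size of the current process in MB, read from /proc (Linux only)
  line <- grep("^VmRSS:", readLines("/proc/self/status"), value = TRUE)
  as.numeric(gsub("[^0-9]", "", line)) / 1024
}

mem.per.task <- mclapply(1:8, function(i) {
  x <- rnorm(1e6)   # stand-in for the real per-task work
  sum(x)
  rss_mb()          # how large is this child after its task?
}, mc.cores = 2)

unlist(mem.per.task)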
P.S. I was using data.table, and it seems each child process copies the data.table.