Why is the parallel package slower than just using apply?

r parallel-processing

Running jobs in parallel incurs overhead. Only if the jobs you fire at the worker nodes take a significant amount of time does parallelization improve overall performance. When the individual jobs take only milliseconds, the overhead of constantly firing off jobs will deteriorate overall performance. The trick is to divide the work over the nodes in such a way that the jobs are sufficiently long, say at least a few seconds. I used this to great effect running six Fortran models simultaneously, but these individual model runs took hours, almost negating the effect of overhead.

Note that I haven't run your example, but the situation I describe above is often the issue when parallization takes longer than running sequentially.

r parallel-processing

These differences can be attributed to 1) communication overhead (especially if you run across nodes) and 2) performance overhead (if your job is not that intensive compared to initiating a parallelisation, for example). Usually, if the task you are parallelising is not that time-consuming, then you will mostly find that parallelisation does NOT have much of an effect (which is much highly visible on huge datasets.

Even though this may not directly answer your benchmarking, I hope this should be rather straightforward and can be related to. As an example, here, I construct a data.frame with 1e6 rows with 1e4 unique column group entries and some values in column val. And then I run using plyr in parallel using doMC and without parallelisation.

df <- data.frame(group = as.factor(sample(1:1e4, 1e6, replace = T)),                  val = sample(1:10, 1e6, replace = T))> head(df)  group val# 1  8498   8# 2  5253   6# 3  1495   1# 4  7362   9# 5  2344   6# 6  5602   9> dim(df)# [1] 1000000       2require(plyr)require(doMC)registerDoMC(20) # 20 processors# parallelisation using doMC + plyr P.PLYR <- function() {    o1 <- ddply(df, .(group), function(x) sum(x$val), .parallel = TRUE)}# no parallelisationPLYR <- function() {    o2 <- ddply(df, .(group), function(x) sum(x$val), .parallel = FALSE)}require(rbenchmark)benchmark(P.PLYR(), PLYR(), replications = 2, order = "elapsed")      test replications elapsed relative user.self sys.self user.child sys.child2   PLYR()            2   8.925    1.000     8.865    0.068      0.000     0.0001 P.PLYR()            2  30.637    3.433    15.841   13.945      8.944    38.858

As you can see, the parallel version of plyr runs 3.5 times slower

Now, let me use the same data.frame, but instead of computing sum, let me construct a bit more demanding function, say, median(.) * median(rnorm(1e4) ((meaningless, yes):

You'll see that the tides are beginning to shift:

# parallelisation using doMC + plyr P.PLYR <- function() {    o1 <- ddply(df, .(group), function(x)       median(x$val) * median(rnorm(1e4)), .parallel = TRUE)}# no parallelisationPLYR <- function() {    o2 <- ddply(df, .(group), function(x)          median(x$val) * median(rnorm(1e4)), .parallel = FALSE)}> benchmark(P.PLYR(), PLYR(), replications = 2, order = "elapsed")      test replications elapsed relative user.self sys.self user.child sys.child1 P.PLYR()            2  41.911    1.000    15.265   15.369    141.585    34.2542   PLYR()            2  73.417    1.752    73.372    0.052      0.000     0.000

Here, the parallel version is 1.752 times faster than the non-parallel version.

Edit: Following @Paul's comment, I just implemented a small delay using Sys.sleep(). Of course the results are obvious. But just for the sake of completeness, here's the result on a 20*2 data.frame:

df <- data.frame(group=sample(letters[1:5], 20, replace=T), val=sample(20))# parallelisation using doMC + plyr P.PLYR <- function() {    o1 <- ddply(df, .(group), function(x) {    Sys.sleep(2)    median(x$val)    }, .parallel = TRUE)}# no parallelisationPLYR <- function() {    o2 <- ddply(df, .(group), function(x) {        Sys.sleep(2)        median(x$val)    }, .parallel = FALSE)}> benchmark(P.PLYR(), PLYR(), replications = 2, order = "elapsed")#       test replications elapsed relative user.self sys.self user.child sys.child# 1 P.PLYR()            2   4.116    1.000     0.056    0.056      0.024      0.04# 2   PLYR()            2  20.050    4.871     0.028    0.000      0.000      0.00

The difference here is not surprising.

r parallel-processing

Completely agree with @Arun and @PaulHiemestra arguments concerning Why...? part of your question.

However, it seems that you can take some benefits from parallel package in your situation (at least if you are not stuck with Windows). Possible solution is using mclapply instead of parSapply, which relies on fast forking and shared memory.

  tm2 <- system.time({    tm3 <- system.time({     df7 <- matrix(unlist(mclapply(df2$var1, FUN=function(x) {x==df1$var1}, mc.cores=8)), nrow=i)     dimnames(df7) <- list(row.names(df1), row.names(df2))    })  })

Of course, nested system.time is not needed here. With my 2 cores I got:

enter image description here

CodeHunter

Why is the parallel package slower than just using apply?

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last