Parallel distance Matrix in R Parallel distance Matrix in R r r

Parallel distance Matrix in R


The R package amap provides robust and parallelized functions for Clustering and Principal Component Analysis. Among these functions, Dist method offers what you are looking for: computes and returns the distance matrix in a parallel manner.

Dist(x, method = "euclidean", nbproc = 8)

The code above compute euclidean distance with 8 threads.


Here's the structure for one route you could go. It is not faster than just using the dist() function, instead taking many times longer. It does process in parallel, but even if the computation time were reduced to zero, the time to start up the function and export the variables to the cluster would probably be longer than just using dist()

library(parallel)vec.array <- matrix(rnorm(2000 * 100), nrow = 2000, ncol = 100)TaxiDistFun <- function(one.vec, whole.matrix) {    diff.matrix <- t(t(whole.matrix) - one.vec)    this.row <- apply(diff.matrix, 1, function(x) sum(abs(x)))    return(this.row)}cl <- makeCluster(detectCores())clusterExport(cl, list("vec.array", "TaxiDistFun"))system.time(dist.array <- parRapply(cl, vec.array,                        function(x) TaxiDistFun(x, vec.array)))stopCluster(cl)dim(dist.array) <- c(2000, 2000)


You can also use the parDist function of the parallelDist package, which is specifically built for parallelized distance matrix computations. Advantages are that the package is available on Mac OS, Windows and Linux and already supports 39 different distance measures (see parDist).

Performance comparison for manhattan distance (Sys spec: Mac OS; Intel Core i7 with 4 cores @ 2,5 GHz and hyperthreading enabled):

library(parallelDist)library(amap)library(wordspace)library(microbenchmark)set.seed(123)x <- matrix(rnorm(2000 * 100), nrow = 2000, ncol = 100)microbenchmark(parDist(x, method = "manhattan"),               Dist(x, method = "manhattan", nbproc = 8),               dist.matrix(x, method = "manhattan"),               times = 10)Unit: milliseconds                                      expr      min       lq     mean   median       uq      max neval          parDist(x, method = "manhattan") 210.9478 214.3557 225.5894 221.3705 237.9829 247.0844    10 Dist(x, method = "manhattan", nbproc = 8) 749.9397 755.7351 797.6349 812.6109 824.4075 844.1090    10      dist.matrix(x, method = "manhattan") 256.0831 263.3273 279.0864 275.1882 296.3256 311.3821    10

With a larger matrix:

x <- matrix(rnorm(10000 * 100), nrow = 10000, ncol = 100)microbenchmark(parDist(x, method = "manhattan"),+                Dist(x, method = "manhattan", nbproc = 8),+                dist.matrix(x, method = "manhattan"),+                times = 10)Unit: seconds                                      expr       min        lq      mean    median        uq       max neval          parDist(x, method = "manhattan")  6.298234  6.388501  6.737168  6.894203  6.947981  7.221661    10 Dist(x, method = "manhattan", nbproc = 8) 22.722947 24.113681 24.326157 24.477034 24.658145 25.301353    10      dist.matrix(x, method = "manhattan")  7.156861  7.505229  7.544352  7.567980  7.655624  7.800530    10

Further performance comparisons can be found in parallelDist's vignette.