Parallel distance Matrix in R

r matrix parallel-processing distance spatial

The R package amap provides robust and parallelized functions for Clustering and Principal Component Analysis. Among these functions, Dist method offers what you are looking for: computes and returns the distance matrix in a parallel manner.

Dist(x, method = "euclidean", nbproc = 8)

The code above compute euclidean distance with 8 threads.

r matrix parallel-processing distance spatial

Here's the structure for one route you could go. It is not faster than just using the dist() function, instead taking many times longer. It does process in parallel, but even if the computation time were reduced to zero, the time to start up the function and export the variables to the cluster would probably be longer than just using dist()

library(parallel)vec.array <- matrix(rnorm(2000 * 100), nrow = 2000, ncol = 100)TaxiDistFun <- function(one.vec, whole.matrix) {    diff.matrix <- t(t(whole.matrix) - one.vec)    this.row <- apply(diff.matrix, 1, function(x) sum(abs(x)))    return(this.row)}cl <- makeCluster(detectCores())clusterExport(cl, list("vec.array", "TaxiDistFun"))system.time(dist.array <- parRapply(cl, vec.array,                        function(x) TaxiDistFun(x, vec.array)))stopCluster(cl)dim(dist.array) <- c(2000, 2000)

r matrix parallel-processing distance spatial

You can also use the parDist function of the parallelDist package, which is specifically built for parallelized distance matrix computations. Advantages are that the package is available on Mac OS, Windows and Linux and already supports 39 different distance measures (see parDist).

Performance comparison for manhattan distance (Sys spec: Mac OS; Intel Core i7 with 4 cores @ 2,5 GHz and hyperthreading enabled):

library(parallelDist)library(amap)library(wordspace)library(microbenchmark)set.seed(123)x <- matrix(rnorm(2000 * 100), nrow = 2000, ncol = 100)microbenchmark(parDist(x, method = "manhattan"),               Dist(x, method = "manhattan", nbproc = 8),               dist.matrix(x, method = "manhattan"),               times = 10)Unit: milliseconds                                      expr      min       lq     mean   median       uq      max neval          parDist(x, method = "manhattan") 210.9478 214.3557 225.5894 221.3705 237.9829 247.0844    10 Dist(x, method = "manhattan", nbproc = 8) 749.9397 755.7351 797.6349 812.6109 824.4075 844.1090    10      dist.matrix(x, method = "manhattan") 256.0831 263.3273 279.0864 275.1882 296.3256 311.3821    10

With a larger matrix:

x <- matrix(rnorm(10000 * 100), nrow = 10000, ncol = 100)microbenchmark(parDist(x, method = "manhattan"),+                Dist(x, method = "manhattan", nbproc = 8),+                dist.matrix(x, method = "manhattan"),+                times = 10)Unit: seconds                                      expr       min        lq      mean    median        uq       max neval          parDist(x, method = "manhattan")  6.298234  6.388501  6.737168  6.894203  6.947981  7.221661    10 Dist(x, method = "manhattan", nbproc = 8) 22.722947 24.113681 24.326157 24.477034 24.658145 25.301353    10      dist.matrix(x, method = "manhattan")  7.156861  7.505229  7.544352  7.567980  7.655624  7.800530    10

Further performance comparisons can be found in parallelDist's vignette.

CodeHunter

Parallel distance Matrix in R

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last