Parallel distance Matrix in R
The R package amap provides robust and parallelized functions for Clustering and Principal Component Analysis. Among these functions, Dist method offers what you are looking for: computes and returns the distance matrix in a parallel manner.
Dist(x, method = "euclidean", nbproc = 8)
The code above compute euclidean distance with 8 threads.
Here's the structure for one route you could go. It is not faster than just using the dist()
function, instead taking many times longer. It does process in parallel, but even if the computation time were reduced to zero, the time to start up the function and export the variables to the cluster would probably be longer than just using dist()
library(parallel)vec.array <- matrix(rnorm(2000 * 100), nrow = 2000, ncol = 100)TaxiDistFun <- function(one.vec, whole.matrix) { diff.matrix <- t(t(whole.matrix) - one.vec) this.row <- apply(diff.matrix, 1, function(x) sum(abs(x))) return(this.row)}cl <- makeCluster(detectCores())clusterExport(cl, list("vec.array", "TaxiDistFun"))system.time(dist.array <- parRapply(cl, vec.array, function(x) TaxiDistFun(x, vec.array)))stopCluster(cl)dim(dist.array) <- c(2000, 2000)
You can also use the parDist
function of the parallelDist package, which is specifically built for parallelized distance matrix computations. Advantages are that the package is available on Mac OS, Windows and Linux and already supports 39 different distance measures (see parDist).
Performance comparison for manhattan distance (Sys spec: Mac OS; Intel Core i7 with 4 cores @ 2,5 GHz and hyperthreading enabled):
library(parallelDist)library(amap)library(wordspace)library(microbenchmark)set.seed(123)x <- matrix(rnorm(2000 * 100), nrow = 2000, ncol = 100)microbenchmark(parDist(x, method = "manhattan"), Dist(x, method = "manhattan", nbproc = 8), dist.matrix(x, method = "manhattan"), times = 10)Unit: milliseconds expr min lq mean median uq max neval parDist(x, method = "manhattan") 210.9478 214.3557 225.5894 221.3705 237.9829 247.0844 10 Dist(x, method = "manhattan", nbproc = 8) 749.9397 755.7351 797.6349 812.6109 824.4075 844.1090 10 dist.matrix(x, method = "manhattan") 256.0831 263.3273 279.0864 275.1882 296.3256 311.3821 10
With a larger matrix:
x <- matrix(rnorm(10000 * 100), nrow = 10000, ncol = 100)microbenchmark(parDist(x, method = "manhattan"),+ Dist(x, method = "manhattan", nbproc = 8),+ dist.matrix(x, method = "manhattan"),+ times = 10)Unit: seconds expr min lq mean median uq max neval parDist(x, method = "manhattan") 6.298234 6.388501 6.737168 6.894203 6.947981 7.221661 10 Dist(x, method = "manhattan", nbproc = 8) 22.722947 24.113681 24.326157 24.477034 24.658145 25.301353 10 dist.matrix(x, method = "manhattan") 7.156861 7.505229 7.544352 7.567980 7.655624 7.800530 10
Further performance comparisons can be found in parallelDist
's vignette.