Computing sparse pairwise distance matrix in R

Well, we can't have you resorting to for-loops, now can we :)

There is of course the question of how to represent the sparse matrix. A simple way is to have it only contain the indices of the points that are closest (and recalculate as needed). But in the solution below, I put both distance ('d1' etc) and index ('i1' etc) in a single matrix:

    sparseDist <- function(m, k) {
        m <- t(m)
        n <- ncol(m)
        d <- vapply( seq_len(n-1L), function(i) {
            d<-colSums((m[, seq(i+1L, n), drop=FALSE]-m[,i])^2)
            o<-sort.list(d, na.last=NA, method='quick')[seq_len(k)]
            c(sqrt(d[o]), o+i)
            }, numeric(2*k)
        )
        dimnames(d) <- list(c(paste('d', seq_len(k), sep=''),
            paste('i', seq_len(k), sep='')), colnames(m)[-n])
        d
    }

Trying it out on 9 2d-points:

    > m <- matrix(c(0,0, 1.1,0, 2,0, 0,1.2, 1.1,1.2, 2,1.2, 0,2, 1.1,2, 2,2),
                  9, byrow=TRUE, dimnames=list(letters[1:9], letters[24:25]))
    > print(dist(m), digits=2)
        a   b   c   d   e   f   g   h
    b 1.1
    c 2.0 0.9
    d 1.2 1.6 2.3
    e 1.6 1.2 1.5 1.1
    f 2.3 1.5 1.2 2.0 0.9
    g 2.0 2.3 2.8 0.8 1.4 2.2
    h 2.3 2.0 2.2 1.4 0.8 1.2 1.1
    i 2.8 2.2 2.0 2.2 1.2 0.8 2.0 0.9
    > print(sparseDist(m, 3), digits=2)
         a   b   c   d   e   f   g   h
    d1 1.1 0.9 1.2 0.8 0.8 0.8 1.1 0.9
    d2 1.2 1.2 1.5 1.1 0.9 1.2 2.0  NA
    d3 1.6 1.5 2.0 1.4 1.2 2.2  NA  NA
    i1 2.0 3.0 6.0 7.0 8.0 9.0 8.0 9.0
    i2 4.0 5.0 5.0 5.0 6.0 8.0 9.0  NA
    i3 5.0 6.0 9.0 8.0 9.0 7.0  NA  NA
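If you would rather have the result in a long (from, to, distance) form, here is a minimal sketch of how the combined matrix could be unpacked. The helper name `toTriplets` is made up for illustration, and it assumes the layout used above: the first k rows hold distances and the last k rows hold the matching indices. The small input matrix is hand-made, not real output.

```r
# Hypothetical helper (not part of the answer above): unpack a 2k-row
# result into a long (from, to, dist) table. Assumes rows 1..k are
# distances and rows k+1..2k are the matching point indices.
toTriplets <- function(res) {
    k     <- nrow(res) / 2L
    dists <- res[seq_len(k), , drop=FALSE]
    idx   <- res[k + seq_len(k), , drop=FALSE]
    keep  <- !is.na(idx)               # drop the NA padding
    data.frame(from = col(idx)[keep],  # column number = source point
               to   = as.integer(idx[keep]),
               dist = dists[keep])
}

# Hand-made input in the same layout (k = 2, three columns)
res <- rbind(d1 = c(1.0, 0.5, 2.0),
             d2 = c(1.5, 2.0,  NA),
             i1 = c(2,   3,   4),
             i2 = c(3,   4,   NA))
trip <- toTriplets(res)
print(trip)
```

A table like this is the usual starting point if you later want to hand the pairs to a sparse-matrix constructor.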

And here it is on a larger problem (10k points). On 100k points and more dimensions it will still take a long time, though (something like 15-30 minutes).

    n<-1e4; m<-3; m=matrix(runif(n*m), n)
    system.time( d <- sparseDist(m, 3) ) # 9 seconds on my machine...

P.S. Just noted that you posted an answer as I was writing this: the solution here is roughly twice as fast because it doesn't calculate the same distance twice (the distance between points 1 and 13 is the same as between points 13 and 1).
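One consequence of computing each pair only once is that column i lists only neighbours with a larger index than i. A quick sanity check against dist() might look like the sketch below. `sparseDist` is copied unchanged from this answer; the random test data and the variable names `full` and `ok` are only for illustration.

```r
# sparseDist() copied unchanged from the answer
sparseDist <- function(m, k) {
    m <- t(m)
    n <- ncol(m)
    d <- vapply( seq_len(n-1L), function(i) {
        d<-colSums((m[, seq(i+1L, n), drop=FALSE]-m[,i])^2)
        o<-sort.list(d, na.last=NA, method='quick')[seq_len(k)]
        c(sqrt(d[o]), o+i)
        }, numeric(2*k)
    )
    dimnames(d) <- list(c(paste('d', seq_len(k), sep=''),
        paste('i', seq_len(k), sep='')), colnames(m)[-n])
    d
}

set.seed(1)
m <- matrix(runif(20 * 2), 20)   # 20 random 2d-points, one per row
k <- 3
res  <- sparseDist(m, k)
full <- as.matrix(dist(m))       # dense reference

# For point i, the reported distances should be the k smallest
# distances to the later points i+1..n (NA-padded when fewer remain).
ok <- vapply(seq_len(ncol(res)), function(i) {
    later    <- full[i, seq(i + 1L, nrow(m))]
    expected <- sort(later)[seq_len(k)]
    isTRUE(all.equal(unname(expected), unname(res[seq_len(k), i])))
}, logical(1))
all(ok)
```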

r distance sparse-matrix knn

For now I am using the following, inspired by this answer. The data points are the columns of x. The output is a k x n matrix where element (j, i) is the index of the data point that is the jth closest to point i.

    n <- 10
    d <- 3
    x <- matrix(rnorm(n * d), ncol = n)
    min.k.dists <- function(x,k=5) {
      apply(x,2,function(r) {
        b <- colSums((x - r)^2)
        o <- order(b)
        o[1:k]
      })
    }
    min.k.dists(x)  # first row should be 1:ncol(x); these points have distance 0
    dist(t(x))      # can check answer against this

If one is worried about how ties are handled and whatnot, perhaps rank() should be incorporated.
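To illustrate what the tie handling looks like, here is a small sketch with made-up squared distances. order() always returns exactly k indices and breaks ties by position, whereas a rank()-based selection (one possible approach, with `ties.method = "min"` chosen here as an assumption) keeps every point that is tied within the first k ranks.

```r
b <- c(0, 2, 1, 2, 1, 3)   # made-up squared distances; points 3 and 5 tie
k <- 2

# Exactly k indices; the tie between points 3 and 5 goes to the earlier one
ord <- order(b)[seq_len(k)]

# Every point whose rank is within the first k, so both tied points are kept
tied <- which(rank(b, ties.method = "min") <= k)

ord
tied
```

The rank() variant can return more than k indices per point, so the result no longer fits into a fixed k x n matrix; a list would be needed instead.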

The above code seems reasonably fast, but I'm sure it could be improved (though I don't have time to go the C or Fortran route). So I'm still open to fast and sparse implementations of the above.

Below I include a parallelized version that I ended up using:

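    min.k.dists <- function(x,k=5,cores=1) {
      # mclapply() is provided by the base 'parallel' package,
      # which superseded the older 'multicore' package
      library(parallel)
      xx <- as.list(as.data.frame(x))  # one list element per data point (column)
      names(xx) <- NULL
      m <- mclapply(xx,function(r) {
        b <- colSums((x - r)^2)
        o <- order(b)
        o[1:k]
      },mc.cores=cores)
      t(do.call(rbind,m))              # back to a k x n matrix
    }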


If you want to keep the logic of your min.k.dists function and return duplicate distances, you might want to consider modifying it a bit. It seems pointless to return the first row, which is always the point itself at distance 0, right? And by incorporating some of the tricks in my other answer, you can speed up your version by some 30%:

    min.k.dists2 <- function(x, k=4L) {
      k <- max(2L, k + 1L)
      apply(x, 2, function(r) {
        sort.list(colSums((x - r)^2), na.last=NA, method='quick')[2:k]
      })
    }

    > n<-1e4; m<-3; m=matrix(runif(n*m), n)
    > system.time(d <- min.k.dists(t(m), 4)) #To get 3 nearest neighbours and itself
       user  system elapsed
      17.26    0.00   17.30
    > system.time(d <- min.k.dists2(t(m), 3)) #To get 3 nearest neighbours
       user  system elapsed
       12.7     0.0    12.7
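As a quick check that the two versions agree, the sketch below compares them on small random data. Both functions are copied from this thread. The check assumes there are no duplicate points, since with exact ties the unstable 'quick' sort may order neighbours differently.

```r
# Both functions copied from the posts above
min.k.dists <- function(x, k=5) {
  apply(x, 2, function(r) {
    b <- colSums((x - r)^2)
    o <- order(b)
    o[1:k]
  })
}
min.k.dists2 <- function(x, k=4L) {
  k <- max(2L, k + 1L)
  apply(x, 2, function(r) {
    sort.list(colSums((x - r)^2), na.last=NA, method='quick')[2:k]
  })
}

set.seed(42)
x <- matrix(runif(3 * 50), nrow = 3)   # 50 random 3d-points, one per column

a <- min.k.dists(x, 4)[-1, ]   # drop the first row (each point itself)
b <- min.k.dists2(x, 3)        # 3 nearest neighbours, self excluded

same <- all(a == b)
same
```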
