To make a distance matrix or to repeatedly calculate distance


A third approach could be a combination of both: lazily evaluating the distance matrix. Initialize a matrix with sentinel values that cannot be real distances (for example, negative ones). When you need the distance between two points, check the matrix first: if the value is already present, just take it from there; otherwise, calculate it and store it in the matrix.
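For illustration, here is a minimal sketch of such a lazy matrix in Java. All names here (`LazyDistanceMatrix`, the `-1.0` sentinel, the distance-function parameter) are my own choices for the sketch, not something prescribed above:

```java
import java.util.Arrays;
import java.util.function.ToDoubleBiFunction;

// A minimal sketch of a lazily evaluated distance matrix. Entries start at
// -1.0 (an impossible distance) and are computed only on first access.
class LazyDistanceMatrix {
    private final double[][] points;
    private final ToDoubleBiFunction<double[], double[]> distance;
    private final double[][] cache;

    LazyDistanceMatrix(double[][] points, ToDoubleBiFunction<double[], double[]> distance) {
        this.points = points;
        this.distance = distance;
        this.cache = new double[points.length][points.length];
        for (double[] row : cache) Arrays.fill(row, -1.0); // sentinel: not yet computed
    }

    double get(int i, int j) {
        if (cache[i][j] < 0) {                        // the extra branch the text mentions
            double d = distance.applyAsDouble(points[i], points[j]);
            cache[i][j] = d;
            cache[j][i] = d;                          // distances are symmetric
        }
        return cache[i][j];
    }
}
```

Usage with, say, 2D Euclidean distance: `new LazyDistanceMatrix(points, (a, b) -> Math.hypot(a[0] - b[0], a[1] - b[1]))`.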

This approach trades computation (it is optimal in that it performs the lowest possible number of pairwise calculations) for more branches in the code and a few extra instructions per lookup. However, thanks to branch predictors, I expect this overhead will not be dramatic.
I predict it will perform better when the distance calculation is relatively expensive.

Another optimization could be to dynamically switch to a plain matrix implementation (calculating the remaining part of the matrix up front) once the number of already-computed entries exceeds a certain threshold. This can be done quite cleanly in OOP languages by switching the implementation behind an interface when the threshold is met, as sketched below.
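A hedged sketch of that switch-over, again in Java; the `DistanceMatrix` interface and both class names are hypothetical, chosen only to illustrate the idea:

```java
import java.util.Arrays;
import java.util.function.ToDoubleBiFunction;

interface DistanceMatrix {
    double get(int i, int j);
}

// Plain fully computed matrix: get() is a bare array lookup, no branches.
class PlainDistanceMatrix implements DistanceMatrix {
    private final double[][] cache;
    PlainDistanceMatrix(double[][] cache) { this.cache = cache; }
    public double get(int i, int j) { return cache[i][j]; }
}

// Lazy matrix that, once `threshold` pairs have been computed, fills in the
// remaining entries and swaps its delegate for a PlainDistanceMatrix.
class SwitchingDistanceMatrix implements DistanceMatrix {
    private final double[][] points;
    private final ToDoubleBiFunction<double[], double[]> distance;
    private final double[][] cache;      // -1.0 marks "not yet computed"
    private final int threshold;
    private int computed = 0;
    private DistanceMatrix delegate;     // null while still in lazy mode

    SwitchingDistanceMatrix(double[][] points,
                            ToDoubleBiFunction<double[], double[]> distance,
                            int threshold) {
        this.points = points;
        this.distance = distance;
        this.threshold = threshold;
        this.cache = new double[points.length][points.length];
        for (double[] row : cache) Arrays.fill(row, -1.0);
    }

    public double get(int i, int j) {
        if (delegate != null) return delegate.get(i, j);   // switched: plain lookup
        if (cache[i][j] < 0) {
            store(i, j);
            if (++computed >= threshold) {                 // crossed the threshold:
                fillRemaining();                           // finish the matrix and
                delegate = new PlainDistanceMatrix(cache); // switch implementations
            }
        }
        return cache[i][j];
    }

    private void store(int i, int j) {
        double d = distance.applyAsDouble(points[i], points[j]);
        cache[i][j] = d;
        cache[j][i] = d;
    }

    private void fillRemaining() {
        for (int a = 0; a < points.length; a++)
            for (int b = a; b < points.length; b++)
                if (cache[a][b] < 0) store(a, b);
    }
}
```

Callers only ever see the `DistanceMatrix` interface, so the switch is invisible to the clustering code.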

Which implementation is actually better will depend heavily on the cost of the distance function and on the data you are clustering, since some data sets will request the same pairs more often than others.
I suggest running a benchmark and using statistical tools to evaluate which method is actually better.