How do you compare the "similarity" between two dendrograms (in R)?


Comparing dendrograms is not quite the same as comparing hierarchical clusterings, because a dendrogram carries the lengths of the branches as well as the splits, but comparing the clusterings is a good start. I suggest reading E. B. Fowlkes & C. L. Mallows (1983), "A Method for Comparing Two Hierarchical Clusterings", Journal of the American Statistical Association 78 (383): 553–584.

Their approach is to cut both trees at each level k, compute a measure B_k that compares the two groupings into k clusters, and then examine the plot of B_k against k. B_k is based on pairs of objects: it counts the pairs that fall in the same cluster in both trees and scales that count by the number of same-cluster pairs in each tree separately.
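The pair counting above can be sketched in base R. This is only an illustration under some assumptions: both trees are `hclust` objects built from the same observations in the same order, `fm_index` is a helper name made up for this example, and the iris subset is arbitrary data. With m the k-by-k table of shared memberships and n the number of objects, it computes B_k = T_k / sqrt(P_k * Q_k), where T_k = sum(m^2) - n, P_k = sum(row totals^2) - n and Q_k = sum(column totals^2) - n.

```r
# Fowlkes-Mallows B_k for two hclust trees built on the same objects
# (fm_index is an illustrative helper, not a function from any package).
fm_index <- function(hc1, hc2, k) {
  cl1 <- cutree(hc1, k = k)      # cluster labels from tree 1, in data order
  cl2 <- cutree(hc2, k = k)      # cluster labels from tree 2, in data order
  m <- table(cl1, cl2)           # k x k matching table
  n <- length(cl1)
  Tk <- sum(m^2) - n             # pairs placed together in both trees
  Pk <- sum(rowSums(m)^2) - n    # pairs placed together in tree 1
  Qk <- sum(colSums(m)^2) - n    # pairs placed together in tree 2
  Tk / sqrt(Pk * Qk)
}

d <- dist(iris[1:30, -5])
hc_complete <- hclust(d, method = "complete")
hc_single <- hclust(d, method = "single")

ks <- 2:10
bk <- sapply(ks, function(k) fm_index(hc_complete, hc_single, k))
plot(ks, bk, type = "b", xlab = "k", ylab = "B_k")
```

B_k is 1 when the two cuts give identical groupings and moves towards 0 as they disagree. The sketch is undefined when every cluster is a singleton (k equal to the number of objects), so keep k below that.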

I am sure that one can write code based on this method, but first we would need to know how the dendrograms are represented in R.
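For reference, here is a small look at the two usual representations in base R, assuming the trees come from `stats::hclust` (other packages may store trees differently). An `hclust` object is a list describing the merges, while `as.dendrogram` turns it into a nested list with attributes on each node. The iris subset is only placeholder data.

```r
hc <- hclust(dist(iris[1:10, -5]), method = "complete")

names(hc)     # list components, including merge, height, order, labels
hc$merge      # (n - 1) x 2 matrix: negative = single object, positive = earlier merge
hc$height     # height at which each merge happens
hc$order      # leaf order used when plotting

dend <- as.dendrogram(hc)
class(dend)                # "dendrogram"
attr(dend, "members")      # number of leaves below the root
attr(dend, "height")       # height of the root node
str(dend, max.level = 2)   # nested-list structure, top two levels only
```

Functions such as `cutree` and `cophenetic` accept the `hclust` form directly, which is why it is often the easier one to compare.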


As you know, dendrograms arise from hierarchical clustering, so what you are really asking is how to compare the results of two hierarchical clustering runs. There are no standard metrics I know of, but I would look at the number of clusters found and compare membership similarity between like clusters. Here is a good overview of hierarchical clustering that my colleague wrote on clustering Scotch whiskies.
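One rough way to compare membership between like clusters is to cut both trees into the same number of groups and cross-tabulate the labels. The sketch below assumes both trees were built on the same observations; the choice of k = 3, the linkage methods and the iris subset are arbitrary, and `overlap` is just an illustrative summary, not a standard metric.

```r
d <- dist(iris[1:30, -5])
hc_a <- hclust(d, method = "complete")
hc_b <- hclust(d, method = "average")

k <- 3
members_a <- cutree(hc_a, k = k)
members_b <- cutree(hc_b, k = k)

# Rows: clusters from the first tree; columns: clusters from the second tree.
xtab <- table(complete = members_a, average = members_b)
xtab

# For each cluster of the first tree, the share of its members that end up
# in its best-matching cluster of the second tree.
overlap <- apply(xtab, 1, max) / rowSums(xtab)
overlap
```

A table with one dominant cell per row means the two runs largely agree at that k, up to a relabelling of the clusters.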


Have a look at this page:

I also asked a similar question here.

It seems we can use cophenetic correlation to measure the similarity between two dendrograms, but there seems to be no function for this purpose in R currently.

EDIT (2014-09-18): The cophenetic function in the stats package can calculate the cophenetic dissimilarity matrix, and the correlation can then be calculated with the cor function. As @Tal has pointed out, as.dendrogram returns the tree with a different leaf order, which gives wrong results if the correlation is calculated from the dendrogram objects. This is shown in the example for the cor_cophenetic function in the dendextend package:

```r
library(dendextend)  # also makes the %>% pipe available

set.seed(23235)
ss <- sample(1:150, 10)
hc1 <- iris[ss, -5] %>% dist %>% hclust("com")
hc2 <- iris[ss, -5] %>% dist %>% hclust("single")
dend1 <- as.dendrogram(hc1)
dend2 <- as.dendrogram(hc2)
# cutree(dend1)

cophenetic(hc1)
cophenetic(hc2)
# notice how the dist matrix for the dendrograms have different orders:
cophenetic(dend1)
cophenetic(dend2)

cor(cophenetic(hc1), cophenetic(hc2))     # 0.874
cor(cophenetic(dend1), cophenetic(dend2)) # 0.16
# the difference is because the order of the distance table in the case of
# stats:::cophenetic.dendrogram will change between dendrograms!
```
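If you want to stay in base R, one possible workaround is to put both cophenetic matrices into the same label order before correlating them. The sketch below assumes both trees carry the same set of unique leaf labels; `cor_coph_aligned` is a name invented for this example and is not the dendextend implementation.

```r
set.seed(23235)
ss <- sample(1:150, 10)
d <- dist(iris[ss, -5])
hc1 <- hclust(d, method = "complete")
hc2 <- hclust(d, method = "single")
dend1 <- as.dendrogram(hc1)
dend2 <- as.dendrogram(hc2)

# Cophenetic correlation after aligning rows and columns by leaf label
# (illustrative helper; works for hclust and dendrogram inputs).
cor_coph_aligned <- function(x, y) {
  m1 <- as.matrix(cophenetic(x))
  m2 <- as.matrix(cophenetic(y))
  labs <- sort(rownames(m1))
  stopifnot(setequal(labs, rownames(m2)))  # both trees must share labels
  m1 <- m1[labs, labs]
  m2 <- m2[labs, labs]
  # Correlate each pair of objects once (lower triangle only).
  cor(m1[lower.tri(m1)], m2[lower.tri(m2)])
}

cor_coph_aligned(hc1, hc2)
cor_coph_aligned(dend1, dend2)  # same value as the hclust version
```

Because the matrices are reordered by label, the result no longer depends on whether the trees are passed as hclust or dendrogram objects.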