Brian Zhang's blog

Statistics and other topics

Fast Hierarchical Clustering Using fastcluster

Posted Oct 25, 2019 · 1 min read

Do you use hierarchical clustering packages like R’s hclust or Python’s scipy.cluster.hierarchy.linkage in your workflow? If so, you’re using an \(O(N^3)\) algorithm¹ and should switch to the fastcluster package, which provides \(O(N^2)\) routines for the most commonly used types of clustering.

fastcluster is implemented in C++, with interfaces for C++, R, and Python. In particular, the Python interface mirrors scipy.cluster.hierarchy.linkage, and the R interface mirrors stats::hclust and flashClust::flashClust, so switching over is a no-brainer.
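Because the signatures match, the switch can be as small as one import line. A minimal sketch (the try/except fallback to scipy is my addition, so the snippet runs even where fastcluster isn’t installed):

```python
import numpy as np

try:
    from fastcluster import linkage  # O(N^2) implementation
except ImportError:
    from scipy.cluster.hierarchy import linkage  # slower fallback

# Cluster 100 random points in 5 dimensions with average linkage.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
Z = linkage(X, method="average")

# Z is the standard SciPy linkage matrix: (N - 1) rows of
# [cluster_i, cluster_j, merge_distance, new_cluster_size],
# so it plugs straight into scipy.cluster.hierarchy.dendrogram, fcluster, etc.
```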

For a performance comparison provided by the package’s author, take a look here. fastcluster is described in a Journal of Statistical Software publication from 2013, with algorithmic details in a 2011 arXiv paper. The key algorithms used to get to \(O(N^2)\) are a variant of Prim’s algorithm for minimum spanning trees and the nearest-neighbor chain algorithm. Note that centroid and median linkage still take \(O(N^3)\) time with this package.
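To give a feel for the nearest-neighbor chain idea, here is a toy pure-Python sketch for complete linkage (one of the reducible linkages the algorithm covers). The function name and structure are mine for illustration; the actual package implements this in optimized C++, and the sketch assumes no pathological distance ties:

```python
def nn_chain_complete(D):
    """Agglomerative clustering with complete linkage via the
    nearest-neighbor chain algorithm. Grow a chain of nearest
    neighbors until two clusters are each other's nearest neighbor
    (a reciprocal pair), then merge them. For reducible linkages the
    chain's remaining prefix stays valid, which bounds total work at
    O(N^2). D is a symmetric list-of-lists distance matrix; returns
    merges as (i, j, distance), with merged clusters tracked under
    the smaller member index."""
    n = len(D)
    D = [row[:] for row in D]  # work on a copy; we update it in place
    active = set(range(n))
    merges = []
    chain = []
    while len(active) > 1:
        if not chain:
            chain.append(min(active))  # restart the chain anywhere
        while True:
            a = chain[-1]
            # Nearest active neighbor of the chain's tip.
            b = min((x for x in active if x != a), key=lambda x: D[a][x])
            if len(chain) >= 2 and b == chain[-2]:
                break  # a and b are reciprocal nearest neighbors
            chain.append(b)
        chain.pop()
        chain.pop()
        merges.append((min(a, b), max(a, b), D[a][b]))
        keep, drop = min(a, b), max(a, b)
        active.discard(drop)
        for k in active:
            if k != keep:
                # Lance-Williams update for complete linkage:
                # d(merged, k) = max(d(a, k), d(b, k)).
                d = max(D[keep][k], D[drop][k])
                D[keep][k] = D[k][keep] = d
    return merges

# Four points on a line at 0, 1, 5, 7: expect {0,1} and {2,3} to
# merge first, then the two pairs to join.
D = [[0, 1, 5, 7], [1, 0, 4, 6], [5, 4, 0, 2], [7, 6, 2, 0]]
merges = nn_chain_complete(D)
```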

I haven’t read the details of these algorithms yet, but if they are correct, then the Wikipedia articles on hierarchical clustering and UPGMA (average linkage clustering) could use an update².

  1. Where \(N\) is the number of data points being clustered, obviously.

  2. As of October 2019.