Fast Hierarchical Clustering Using fastcluster

Do you use hierarchical clustering packages like R’s hclust or Python’s scipy.cluster.hierarchy.linkage in your workflow? If so, you’re using an \(O(N^3)\) algorithm¹ and should switch to the fastcluster package, which provides \(O(N^2)\) routines for the most commonly used types of clustering.

fastcluster is implemented in C++, with interfaces for C++, R, and Python. In particular, the Python interface mirrors scipy.cluster.hierarchy.linkage, and the R interface mirrors stats::hclust and flashClust::flashClust, so switching over is a no-brainer.

For a performance comparison provided by the package’s author, take a look here. fastcluster is described in a Journal of Statistical Software publication from 2013, with algorithmic details in a 2011 arXiv paper. The key algorithms used to get to \(O(N^2)\) are a variant of Prim’s algorithm for minimum spanning trees and the nearest-neighbor chain algorithm. Note that centroid and median linkage still take \(O(N^3)\) time with this package.

I haven’t read the details of these algorithms yet, but if they are correct, then the Wikipedia articles on hierarchical clustering and UPGMA (average linkage clustering) could use an update².

Where \(N\) is the number of data points being clustered, obviously.↩
As of October 2019.↩

Brian Zhang's blog

Statistics and other topics

Fast Hierarchical Clustering Using fastcluster