Do you use hierarchical clustering packages like R’s `hclust`

or Python’s `scipy.cluster.hierarchy.linkage`

in your workflow? If so, you’re using an \(O(N^3)\) algorithm^{1} and should switch to the `fastcluster`

package, which provides \(O(N^2)\) routines for the most commonly used types of clustering.

`fastcluster`

is implemented in C++, with interfaces for C++, R, and Python. In particular, the Python interface mirrors `scipy.cluster.hierarchy.linkage`

, and the R interface mirrors `stats::hclust`

and `flashClust::flashClust`

, so switching over is a no-brainer.

For a performance comparison provided by the package’s author, take a look here. `fastcluster`

is described in a Journal of Statistical Software publication from 2013, with algorithmic details in a 2011 arXiv paper. The key algorithms used to get to \(O(N^2)\) are a variant of Prim’s algorithm for minimum spanning trees and the nearest-neighbor chain algorithm. Note that centroid and median linkage still take \(O(N^3)\) time with this package.

I haven’t read the details of these algorithms yet, but if they are correct, then the Wikipedia articles on hierarchical clustering and UPGMA (average linkage clustering) could use an update^{2}.