Do you use hierarchical clustering packages like R’s hclust
or Python’s scipy.cluster.hierarchy.linkage
in your workflow? If so, you’re using an O(N3) algorithm1 and should switch to the fastcluster
package, which provides O(N2) routines for the most commonly used types of clustering.
fastcluster
is implemented in C++, with interfaces for C++, R, and Python. In particular, the Python interface mirrors scipy.cluster.hierarchy.linkage
, and the R interface mirrors stats::hclust
and flashClust::flashClust
, so switching over is a no-brainer.
For a performance comparison provided by the package’s author, take a look here. fastcluster
is described in a Journal of Statistical Software publication from 2013, with algorithmic details in a 2011 arXiv paper. The key algorithms used to get to O(N2) are a variant of Prim’s algorithm for minimum spanning trees and the nearest-neighbor chain algorithm. Note that centroid and median linkage still take O(N3) time with this package.
I haven’t read the details of these algorithms yet, but if they are correct, then the Wikipedia articles on hierarchical clustering and UPGMA (average linkage clustering) could use an update2.