Exact and Approximate Hierarchical Clustering Using A*
Abstract
Hierarchical clustering is known to be broadly applicable in myriad domains. Despite its extensive use, existing approximate inference methods are insufficient for applications that require either exact or high-quality approximate solutions.For example, in high energy physics, we are interested in discovering high-quality jet structures, which are hierarchical clusterings of particles.In this paper we view inference as a search problem and focus on inferring high-quality hierarchies for a given cost function, rather than using ad hoc (e.g., greedy or beam) methods. This leads naturally to the use of A*, which has seldom been used for clustering (with the notable exception of \citep{daume2007fast}). Unlike ad hoc search methods, A* carries with it optimality guarantees. However, applying A* search naively leads to a large space and time complexity. To address this challenge, we develop a novel augmented trellis data structure and dynamic programming algorithm for A* that result in substantially improved time and space complexity bounds while still computing the globally optimal hierarchical clustering. We demonstrate that our proposed method increases the number of points for which an exact solution can be found by 25$\%$ compared with previous work \cite{greenberg2020compact}. Furthermore, our approach yields a natural approximation that scales to larger datasets and achieves substantially higher quality results than ad hoc search baselines, motivating its use in applications demanding exact or high-quality approximations.