Sublinear Time Oracles for Hierarchical Clustering
Abstract
Learning graph cluster structure using few queries is a classical question in property testing, with the fundamental special case, namely expansion testing, considered in the seminal work of Goldreich and Ron[STOC'96].
The most recent result in this line of work, due to Gluch et al.[SODA'21], designs {\em clustering oracles} for $(k, \e)$-clusterable graphs. These oracles, given a graph whose vertex set can be partitioned into a disjoint union of $k$ clusters (i.e., good expanders) with outer conductances bounded by $\e\ll 1$, provide query access to an $O(\e \log k)$-approximation to this ground truth clustering in time $\approx 2^{\text{poly}(k/\e)} n^{1/2+O(\e)}$ per query.
In this paper we show that it is possible to learn the {\em hierarchical} cluster structure of $(k, \e)$-clusterable graphs in sublinear time. First, we show how to simulate the hierarchical clustering algorithm of Charikar-Chatziafratis[SODA'17] to approximate the Dasgupta cost of a $k$-clusterable graph to within a factor of $O(\sqrt{\log k})$ in $\approx \text{poly}(k)\cdot n^{1/2+O(\e)}$ time assuming oracle access to the clustering. Second, we introduce a natural hierarchical model of clusterable graphs, and give a bona fide clustering oracle for this model, i.e. a small space data structure that can answer hierarchical clustering queries in $\approx\text{poly}(k) \cdot n^{1/2+O(\e)}$ time per query. Notably, in both cases the query time depends on polynomially on the number $k$ of clusters. The second result is the main technical contribution of the paper, and relies on several structural properties of hierarchically clusterable graphs that we hope will be of independent interest in sublinear time spectral graph algorithms.
The most recent result in this line of work, due to Gluch et al.[SODA'21], designs {\em clustering oracles} for $(k, \e)$-clusterable graphs. These oracles, given a graph whose vertex set can be partitioned into a disjoint union of $k$ clusters (i.e., good expanders) with outer conductances bounded by $\e\ll 1$, provide query access to an $O(\e \log k)$-approximation to this ground truth clustering in time $\approx 2^{\text{poly}(k/\e)} n^{1/2+O(\e)}$ per query.
In this paper we show that it is possible to learn the {\em hierarchical} cluster structure of $(k, \e)$-clusterable graphs in sublinear time. First, we show how to simulate the hierarchical clustering algorithm of Charikar-Chatziafratis[SODA'17] to approximate the Dasgupta cost of a $k$-clusterable graph to within a factor of $O(\sqrt{\log k})$ in $\approx \text{poly}(k)\cdot n^{1/2+O(\e)}$ time assuming oracle access to the clustering. Second, we introduce a natural hierarchical model of clusterable graphs, and give a bona fide clustering oracle for this model, i.e. a small space data structure that can answer hierarchical clustering queries in $\approx\text{poly}(k) \cdot n^{1/2+O(\e)}$ time per query. Notably, in both cases the query time depends on polynomially on the number $k$ of clusters. The second result is the main technical contribution of the paper, and relies on several structural properties of hierarchically clusterable graphs that we hope will be of independent interest in sublinear time spectral graph algorithms.