Improving Ultrametrics Embeddings Through Coresets
Abstract
To tackle the curse of dimensionality in data analysis and
unsupervised learning, it is critical to be able to efficiently compute ``simple'' faithful representations of
the data that helps extract information, improves understanding and visualization of the structure.
When the dataset consists of $d$-dimensional vectors, simple representations
of the data may consist in trees or ultrametrics, and the goal is to best preserve the distances (i.e.: dissimilarity values)
between data elements.
To circumvent the quadratic running times of the most popular methods for fitting ultrametrics,
such as average, single, or complete linkage,~\citet{CKL20} recently presented a new algorithm
that for any $c \ge 1$, outputs in time $n^{1+O(1/c^2)}$ an ultrametric $\Delta$ such that for any two
points $u, v$, $\Delta(u, v)$ is within a multiplicative factor of $5c$ to the distance between $u$ and $v$ in
the ``best'' ultrametric representation.
We improve the above result and show how to improve the above guarantee from $5c$ to $\sqrt{2}c + \varepsilon$ while achieving
the same asymptotic running time. To further show the advantage of our new method, we experimentally analyze the performances
of our algorithm which indeed yields embeddings of significantly better quality for various real-world datasets.
unsupervised learning, it is critical to be able to efficiently compute ``simple'' faithful representations of
the data that helps extract information, improves understanding and visualization of the structure.
When the dataset consists of $d$-dimensional vectors, simple representations
of the data may consist in trees or ultrametrics, and the goal is to best preserve the distances (i.e.: dissimilarity values)
between data elements.
To circumvent the quadratic running times of the most popular methods for fitting ultrametrics,
such as average, single, or complete linkage,~\citet{CKL20} recently presented a new algorithm
that for any $c \ge 1$, outputs in time $n^{1+O(1/c^2)}$ an ultrametric $\Delta$ such that for any two
points $u, v$, $\Delta(u, v)$ is within a multiplicative factor of $5c$ to the distance between $u$ and $v$ in
the ``best'' ultrametric representation.
We improve the above result and show how to improve the above guarantee from $5c$ to $\sqrt{2}c + \varepsilon$ while achieving
the same asymptotic running time. To further show the advantage of our new method, we experimentally analyze the performances
of our algorithm which indeed yields embeddings of significantly better quality for various real-world datasets.