Jump to Content

Fitting Distances by Tree Metrics Minimizing the Total Error within a Constant Factor

Debarati Das
Evangelos Kipouridis
Mikkel Thorup
FOCS 2021
Google Scholar

Abstract

We consider the numerical taxonomy problem of fitting a positive distance function $\calD:{S\choose 2}\rightarrow \mathbb R_{>0}$ by a tree metric. We want a tree $T$ with positive edge weights and including $S$ among the vertices so that their distances in $T$ match those in $\calD$. A nice application is in evolutionary biology where the tree $T$ aims to approximate the branching process leading to the observed distances in $\calD$ [Cavalli-Sforza and Edwards 1967]. We consider the total error, that is the sum of distance errors over all pairs of points. We present a polynomial time algorithm minimizing the total error within a constant factor. We can do this both for general trees, and for the special case of ultrametrics with a root having the same distance to all vertices in $S$. The problems are known to be APX-hard, so a constant factor is the best we can hope for. The best previous approximation factor was $\calO((\log n)(\log \log n))$ by Ailon and Charikar [2005] who wrote ``Determining whether an $\calO(1)$ approximation can be obtained is a fascinating question".