Fitting Distances by Tree Metrics Minimizing the Total Error within a Constant Factor
Abstract
We consider the numerical taxonomy problem of fitting a positive distance function $\calD:{S\choose 2}\rightarrow \mathbb R_{>0}$ by a tree metric. We want a tree $T$ with positive edge weights and including $S$ among the vertices so that their distances in $T$ match
those in $\calD$. A nice application is in evolutionary biology
where the tree $T$ aims to approximate the branching process leading
to the observed distances in $\calD$ [Cavalli-Sforza and Edwards 1967].
We consider the total error, that is the sum of distance errors over
all pairs of points. We present a polynomial time algorithm
minimizing the total error within a constant factor. We can do this
both for general trees, and for the special case of ultrametrics with
a root having the same distance to all vertices in $S$.
The problems are known to be APX-hard, so a constant factor is the best we can hope for.
The best previous approximation factor was
$\calO((\log n)(\log \log n))$ by Ailon and Charikar [2005] who wrote ``Determining whether an
$\calO(1)$ approximation can be obtained is a fascinating question".
those in $\calD$. A nice application is in evolutionary biology
where the tree $T$ aims to approximate the branching process leading
to the observed distances in $\calD$ [Cavalli-Sforza and Edwards 1967].
We consider the total error, that is the sum of distance errors over
all pairs of points. We present a polynomial time algorithm
minimizing the total error within a constant factor. We can do this
both for general trees, and for the special case of ultrametrics with
a root having the same distance to all vertices in $S$.
The problems are known to be APX-hard, so a constant factor is the best we can hope for.
The best previous approximation factor was
$\calO((\log n)(\log \log n))$ by Ailon and Charikar [2005] who wrote ``Determining whether an
$\calO(1)$ approximation can be obtained is a fascinating question".