Approximate Cluster Recovery from Noisy Labels

Ashkan Norouzi-Fard
Ola Svensson
COLT (2022) (to appear)
Google Scholar

Abstract

Designing algorithms for machine learning problems beyond worst-case analysis and, in particular,analyzing the effect of side-information on the complexity of such problems is a very important line of research with many practical applications. In this paper we study the classic k-means clustering problem in the presence of noisy labels. In this problem we receive as input a set of points and a set of clustering labels generated by either an adversarial or random perturbation of the optimal solution. Our main goal is to formally study the effect of this extra information on the complexity of the k-means problem. In particular, in the context of random perturbations, we give an efficient algorithm that finds a clustering of cost within a factor 1 +o(1)of optimum even when the label of each point is perturbed with a large probability (think 99%). In contrast, we show that side-information with adversarial perturbations is as hard as the original problem even if only a small fraction of the labels are perturbed. We complement this negative result by giving a simple algorithm in the case when the adversary is only allowed to perturb anfraction of each cluster.