Reliable Distributed Clustering with Redundant Data Assignment
Abstract
In this work, we present distributed generalized clustering algorithms (with k-means and PCA as
special cases) that can handle large-scale data across multiple machines despite straggling or
unreliable machines. We propose a novel data assignment scheme that enables us to obtain global
information about the data even when some machines fail to respond. The assignment scheme leads to
distributed algorithms with good approximation guarantees for a variety of clustering and
dimensionality reduction problems.