Scalable Feature Selection via Distributed Diversity Maximization
Abstract
Feature selection is a fundamental problem in machine learning and data mining. Most feature selection algorithms are designed to run on a single machine (the centralized setting) and are therefore less applicable to very large datasets. Although some distributed methods have been proposed, most of them partition the data horizontally, which makes them ill-suited to datasets with a large number of features and a small number of instances. In this paper, we introduce a novel vertically distributable feature selection method that speeds up the selection process and handles very large datasets in a scalable manner. In general, feature selection methods aim to select relevant, non-redundant features (Minimum Redundancy and Maximum Relevance). Accounting for redundancy is much harder in a vertically distributed setting than in a centralized one, since no single machine has access to the whole dataset. To the best of our knowledge, this is the first vertically distributed filter method that handles redundancy and achieves results consistently comparable to those of centralized methods. We formalize the feature selection problem as a diversity maximization problem by introducing a mutual-information-based metric distance on the features. We show the effectiveness of our method through an extensive empirical study; in particular, our distributed method outperforms state-of-the-art centralized feature selection algorithms on a variety of datasets. On the theoretical side, we prove that the greedy algorithm used in our method achieves, with high probability, an approximation factor of 1/4 for the diversity maximization problem in the distributed setting. Furthermore, we improve this to an expected approximation factor of 8/25 by using multiplicity in our distribution (assigning each feature to multiple machines).
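To make the formalization concrete, below is a minimal single-machine sketch of the two ingredients named above: a mutual-information-based metric distance on (discretized) features and a greedy diversity-maximization step. It assumes the distance is the normalized variation of information, d(X, Y) = 1 - I(X; Y) / H(X, Y), a known metric on discrete variables, and uses a farthest-point greedy, one standard heuristic for diversity maximization; the function names (vi_distance, greedy_diverse_subset) are illustrative, not from the paper. In the distributed version, each machine would run the greedy on its own share of the feature columns and a final greedy pass would run over the union of the returned subsets.

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Empirical Shannon entropy (in bits) of a discrete sequence."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def vi_distance(x, y):
    """Normalized variation-of-information distance
    d(X, Y) = 1 - I(X; Y) / H(X, Y), a metric in [0, 1] on discrete
    variables (an assumed choice of the MI-based distance)."""
    h_x, h_y = entropy(x), entropy(y)
    h_xy = entropy(list(zip(x, y)))          # joint entropy H(X, Y)
    if h_xy == 0.0:                          # both features constant
        return 0.0
    mi = h_x + h_y - h_xy                    # mutual information I(X; Y)
    return 1.0 - mi / h_xy

def greedy_diverse_subset(features, k):
    """Farthest-point greedy: repeatedly add the feature whose minimum
    distance to the already selected set is largest."""
    min_dist = np.array([vi_distance(features[0], f) for f in features])
    min_dist[0] = -np.inf                    # feature 0 is the seed
    selected = [0]
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        for i, f in enumerate(features):
            min_dist[i] = min(min_dist[i], vi_distance(features[nxt], f))
        min_dist[nxt] = -np.inf              # never reselect
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.integers(0, 4, size=(100, 20))   # 100 instances, 20 discrete features
    cols = [X[:, j] for j in range(X.shape[1])]
    print(greedy_diverse_subset(cols, k=5))  # indices of 5 diverse features
```

Continuous features would need to be discretized (e.g., by equal-width binning) before the entropies are computed; the sketch assumes this has already been done.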