Scalable Feature Selection via Distributed Diversity Maximization
Abstract
Feature selection is a fundamental problem in machine learning and data mining. Most feature selection algorithms are designed to run on a single machine (the centralized setting) and are therefore less applicable to very large datasets. Although some distributed methods have been proposed, most of them partition the data horizontally, which makes them ill-suited to datasets with a large number of features and a small number of instances. In this paper, we introduce a novel vertically distributable feature selection method that speeds up the selection process and handles very large datasets in a scalable manner. In general, feature selection methods aim to select relevant, non-redundant features (Minimum Redundancy and Maximum Relevance). Accounting for redundancy is much harder in a vertically distributed setting than in a centralized one, since no single machine has access to the whole dataset. To the best of our knowledge, this is the first vertically distributed filter method that handles redundancy and achieves results consistently comparable to those of centralized methods. We formalize the feature selection problem as a diversity maximization problem by introducing a mutual-information-based metric distance on the features. We show the effectiveness of our method through an extensive empirical study; in particular, our distributed method outperforms state-of-the-art centralized feature selection algorithms on a variety of datasets. On the theoretical side, we prove that the greedy algorithm used in our method achieves, with high probability, an approximation factor of 1/4 for the diversity maximization problem in the distributed setting. Furthermore, we improve this to an expected approximation factor of 8/25 by using multiplicity in our distribution (assigning each feature to multiple machines).
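To make the formalization concrete, below is a minimal single-machine sketch of the two ingredients named above: a mutual-information-based metric distance on (discretized) features and a greedy diversity-maximization step. It assumes the distance is the normalized variation of information, d(X, Y) = 1 - I(X; Y) / H(X, Y), a known metric on discrete variables, and uses a farthest-point greedy, one standard heuristic for diversity maximization; the function names (vi_distance, greedy_diverse_subset) are illustrative, not from the paper. In the distributed version, each machine would run the greedy on its own share of the feature columns and a final greedy pass would run over the union of the returned subsets.

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Empirical Shannon entropy (in bits) of a discrete sequence."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def vi_distance(x, y):
    """Normalized variation-of-information distance
    d(X, Y) = 1 - I(X; Y) / H(X, Y), a metric in [0, 1] on discrete
    variables (an assumed choice of the MI-based distance)."""
    h_x, h_y = entropy(x), entropy(y)
    h_xy = entropy(list(zip(x, y)))          # joint entropy H(X, Y)
    if h_xy == 0.0:                          # both features constant
        return 0.0
    mi = h_x + h_y - h_xy                    # mutual information I(X; Y)
    return 1.0 - mi / h_xy

def greedy_diverse_subset(features, k):
    """Farthest-point greedy: repeatedly add the feature whose minimum
    distance to the already selected set is largest."""
    min_dist = np.array([vi_distance(features[0], f) for f in features])
    min_dist[0] = -np.inf                    # feature 0 is the seed
    selected = [0]
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        for i, f in enumerate(features):
            min_dist[i] = min(min_dist[i], vi_distance(features[nxt], f))
        min_dist[nxt] = -np.inf              # never reselect
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.integers(0, 4, size=(100, 20))   # 100 instances, 20 discrete features
    cols = [X[:, j] for j in range(X.shape[1])]
    print(greedy_diverse_subset(cols, k=5))  # indices of 5 diverse features
```

Continuous features would need to be discretized (e.g., by equal-width binning) before the entropies are computed; the sketch assumes this has already been done.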