MapReduce and Its Application to Massively Parallel Learning of Decision Tree Ensembles
Abstract
In this chapter we look at leveraging the MapReduce distributed computing framework
(Dean and Ghemawat, 2004) for parallelizing machine learning methods of wide
interest, with a specific focus on learning ensembles of classification or regression
trees. Building a production-ready implementation of a distributed learning algorithm
can be a complex task. With the wide and growing availability of MapReduce-capable
computing infrastructures, it is natural to ask whether such infrastructures may be of
use in parallelizing common data mining tasks such as tree learning. For many data
mining applications, MapReduce may offer scalability as well as ease of deployment
in a production setting (for reasons explained later).
We initially give an overview of MapReduce and outline its application to a classic
clustering algorithm, k-means. Subsequently, we focus on PLANET: a scalable distributed
framework for learning tree models over large datasets. PLANET defines tree
learning as a series of distributed computations and implements each one using the
MapReduce model. We show how this framework supports scalable construction of
classification and regression trees, as well as ensembles of such models. We discuss
the benefits and challenges of using a MapReduce compute cluster for tree learning
and demonstrate the scalability of this approach by applying it to a real-world learning
task from the domain of computational advertising.
MapReduce is a simple model for distributed computing that abstracts away many of
the difficulties in parallelizing data management operations across a cluster of commodity
machines. By using MapReduce, one can alleviate, if not eliminate, many complexities
such as data partitioning, scheduling tasks across many machines, handling machine
failures, and performing inter-machine communication. These properties have motivated
many companies to run MapReduce frameworks on their compute clusters for data
analysis and other data management tasks. MapReduce has become in some sense an
industry standard. For example, there are open-source implementations such as Hadoop
that can be run either in-house or on cloud computing services such as Amazon EC2.
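To make the programming model concrete, the following is a minimal sketch (not from the chapter) of the three conceptual phases of a MapReduce computation, simulated in a single Python process: a map phase that emits intermediate key-value pairs, a shuffle that groups values by key, and a reduce phase that aggregates each group. The function names (map_phase, shuffle, reduce_phase) and the word-count example are illustrative choices, not part of any real MapReduce API.

```python
from collections import defaultdict

def map_phase(records, mapper):
    # Apply the user-defined mapper to each input record,
    # collecting the intermediate (key, value) pairs it emits.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))
    return intermediate

def shuffle(pairs):
    # Group intermediate values by key; in a real framework this is the
    # distributed shuffle/sort step handled automatically by the runtime.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    # Apply the user-defined reducer to each key and its list of values.
    return {key: reducer(key, values) for key, values in groups.items()}

# The classic word-count example: the mapper emits (word, 1) for each
# word, and the reducer sums the counts for each word.
def word_count_mapper(line):
    return [(word, 1) for word in line.split()]

def word_count_reducer(word, counts):
    return sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(lines, word_count_mapper)),
                      word_count_reducer)
# counts["the"] == 3 and counts["fox"] == 2
```

The division of labor shown here is the point of the model: the user writes only the mapper and reducer, while partitioning, scheduling, shuffling, and fault handling belong to the framework.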