ExtDict: Extensible Dictionaries for Data- and Platform-Aware Large-Scale Learning
Abstract
This paper proposes ExtDict, a novel data- and
platform-aware framework for iterative analysis/learning of massive
and dense datasets. Iterative execution is prohibitively costly
on distributed architectures, where the cost of moving data
keeps growing relative to the cost of arithmetic computation.
ExtDict builds a performance model that quantifies
the computational cost of iterative analysis algorithms on a target
platform in terms of FLOPs, communication, and memory, which
characterize runtime, energy, and storage, respectively. The core
of ExtDict is a novel parametric data projection algorithm, called
Extensible Dictionary, that enables versatile and sparse representations
of the data to minimize this computational cost. We
show that ExtDict attains the optimal performance objective
under our quantified cost model through platform-aware tuning
of the Extensible Dictionary parameters. An accompanying API
automates the application of ExtDict to various algorithms,
datasets, and platforms. Proof-of-concept evaluations on massive,
dense datasets across different platforms demonstrate more than an
order-of-magnitude performance improvement over the
state of the art, within guaranteed user-defined error bounds.