Amit Sabne
Authored Publications
A Learned Performance Model for Tensor Processing Units
Charith Mendis
Mangpo Phothilimthana
Mike Burrows
Samuel J. Kaufman
Sudip Roy
MLSys (2021)
Abstract
Accurate hardware performance models are critical to efficient code generation.
They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration for a specific program.
However, they are difficult to develop because contemporary processors are complex, and the recent proliferation of deep learning accelerators has increased the development burden.
We demonstrate a method of learning performance models from a corpus of tensor computation graph programs for Tensor Processing Unit (TPU) instances. We show that our learned model outperforms a heavily-optimized analytical performance model on two tasks---tile-size selection and operator fusion---and that it helps an autotuner discover faster programs in a setting where access to TPUs is limited or expensive.
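The model's use is easiest to picture in the tile-size task: the compiler enumerates candidate tile sizes and keeps the one the model predicts to be fastest. The sketch below illustrates that selection loop; the feature vector and the stand-in linear scorer are invented for illustration and are not the paper's learned graph model.

```python
# Hypothetical sketch: picking a tile size with a learned cost model.
# featurize() and the linear scorer are stand-ins, not the paper's model.

def featurize(op_shape, tile):
    m, n = op_shape
    tm, tn = tile
    # Toy features: tile area, number of tiles, padding waste.
    num_tiles = ((m + tm - 1) // tm) * ((n + tn - 1) // tn)
    waste = num_tiles * tm * tn - m * n
    return [tm * tn, num_tiles, waste]

def predicted_cost(features, weights):
    # A trained model would replace this dot product.
    return sum(f * w for f, w in zip(features, weights))

def pick_tile(op_shape, candidates, weights):
    # Rank candidate tile sizes by predicted cost and keep the best.
    return min(candidates,
               key=lambda t: predicted_cost(featurize(op_shape, t), weights))

if __name__ == "__main__":
    weights = [1e-4, 1.0, 5e-3]              # stand-in for trained parameters
    candidates = [(8, 128), (16, 64), (32, 32)]
    print(pick_tile((512, 1000), candidates, weights))
```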
A Flexible Approach to Autotuning Multi-Pass Machine Learning Compilers
Berkin Ilbeyi
Bjarke Roune
Blake Hechtman
Emma Wang
Karthik Srinivasa Murthy
Mangpo Phothilimthana
Mike Burrows
Nikhil Sarda
Rezsa Farahani
Samuel J. Kaufman
Shen Wang
Sudip Roy
Yuanzhong Xu
PACT (2021)
Abstract
Search-based techniques have been demonstrated effective in solving complex optimization problems that arise in domain-specific compilers for machine learning (ML). Unfortunately, deploying such techniques in production compilers is impeded by two limitations. First, prior works require factorization of a computation graph into smaller subgraphs over which search is applied. This decomposition is not only non-trivial but also significantly limits the scope of optimization. Second, prior works require search to be applied in a single stage in the compilation flow, which does not fit with the multi-stage layered architecture of most production ML compilers.
This paper presents XTAT, an autotuner for production ML compilers that can tune both graph-level and subgraph-level optimizations across multiple compilation stages. XTAT applies XTAT-M, a flexible search methodology that defines a search formulation for joint optimizations by accurately modeling the interactions between different compiler passes. XTAT tunes tensor layouts, operator fusion decisions, tile sizes, and code generation parameters in XLA, a production ML compiler, using various search strategies. In an evaluation across 150 ML training and inference models on Tensor Processing Units (TPUs) at Google, XTAT offers up to 2.4x and an average 5% execution time speedup over the heavily-optimized XLA compiler.
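As a rough illustration of joint, multi-pass tuning, the sketch below searches a single configuration space that spans layouts, fusion decisions, and tile sizes. `compile_and_benchmark` is a placeholder for compiling with those choices and timing the result on hardware; it is not a real XLA or XTAT API, and the search strategy shown is plain random search rather than XTAT-M.

```python
import random

# Hypothetical sketch of a joint autotuning loop in the spirit of XTAT:
# one configuration covers decisions from several compiler passes.

SEARCH_SPACE = {
    "layout": ["row_major", "col_major"],
    "fusion": ["aggressive", "conservative"],
    "tile_size": [(8, 128), (16, 64), (32, 32)],
}

def compile_and_benchmark(config):
    # Placeholder: in a real tuner this compiles the model with `config`
    # and measures execution time on the target accelerator.
    return random.random()

def random_search(trials=50, seed=0):
    rng = random.Random(seed)
    best_config, best_time = None, float("inf")
    for _ in range(trials):
        config = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        runtime = compile_and_benchmark(config)
        if runtime < best_time:
            best_config, best_time = config, runtime
    return best_config, best_time

if __name__ == "__main__":
    print(random_search())
```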
Abstract
XLA (accelerated linear algebra) is a compiler-based linear algebra execution engine. It is the backend that powers machine learning frameworks such as TensorFlow and JAX at Google, on a variety of devices including CPUs, GPUs, and TPUs. This talk covers how the growth of ML has fueled accelerator architectures and how XLA and related technologies help obtain high performance from these accelerators.
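For readers who want to see XLA in action, the short example below uses JAX (assuming it is installed): `jax.jit` traces a Python function and compiles it with XLA for whichever backend is available.

```python
# Minimal JAX example: jax.jit compiles the function with XLA.
import jax
import jax.numpy as jnp

@jax.jit
def predict(w, x):
    # The whole computation is lowered to a single XLA executable,
    # allowing the matmul and the elementwise tanh to be fused.
    return jnp.tanh(x @ w)

x = jnp.ones((4, 8))
w = jnp.ones((8, 2))
print(predict(w, x))
```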
Fast Distributed Bandits for Online Recommendation Systems
Kanak Mahadik
Qingyun Wu
Shuai Li
International Parallel & Distributed Processing Symposium, IEEE (2019) (to appear)
Abstract
Contextual bandit algorithms are commonly used in recommender systems, where content popularity can change rapidly. These algorithms continuously learn good mappings between users and items, based on contexts associated with both the users and the items. Recent recommendation algorithms that learn clustering or social structures between users have exhibited higher recommendation accuracy. However, as the number of users and items in the environment increases, the time required to generate recommendations deteriorates significantly, so these algorithms cannot be deployed in practice. The state-of-the-art distributed bandit algorithm DCCB relies on a peer-to-peer network to share information among distributed workers; however, this approach does not scale well with the increasing number of users, and it suffers from slow discovery of clusters, resulting in accuracy degradation. To address these issues, we propose a novel distributed bandit-based algorithm called DistCLUB. DistCLUB lazily creates clusters in a distributed manner and dramatically reduces the amount of data shared over the network, achieving high scalability. It also finds clusters much faster, achieving better accuracy than the state-of-the-art algorithm. Evaluation over both real-world benchmark and synthetic datasets shows that DistCLUB is on average 8.87x faster than DCCB and achieves 14.5% higher prediction performance.
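For background, the sketch below implements a standard single-machine LinUCB contextual bandit of the kind that cluster-based algorithms such as DistCLUB build on; it is not the paper's distributed, cluster-based algorithm, and the synthetic reward is invented for illustration.

```python
import numpy as np

# Standard single-machine LinUCB: keeps a ridge-regression estimate per
# model and picks the arm with the highest upper confidence bound.

class LinUCB:
    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha
        self.A = np.eye(dim)        # Gram matrix of observed contexts
        self.b = np.zeros(dim)      # reward-weighted context sum

    def select(self, contexts):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        scores = [x @ theta + self.alpha * np.sqrt(x @ A_inv @ x)
                  for x in contexts]
        return int(np.argmax(scores))

    def update(self, context, reward):
        self.A += np.outer(context, context)
        self.b += reward * context

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bandit = LinUCB(dim=5)
    for _ in range(100):
        contexts = rng.normal(size=(10, 5))            # one context per item
        arm = bandit.select(contexts)
        reward = float(contexts[arm] @ np.ones(5) > 0)  # synthetic reward
        bandit.update(contexts[arm], reward)
```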