A Learned Performance Model for Tensor Processing Units

Amit Sabne

Charith Mendis

Mangpo Phothilimthana

Mike Burrows

Samuel J. Kaufman

Sudip Roy

Yanqi Zhou

MLSys (2021)

Abstract

Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration for a specific program. However, they are difficult to develop because contemporary processors are complex, and the recent proliferation of deep learning accelerators has increased the development burden. We demonstrate a method of learning performance models from a corpus of tensor computation graph programs for Tensor Processing Unit (TPU) instances. We show that our learned model outperforms a heavily optimized analytical performance model on two tasks (tile-size selection and operator fusion) and that it helps an autotuner discover faster programs in a setting where access to TPUs is limited or expensive.
