Amir Yazdanbakhsh
I joined Google Research as a Research Scientist in 2019, following a one-year AI residency. I am the co-founder and co-lead of the Machine Learning for Computer Architecture team. We leverage recent machine learning methods and advances to innovate and design better hardware accelerators. Our team's work has been covered by media outlets including ZDNet and InfoQ. I am also interested in designing large-scale distributed systems for training machine learning applications. To that end, I led the development of a massively scalable distributed reinforcement learning system that scales to TPU Pods and efficiently manages thousands of actors to solve complex, real-world tasks. As a case study, our team demonstrated how this highly scalable system enables reinforcement learning to accomplish chip placement in roughly an hour instead of the days or weeks of human effort it would otherwise take. I received my Ph.D. in computer science from the Georgia Institute of Technology. My Ph.D. work has been recognized by various awards, including the Microsoft PhD Fellowship and the Qualcomm Innovation Fellowship.
Authored Publications
Data-Driven Offline Optimization for Architecting Hardware Accelerators
Aviral Kumar
Sergey Levine
International Conference on Learning Representations 2022 (to appear)
Preview abstract
With the goal of achieving higher efficiency, the semiconductor industry has gradually reformed towards application-specific hardware accelerators. While such a paradigm shift is already starting to show promising results, designers need to spend considerable manual effort and perform a large number of time-consuming simulations to find accelerators that can accelerate multiple target applications while obeying design constraints. Moreover, such a "simulation-driven" approach must be re-run from scratch every time the target applications or constraints change. An alternative paradigm is a "data-driven", offline approach that utilizes logged simulation data to architect hardware accelerators, without needing any form of simulation. Such an approach not only alleviates the need to run time-consuming simulations, but also enables data reuse and applies even when target applications change. In this paper, we develop such a data-driven offline optimization method for designing hardware accelerators, PRIME, that enjoys all of these properties. Our approach learns a conservative, robust estimate of the desired cost function, utilizes infeasible points, and optimizes the design against this estimate without any additional simulator queries during optimization.
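The abstract describes the high-level recipe rather than code, but the general shape of such a workflow can be sketched in a few lines of Python. Everything below is an illustrative assumption: the synthetic logged data, the least-squares surrogate with a pessimism penalty on infeasible points, and the random-search optimizer all stand in for PRIME's actual conservative model and optimizer.

import numpy as np

rng = np.random.default_rng(0)

# Logged simulation data: accelerator configs X, measured cost y,
# plus configs that were logged as infeasible.
X_feasible = rng.uniform(0, 1, size=(256, 8))
y_feasible = (X_feasible ** 2).sum(axis=1) + rng.normal(0, 0.01, 256)
X_infeasible = rng.uniform(0, 1, size=(64, 8))

def fit_conservative_surrogate(Xf, yf, Xi, penalty=10.0):
    """Least-squares surrogate pushed upward (pessimistic) on infeasible
    points; 'penalty' is an illustrative knob, not the paper's objective."""
    X = np.vstack([Xf, Xi])
    # Infeasible points get an artificially high target cost.
    y = np.concatenate([yf, np.full(len(Xi), yf.max() * penalty)])
    X1 = np.hstack([X, np.ones((len(X), 1))])          # bias term
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return lambda Z: np.hstack([Z, np.ones((len(Z), 1))]) @ w

surrogate = fit_conservative_surrogate(X_feasible, y_feasible, X_infeasible)

# Optimize the design against the learned estimate with no further simulator
# queries (random search stands in for a real optimizer).
candidates = rng.uniform(0, 1, size=(10000, 8))
best = candidates[np.argmin(surrogate(candidates))]
print("best predicted design:", np.round(best, 3))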
Training Recipe for N:M Structured Sparsity with Decaying Pruning Mask
Sheng-Chun Kao
Shivani Agrawal
Suvinay Subramanian
Tushar Krishna
(2022) (to appear)
Preview abstract
Sparsity has become one of the promising methods to compress and accelerate Deep Neural Networks (DNNs). Among different categories of sparsity, structured sparsity has gained more attention due to its efficient execution on modern accelerators. In particular, N:M sparsity is attractive because there are already hardware accelerator architectures that can leverage certain forms of N:M structured sparsity in the model to yield higher compute efficiency. While there is a large body of work proposing various recipes for N:M structured sparsity training, compute-efficient training recipes for structured sparsity remain a less explored territory. In this work, we focus on N:M sparsity and extensively study and evaluate various training recipes for N:M sparsity in terms of the trade-off between model accuracy and compute cost of training (FLOPs). Building upon this study, we propose two new decay-based pruning methods, namely “pruning mask decay” and “sparse structure decay”. Our evaluations indicate that these proposed methods consistently deliver state-of-the-art (SOTA) model accuracy, comparable to unstructured sparsity, on a transformer-based model for a translation task. The increase in accuracy of the sparse model using the new training recipes comes at the cost of a marginal increase in the total training compute (FLOPs).
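As a rough illustration of the idea rather than the paper's exact recipe, the sketch below builds a 2:4 magnitude mask and linearly anneals how strongly pruned weights are attenuated over training; the group size, linear schedule, and function names are assumptions made for the sketch.

import numpy as np

def nm_mask(weights, n=2, m=4):
    """Keep the n largest-magnitude entries in every group of m along the
    last axis (the N:M structured sparsity pattern)."""
    w = weights.reshape(-1, m)
    # Indices of the (m - n) smallest magnitudes in each group are zeroed.
    idx = np.argsort(np.abs(w), axis=1)[:, : m - n]
    mask = np.ones_like(w)
    np.put_along_axis(mask, idx, 0.0, axis=1)
    return mask.reshape(weights.shape)

def decayed_weights(weights, step, total_steps, n=2, m=4):
    """'Pruning mask decay' in spirit: early in training pruned weights are
    only attenuated, and the attenuation decays to full pruning by the end."""
    keep_fraction = max(0.0, 1.0 - step / total_steps)  # assumed linear decay
    mask = nm_mask(weights, n, m)
    return weights * (mask + (1.0 - mask) * keep_fraction)

w = np.random.randn(4, 8)
print(decayed_weights(w, step=0, total_steps=100))    # nearly dense
print(decayed_weights(w, step=100, total_steps=100))  # strict 2:4 sparse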
Preview abstract
Self-attention is a key enabler of state-of-the-art accuracy in various transformer-based Natural Language Processing (NLP) models.
This attention mechanism calculates a correlation score for each word with respect to the other words in a sentence.
Commonly, only a small subset of words correlates highly with the word under attention, and that subset is only determined at runtime.
As such, a significant amount of computation associated with low attention scores is inconsequential and can potentially be pruned at runtime.
The challenge is finding the threshold for attention scores below which the following computation will be inconsequential.
Although the threshold is discrete, this paper formulates its search through a soft differentiable regularizer integrated into the training loss function.
This formulation enables piggybacking on back-propagation training to analytically co-optimize the threshold and the weights.
This analytical approach strikes a formally optimal balance between accuracy and computation pruning.
To best utilize this mathematical innovation, we devise a bit-serial architecture, dubbed LeOPArD (Learning thrEsholds for On-the-fly Pruning Acceleration of tRansformer moDels), for transformer language models with a bit-level early-termination microarchitectural mechanism.
We evaluate our proposed mathematics and hardware across 38 target back-end tasks defined for the MemN2N, BERT-Base, and BERT-Large state-of-the-art transformer models.
Post-layout results show that, on average, LeOPArD yields \SpeedupOverBaseline and \EnergyOverBaseline speedup and energy reduction, respectively. These improvements are achieved while keeping the average accuracy virtually intact (≤ 0.3% loss).
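A minimal sketch of the soft-threshold idea, under assumed hyperparameters: attention scores are gated by a smooth sigmoid around a threshold, and a small regularizer (to be added to the training loss) rewards pruning more computation. The paper's exact formulation, regularizer, and bit-serial hardware are not reproduced here.

import numpy as np

def soft_pruned_attention(scores, threshold, temperature=0.05, lam=0.01):
    """Smooth, differentiable gating of attention scores around a learned
    threshold; temperature and lam are illustrative hyperparameters."""
    gate = 1.0 / (1.0 + np.exp(-(scores - threshold) / temperature))
    pruned_scores = scores * gate
    # Regularizer added to the training loss: pushing the gate toward zero
    # prunes more of the low-score computation.
    regularizer = lam * gate.mean()
    return pruned_scores, gate, regularizer

scores = 0.5 * np.random.randn(4, 16)        # 4 heads, 16 keys for one query
pruned, gate, reg = soft_pruned_attention(scores, threshold=0.2)
print("fraction of scores kept:", (gate > 0.5).mean(), "regularizer:", reg)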
GRANITE: A Graph Neural Network Model for Basic Block Throughput Estimation
Mangpo Phothilimthana
Thirimadura C. Yasendra Mendis
2022 IEEE International Symposium on Workload Characterization (2022) (to appear)
Preview abstract
Analytical hardware performance models yield swift estimation of desired hardware performance metrics. However, developing these analytical models for modern processors with sophisticated microarchitectures is an extremely laborious task and requires a firm understanding of the target microarchitecture's internal structure. In this paper, we introduce GRANITE, a new machine learning model that estimates the throughput of basic blocks across different microarchitectures. GRANITE uses a graph representation of basic blocks that captures both structural and data dependencies between instructions. This representation is processed using a graph neural network that takes advantage of the relational information captured in the graph and learns a rich neural representation of the basic block that allows more precise throughput estimation. Our results establish a new state of the art for basic block performance estimation with an average test error of 6.9% across a wide range of basic blocks and microarchitectures for the x86-64 target. Compared to recent work, this reduces the error by 1.7% while improving training and inference throughput by approximately 3.0x. In addition, we propose the use of multi-task learning with independent multi-layer feedforward decoder networks. Our results show that this technique further improves the precision of all learned models while significantly reducing per-microarchitecture training costs. We perform an extensive set of ablation studies and comparisons with prior work, and conclude with a set of methods to achieve high accuracy for basic block performance estimation.
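To make the graph representation concrete, here is a toy sketch (not GRANITE's actual feature set or GNN) that turns a basic block into producer-to-consumer data-dependence edges; the (destination, sources) instruction encoding is an assumption for illustration, and a graph neural network would then consume these edges.

def basic_block_graph(instructions):
    """Toy dependency graph for a basic block: nodes are instructions and an
    edge i -> j is added when j reads a register that i last wrote.
    GRANITE's actual graph also encodes structural relations."""
    last_writer = {}                      # register -> index of latest writer
    edges = []
    for j, (dst, srcs) in enumerate(instructions):
        for reg in srcs:
            if reg in last_writer:
                edges.append((last_writer[reg], j))   # producer -> consumer
        last_writer[dst] = j
    return edges

# add rax, rbx ; imul rcx, rax ; add rdx, rcx  in the assumed (dst, srcs) form
block = [("rax", ["rax", "rbx"]),
         ("rcx", ["rcx", "rax"]),
         ("rdx", ["rdx", "rcx"])]
print(basic_block_graph(block))           # [(0, 1), (1, 2)]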
Preview abstract
Edge TPUs are a domain of accelerators for low-power edge devices and are widely used in various Google products such as Coral devices and Pixel 4. In this paper, we first discuss the major microarchitectural details of Edge TPUs. Then, we extensively evaluate three classes of Edge TPUs, covering both data-center and mobile-SoC ecosystems, that are used or in the pipeline to be used in Google products, across 423K unique convolutional neural networks. Building upon this extensive study, we discuss critical and interpretable microarchitectural insights about the studied classes of Edge TPUs. Finally, we present our ongoing efforts in developing high-accuracy learned machine learning models to estimate the major performance metrics of Edge TPU accelerators. These learned models enable significantly faster (on the order of milliseconds) evaluations of accelerators as an alternative to time-consuming cycle-accurate simulators and establish an exciting opportunity for rapid hardware/software co-design.
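As a hedged sketch of what a learned performance model looks like in spirit, the snippet below fits a linear latency estimator on hypothetical logged features; the features, coefficients, and data are invented for illustration and are not the models described above.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logged data: per-network features -> measured latency (ms).
# Assumed features: [MACs (millions), parameter size (MB), conv layer count].
X = rng.uniform(1, 100, size=(500, 3))
latency_ms = (0.02 * X[:, 0] + 0.005 * X[:, 1] + 0.1 * X[:, 2]
              + rng.normal(0, 0.1, 500))

# Linear learned cost model: evaluating it takes well under a millisecond,
# versus the much slower cycle-accurate simulation it substitutes for.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, latency_ms, rcond=None)

new_network = np.array([[50.0, 20.0, 12.0, 1.0]])     # features + bias term
print("predicted latency (ms):", (new_network @ coef).item())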
Apollo: Transferable Architecture Exploration
Albin Jones
Ravi Narayanaswami
Sat Chatterjee
ML for Systems Workshop at NeurIPS 2020
Preview abstract
The looming end of Moore's Law and the ascending use of deep learning drive the design of custom accelerators that are optimized for specific neural architectures.
Accelerator design forms a challenging constrained optimization problem over a complex, high-dimensional, and structured input space with a costly-to-evaluate objective function. Existing approaches for accelerator design are sample-inefficient and do not transfer knowledge between related optimization tasks with different design constraints (e.g., area budget) or neural architecture configurations. In this work, we propose a transferable architecture exploration framework, dubbed Apollo, that leverages recent advances in black-box function optimization for sample-efficient accelerator design. We use Apollo to optimize accelerator configurations for a diverse set of neural architectures with alternative design constraints. We show that Apollo finds optimal design configurations more sample-efficiently than baseline approaches. We further show that transferring knowledge between target architectures with different design constraints helps to find optimal configurations faster. This encouraging outcome portrays a promising path forward in shortening the timeline for accelerator design.
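The following is a minimal sketch, under invented parameter names and cost functions, of black-box accelerator search with a crude form of transfer (seeding a new search with the best configuration from a related task); Apollo's actual optimizers and transfer mechanism are more sophisticated.

import numpy as np

rng = np.random.default_rng(0)
SPACE = {"pes": [16, 32, 64, 128], "sram_kb": [256, 512, 1024], "bw": [1, 2, 4]}

def sample(seed_pool=None):
    """Draw a config, sometimes by mutating one found on a related task."""
    if seed_pool and rng.random() < 0.5:
        cfg = dict(seed_pool[rng.integers(len(seed_pool))])
        key = rng.choice(list(SPACE))
        cfg[key] = rng.choice(SPACE[key])         # single-parameter mutation
        return cfg
    return {k: rng.choice(v) for k, v in SPACE.items()}

def evaluate(cfg, area_budget):
    """Stand-in for the costly simulator: a runtime proxy plus an area check."""
    area = 0.5 * cfg["pes"] + 0.01 * cfg["sram_kb"]
    runtime = 1e4 / (cfg["pes"] * cfg["bw"]) + 50.0 / cfg["sram_kb"]
    return runtime if area <= area_budget else float("inf")

def search(area_budget, trials=200, seed_pool=None):
    configs = [sample(seed_pool) for _ in range(trials)]
    return min(((evaluate(c, area_budget), c) for c in configs),
               key=lambda t: t[0])

best_a = search(area_budget=40)                          # first task
best_b = search(area_budget=30, seed_pool=[best_a[1]])   # tighter budget, warm-started
print(best_a, best_b, sep="\n")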
ReLeQ: A Reinforcement Learning Approach for Automatic Deep Quantization of Neural Networks
Ahmed Taha Elthakeb
Prannoy Pilligundla
Fatemeh Mireshghallah
Hadi Esmaeilzadeh
IEEE Micro (2020)
Preview abstract
Deep quantization can significantly reduce DNN computation and storage by decreasing the bitwidth of network encodings. However, without arduous manual effort, this deep quantization can lead to significant accuracy loss, leaving it in a position of questionable utility. We propose a systematic approach to tackle this problem by automating the process of discovering the quantization levels through an end-to-end deep reinforcement learning framework (ReLeQ). This framework utilizes the sample efficiency of Proximal Policy Optimization (PPO) to explore the exponentially large space of possible assignments of quantization levels to layers. We show how ReLeQ can balance speed and quality, and provide a heterogeneous bitwidth assignment for quantization of a large variety of deep networks that virtually preserves accuracy (0.3% loss) while minimizing the computation and storage costs. With these DNNs, ReLeQ enables conventional hardware and a custom DNN accelerator to achieve a 2.2× speedup over 8-bit execution.
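To illustrate the objective being optimized, here is a hedged sketch in which random search stands in for the PPO agent; the layer sizes, reward weighting, and accuracy-drop proxy are all assumptions made for the sketch, not ReLeQ's implementation.

import numpy as np

rng = np.random.default_rng(0)
BITWIDTHS = np.array([2, 3, 4, 5, 8])
LAYER_PARAMS = np.array([1.2e6, 2.4e6, 4.8e6, 1.0e6])    # assumed layer sizes

def reward(bits, acc_drop):
    """Trade accuracy preservation against compute/storage cost relative to
    8-bit execution; the 0.5 weighting is an illustrative choice."""
    cost = float((bits * LAYER_PARAMS).sum() / (8 * LAYER_PARAMS.sum()))
    return -acc_drop - 0.5 * cost

def proxy_accuracy_drop(bits):
    """Placeholder for fine-tuning and evaluating the quantized network."""
    return float(np.clip(0.1 * np.clip(4 - bits, 0, None).sum()
                         + rng.normal(0, 0.02), 0, None))

# Random search stands in for the paper's PPO agent in this sketch.
best = max((rng.choice(BITWIDTHS, size=len(LAYER_PARAMS)) for _ in range(500)),
           key=lambda b: reward(b, proxy_accuracy_drop(b)))
print("chosen per-layer bitwidths:", best)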
Mixed-Signal Charge-Domain Acceleration of Deep Neural Networks through Interleaved Bit-Partitioned Arithmetic
Soroush Ghodrati
Hardik Sharma
Sean Kinzer
Jongse Park
Nam Sung Kim
Doug Burger
Hadi Esmaeilzadeh
29th International Conference on Parallel Architectures and Compilation Techniques (PACT), IEEE (2020)
Preview abstract
Albeit low power, mixed-signal circuitry suffers from the significant overhead of Analog-to-Digital (A/D) conversion, limited range for information encoding, and susceptibility to noise. This paper aims to address these challenges by offering and leveraging the following mathematical insight regarding the vector dot-product, the basic operator in Deep Neural Networks (DNNs): this operator can be reformulated as a wide regrouping of spatially parallel low-bitwidth calculations that are interleaved across the bit partitions of multiple elements of the vectors. As such, the computational building block of our accelerator becomes a wide bit-interleaved analog vector unit comprising a collection of low-bitwidth multiply-accumulate modules that operate in the analog domain and share a single A/D converter (ADC). This bit-partitioning results in a lower-resolution ADC, while the wide regrouping alleviates the need for A/D conversion per operation, amortizing its cost across multiple bit partitions of the vector elements. Moreover, the low-bitwidth modules require a smaller encoding range and also provide larger margins for noise mitigation. We also utilize a switched-capacitor design for our bit-level reformulation of DNN operations. The proposed switched-capacitor circuitry performs the regrouped multiplications in the charge domain and accumulates the results of the group in its capacitors over multiple cycles. The capacitive accumulation combined with wide bit-partitioned regrouping reduces the rate of A/D conversions, further improving the overall efficiency of the design. With this mathematical reformulation and its switched-capacitor implementation, we define one possible 3D-stacked microarchitecture, dubbed BiHiwe, that leverages clustering and hierarchical design to best utilize the power efficiency of the mixed-signal domain and 3D stacking. We also build models for noise, computational nonidealities, and variations. For ten DNN benchmarks, BiHiwe delivers a 5.5× speedup over Tetris, a leading purely-digital 3D-stacked accelerator, with less than 0.5% accuracy loss achieved by careful treatment of noise, computation error, and various forms of variation. Compared to the RTX 2080 Ti with tensor cores and the Titan Xp GPU, both with 8-bit execution, BiHiwe offers 35.4× and 70.1× higher performance-per-watt, respectively. Relative to the mixed-signal RedEye, ISAAC, and PipeLayer, BiHiwe offers 5.5×, 3.6×, and 9.6× improvements in performance-per-watt, respectively. The results suggest that BiHiwe is an effective initial step on a path that combines mathematics, circuits, and architecture.
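The arithmetic identity behind the bit-partitioned reformulation is easy to check numerically. The sketch below splits unsigned 8-bit operands into 4-bit partitions, computes the low-bitwidth partial dot products, and recombines them with shifts; signed operands, the sharing of a single ADC, and the switched-capacitor circuitry are beyond this toy example.

import numpy as np

def bit_partitioned_dot(a, b, part_bits=4):
    """Recompose an 8-bit dot product from 4-bit partial dot products, the
    identity behind wide bit-interleaved analog units (unsigned for simplicity)."""
    lo_a, hi_a = a & 0xF, a >> part_bits
    lo_b, hi_b = b & 0xF, b >> part_bits
    # Four groups of low-bitwidth MACs; each group could share one ADC.
    return (np.dot(lo_a, lo_b)
            + (np.dot(lo_a, hi_b) << part_bits)
            + (np.dot(hi_a, lo_b) << part_bits)
            + (np.dot(hi_a, hi_b) << (2 * part_bits)))

rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=64)
b = rng.integers(0, 256, size=64)
assert bit_partitioned_dot(a, b) == np.dot(a, b)
print(bit_partitioned_dot(a, b))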
Efficient Imitation Learning with Local Trajectory Optimization
Jialin Song
Navdeep Jaitly
Azalia Mirhoseini
ICML 2020 Workshop on Inductive Biases, Invariances and Generalization in RL (2020)
Preview abstract
Imitation learning is a powerful approach to optimize sequential decision-making policies from demonstrations. Most strategies in imitation learning rely on per-step supervision from pre-collected demonstrations, as in behavioral cloning, or from interactive expert policy queries, as in DAgger. In this work, we present a unified view of behavioral cloning and DAgger through the lens of local trajectory optimization, which offers a means of interpolating between them. We provide theoretical justification for the proposed local trajectory optimization algorithm and show empirically that our method, POLISH (Policy Optimization by Local Improvement through Search), is much faster than methods that plan globally, speeding up training by a factor of up to 14 in wall-clock time. Furthermore, the resulting policy outperforms strong baselines in both reinforcement learning and imitation learning.
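As a generic sketch of the behavioral-cloning/DAgger interpolation that the paper unifies (POLISH's local trajectory optimization itself is not shown), the toy loop below mixes expert and learner actions with a decaying coefficient and fits a one-parameter policy on the aggregated data; the environment, expert, and schedule are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)

def expert_action(state):          # toy expert: drive the state toward zero
    return -np.sign(state)

def rollout(policy_w, beta, horizon=20):
    """Collect one trajectory; with probability beta the expert acts
    (beta = 1 gives behavioral-cloning data, decaying beta is DAgger-style)."""
    states, labels, s = [], [], rng.normal()
    for _ in range(horizon):
        a_exp = expert_action(s)
        a = a_exp if rng.random() < beta else np.clip(policy_w * s, -1, 1)
        states.append(s)
        labels.append(a_exp)       # always label with the expert's action
        s = s + 0.5 * a + rng.normal(0, 0.05)
    return states, labels

# Aggregate data over iterations with a decaying mixing coefficient.
data_s, data_a, w = [], [], 0.0
for it in range(10):
    beta = 0.9 ** it               # assumed decay schedule
    s, a = rollout(w, beta)
    data_s += s
    data_a += a
    # One-parameter least-squares "policy" fit on the aggregated data.
    w = np.dot(data_s, data_a) / (np.dot(data_s, data_s) + 1e-8)
print("learned policy gain:", w)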