Amir Yazdanbakhsh

I joined Google Research as a Research Scientist in 2019, following a one-year AI residency. I am the co-founder and co-lead of the Machine Learning for Computer Architecture team. We leverage recent machine learning methods and advancements to innovate and design better hardware accelerators. The work of our team has been covered by media outlets including ZDNet and InfoQ. I am also interested in designing large-scale distributed systems for training machine learning applications. To that end, I led the development of a massively large-scale distributed reinforcement learning system that scales to a TPU Pod and efficiently manages thousands of actors to solve complex, real-world tasks. As a case study, our team demonstrates how this highly scalable system enables reinforcement learning to accomplish chip placement in about an hour, instead of the days or weeks required by human effort. I received my Ph.D. degree in computer science from the Georgia Institute of Technology. My Ph.D. work has been recognized by various awards, including the Microsoft PhD Fellowship and the Qualcomm Innovation Fellowship.
Authored Publications
    GRANITE: A Graph Neural Network Model for Basic Block Throughput Estimation
    Thirimadura C. Yasendra Mendis
    2022 IEEE International Symposium on Workload Characterization (2022) (to appear)
    Analytical hardware performance models yield swift estimation of desired hardware performance metrics. However, developing these analytical models for modern processors with sophisticated microarchitectures is an extremely laborious task and requires a firm understanding of the target microarchitecture's internal structure. In this paper, we introduce GRANITE, a new machine learning model that estimates the throughput of basic blocks across different microarchitectures. GRANITE uses a graph representation of basic blocks that captures both structural and data dependencies between instructions. This representation is processed using a graph neural network that takes advantage of the relational information captured in the graph and learns a rich neural representation of the basic block that allows more precise throughput estimation. Our results establish a new state of the art for basic block performance estimation with an average test error of 6.9% across a wide range of basic blocks and microarchitectures for the x86-64 target. Compared to recent work, this reduces the error by 1.7% while improving training and inference throughput by approximately 3.0x. In addition, we propose the use of multi-task learning with independent multi-layer feed-forward decoder networks. Our results show that this technique further improves the precision of all learned models while significantly reducing per-microarchitecture training costs. We perform an extensive set of ablation studies and comparisons with prior work, concluding with a set of methods to achieve high accuracy for basic block performance estimation.
    Training Recipe for N:M Structured Sparsity with Decaying Pruning Mask
    Sheng-Chun Kao
    Shivani Agrawal
    Suvinay Subramanian
    Tushar Krishna
    (2022) (to appear)
    Sparsity has become one of the promising methods to compress and accelerate Deep Neural Networks (DNNs). Among different categories of sparsity, structured sparsity has gained more attention due to its efficient execution on modern accelerators. In particular, N:M sparsity is attractive because there are already hardware accelerator architectures that can leverage a few forms of N:M structured sparsity in the model to yield higher compute efficiency. While there is a large body of work proposing various recipes for N:M structured sparsity training, compute-efficient training recipes for structured sparsity are a less explored territory. In this work, we focus on N:M sparsity and extensively study and evaluate various training recipes for N:M sparsity in terms of the trade-off between model accuracy and training compute cost (FLOPs). Building upon this study, we propose two new decay-based pruning methods, namely "pruning mask decay" and "sparse structure decay". Our evaluations indicate that these proposed methods consistently deliver SOTA model accuracy, comparable to unstructured sparsity, on a transformer-based model for a translation task. The increase in the accuracy of the sparse model using the new training recipes comes at the cost of a marginal increase in the total training compute (FLOPs).
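To make the N:M pattern in the abstract above concrete, here is a minimal Python sketch. It assumes the common magnitude-based rule: within every group of M consecutive weights, keep the N with the largest absolute value and zero the rest. The names `nm_mask` and `apply_mask` are hypothetical, and the sketch does not implement the decay-based recipes the paper proposes.

```python
def nm_mask(weights, n=2, m=4):
    """Return a 0/1 mask that keeps the n largest-magnitude weights
    in every group of m consecutive weights (assumed selection rule)."""
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    mask = [0] * len(weights)
    for start in range(0, len(weights), m):
        group = range(start, start + m)
        # Indices of the n entries with the largest absolute value.
        keep = sorted(group, key=lambda i: abs(weights[i]), reverse=True)[:n]
        for i in keep:
            mask[i] = 1
    return mask


def apply_mask(weights, mask):
    """Zero out every weight whose mask entry is 0."""
    return [w * k for w, k in zip(weights, mask)]


if __name__ == "__main__":
    w = [0.1, -0.9, 0.5, 0.05, 2.0, -0.3, 0.0, 1.5]
    mask = nm_mask(w, n=2, m=4)
    print(mask)
    print(apply_mask(w, mask))
```

With 2:4 sparsity, exactly half of the weights in each group of four survive, which is the regularity that lets hardware skip the zeroed positions.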
    Accelerating Attention through Gradient-Based Learned Runtime Pruning
    Hadi Esmaeilzadeh
    Mingu Kang
    Soroush Ghodrati
    Zheng Li
    ISCA (2022) (to appear)
    Self-attention is a key enabler to achieve state-of-the-art accuracy with various transformer-based Natural Language Processing (NLP) models. This attention mechanism calculates a correlation score for each word with respect to the other words in a sentence. Commonly, only a small subset of words correlate highly with the word under attention, which is only determined at runtime. As such, a significant amount of computation due to low attention scores is inconsequential and can potentially be pruned at runtime. The challenge is finding the threshold for attention scores below which the following computation will be inconsequential. Although the threshold is discrete, this paper formulates its search through a soft differentiable regularizer integrated into the loss function of the training. This formulation enables piggybacking on the back-propagation training to analytically co-optimize the threshold and the weights simultaneously. This analytical approach strikes a formally optimal balance between accuracy and computation pruning. To best utilize this mathematical innovation, we devise a bit-serial architecture, dubbed LeOPArd (Learning thrEsholds for On-the-fly Pruning Acceleration of tRansformer moDels), for transformer language models with a bit-level early termination microarchitectural mechanism. We evaluate our proposed mathematics and hardware across 38 target back-end tasks defined for the MemN2N, BERT-Base, and BERT-Large state-of-the-art transformer models. Post-layout results show that, on average, LeOPArd yields \SpeedupOverBaseline and \EnergyOverBaseline speedup and energy reduction, respectively. These improvements are achieved while keeping the average accuracy virtually intact (≤ 0.3% loss).
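The idea of pruning low attention scores at runtime can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's formulation: `pruned_softmax` applies a hard threshold at inference time, and `soft_gate` uses a sigmoid as an assumed stand-in for a soft, differentiable surrogate of that threshold. Both function names and the `temperature` parameter are hypothetical.

```python
import math


def pruned_softmax(scores, threshold):
    """Softmax over attention scores, skipping scores below the threshold.
    Pruned positions get probability 0 and need no further computation."""
    kept = [s for s in scores if s >= threshold]
    if not kept:
        return [0.0] * len(scores)
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if s >= threshold else 0.0 for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_gate(score, threshold, temperature=0.1):
    """Sigmoid gate: close to 1 above the threshold, close to 0 below it.
    Assumed differentiable surrogate for the hard comparison above."""
    z = (score - threshold) / temperature
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # stable branch for very negative z
    return e / (1.0 + e)


if __name__ == "__main__":
    scores = [2.0, -1.0, 0.5, -3.0]
    print(pruned_softmax(scores, threshold=0.0))
    print([round(soft_gate(s, 0.0), 4) for s in scores])
```

Because the gate is smooth in `threshold`, a gradient-based optimizer could in principle adjust the threshold alongside the weights, which is the property the abstract relies on.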
    Data-Driven Offline Optimization for Architecting Hardware Accelerators
    Aviral Kumar
    Sergey Levine
    International Conference on Learning Representations 2022 (to appear)
    With the goal of achieving higher efficiency, the semiconductor industry has gradually reformed towards application-specific hardware accelerators. While such a paradigm shift is already starting to show promising results, designers need to spend considerable manual effort and perform a large number of time-consuming simulations to find accelerators that can accelerate multiple target applications while obeying design constraints. Moreover, such a "simulation-driven" approach must be re-run from scratch every time the target applications or constraints change. An alternative paradigm is to use a "data-driven", offline approach that utilizes logged simulation data to architect hardware accelerators, without needing any form of simulation. Such an approach not only alleviates the need to run time-consuming simulations, but also enables data reuse and applies even when target applications change. In this paper, we develop such a data-driven offline optimization method for designing hardware accelerators, PRIME, that enjoys all of these properties. Our approach learns a conservative, robust estimate of the desired cost function, utilizes infeasible points, and optimizes the design against this estimate without any additional simulator queries during optimization.
    Edge TPUs are a domain of accelerators for low-power, edge devices and are widely used in various Google products such as Coral devices and Pixel 4. In this paper, we first discuss the major microarchitectural details of Edge TPUs. Then, we extensively evaluate three classes of Edge TPUs, covering both data-center and mobile-SoC ecosystems, that are used or in the pipeline to be used in Google products across 423K unique convolutional neural networks. Building upon this extensive study, we discuss critical and interpretable microarchitectural insights about the studied classes of Edge TPUs. Finally, we present our ongoing efforts in developing high-accuracy learned machine learning models to estimate the major performance metrics of Edge TPU accelerators. These learned models enable significantly faster (in the order of milliseconds) evaluations of accelerators as an alternative to time-consuming cycle-accurate simulators and establish an exciting opportunity for rapid hardware/software co-design.
    Efficient Imitation Learning with Local Trajectory Optimization
    Jialin Song
    Anna Darling Goldie
    Navdeep Jaitly
    Azalia Mirhoseini
    ICML 2020 Workshop on Inductive Biases, Invariances and Generalization in RL (2020)
    Imitation learning is a powerful approach to optimize sequential decision making policies from demonstrations. Most strategies in imitation learning rely on per-step supervision from pre-collected demonstrations as in behavioral cloning or from interactive expert policy queries such as DAgger. In this work, we present a unified view of behavioral cloning and DAgger through the lens of local trajectory optimization, which offers a means of interpolating between them. We provide theoretical justification for the proposed local trajectory optimization algorithm and show empirically that our method, POLISH (Policy Optimization by Local Improvement through Search), is much faster than methods that plan globally, speeding up training by a factor of up to 14 in wall clock time. Furthermore, the resulting policy outperforms strong baselines in both reinforcement learning and imitation learning.
    The looming end of Moore's Law and the ascending use of deep learning drive the design of custom accelerators that are optimized for specific neural architectures. Accelerator design forms a challenging constrained optimization problem over a complex, high-dimensional, and structured input space with a costly-to-evaluate objective function. Existing approaches for accelerator design are sample-inefficient and do not transfer knowledge between related optimization tasks with different design constraints (e.g. area budget) or neural architecture configurations. In this work, we propose a transferable architecture exploration framework, dubbed Apollo, that leverages recent advances in black-box function optimization for sample-efficient accelerator design. We use Apollo to optimize accelerator configurations of a diverse set of neural architectures with alternative design constraints. We show that Apollo finds optimal design configurations more sample-efficiently than baseline approaches. We further show that transferring knowledge between target architectures with different design constraints helps to find optimal configurations faster. This encouraging outcome portrays a promising path forward in shortening the timeline for accelerator design.
    Menger: Massively Large-Scale Distributed Reinforcement Learning
    Junchao Chen
    Yu Zheng
    NeurIPS, Beyond Backpropagation Workshop, 2020 (2020)
    ReLeQ: A Reinforcement Learning Approach for Automatic Deep Quantization of Neural Networks
    Ahmed Taha Elthakeb
    Prannoy Pilligundla
    Fatemeh Mireshghallah
    Hadi Esmaeilzadeh
    IEEE Micro (2020)
    Deep quantization can significantly reduce DNN computation and storage by decreasing the bitwidth of network encodings. However, without arduous manual effort, this deep quantization can lead to significant accuracy loss, leaving it in a position of questionable utility. We propose a systematic approach to tackle this problem by automating the process of discovering the quantization levels through an end-to-end deep reinforcement learning framework (RELEQ). This framework utilizes the sample efficiency of Proximal Policy Optimization (PPO) to explore the exponentially large space of possible assignments of quantization levels to the layers. We show how RELEQ can balance speed and quality, and provide a heterogeneous bitwidth assignment for quantization of a large variety of deep networks that virtually preserves the accuracy (0.3% loss) while minimizing the computation and storage costs. With these DNNs, RELEQ enables conventional hardware and custom DNN accelerators to achieve a 2.2× speedup over 8-bit execution.
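The trade-off that a per-layer bitwidth search has to navigate can be illustrated with a minimal Python sketch. It assumes plain uniform symmetric quantization, which the abstract does not specify, and the names `quantize` and `mean_abs_error` are hypothetical. It only shows how reconstruction error grows as the bitwidth shrinks; it does not implement the reinforcement learning search.

```python
def quantize(weights, bits):
    """Uniform symmetric quantization to a signed grid with `bits` bits
    (assumed scheme; the scale is set by the largest magnitude)."""
    if bits < 2:
        raise ValueError("bits must be at least 2 for a signed grid")
    levels = 2 ** (bits - 1) - 1  # e.g. 127 for 8 bits, 1 for 2 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return [0.0 for _ in weights]
    scale = max_abs / levels
    # Snap each weight to the nearest grid point, then map it back.
    return [round(w / scale) * scale for w in weights]


def mean_abs_error(weights, bits):
    """Average absolute difference between weights and their quantized form."""
    quantized = quantize(weights, bits)
    return sum(abs(w - q) for w, q in zip(weights, quantized)) / len(weights)


if __name__ == "__main__":
    layer = [0.5, -1.0, 0.25, 0.75]
    for bits in (8, 4, 2):
        print(bits, mean_abs_error(layer, bits))
```

A heterogeneous assignment would pick a different `bits` value for each layer, spending more bits where the error hurts accuracy most.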
    Mixed-Signal Charge-Domain Acceleration of Deep Neural Networks through Interleaved Bit-Partitioned Arithmetic
    Soroush Ghodrati
    Hardik Sharma
    Sean Kinzer
    Jongse Park
    Nam Sung Kim
    Doug Burger
    Hadi Esmaeilzadeh
    29th International Conference on Parallel Architectures and Compilation Techniques (PACT), IEEE (2020)
    Albeit low-power, mixed-signal circuitry suffers from the significant overhead of Analog to Digital (A/D) conversion, limited range for information encoding, and susceptibility to noise. This paper aims to address these challenges by offering and leveraging the following mathematical insight regarding the vector dot-product, the basic operator in Deep Neural Networks (DNNs). This operator can be reformulated as a wide regrouping of spatially parallel low-bitwidth calculations that are interleaved across the bit partitions of multiple elements of the vectors. As such, the computational building block of our accelerator becomes a wide bit-interleaved analog vector unit comprising a collection of low-bitwidth multiply-accumulate modules that operate in the analog domain and share a single A/D converter (ADC). This bit-partitioning results in a lower-resolution ADC, while the wide regrouping alleviates the need for A/D conversion per operation, amortizing its cost across multiple bit-partitions of the vector elements. Moreover, the low-bitwidth modules require a smaller encoding range and also provide larger margins for noise mitigation. We also utilize the switched-capacitor design for our bit-level reformulation of DNN operations. The proposed switched-capacitor circuitry performs the regrouped multiplications in the charge domain and accumulates the results of the group in its capacitors over multiple cycles. The capacitive accumulation combined with wide bit-partitioned regrouping reduces the rate of A/D conversions, further improving the overall efficiency of the design. With such mathematical reformulation and its switched-capacitor implementation, we define one possible 3D-stacked microarchitecture, dubbed BiHiwe, that leverages clustering and hierarchical design to best utilize the power efficiency of the mixed-signal domain and 3D stacking. We also build models for noise, computational nonidealities, and variations.
For ten DNN benchmarks, BiHiwe delivers a 5.5× speedup over Tetris, a leading purely-digital 3D-stacked accelerator, with less than 0.5% accuracy loss achieved by careful treatment of noise, computation error, and various forms of variation. Compared to RTX 2080 TI with tensor cores and Titan Xp GPUs, all with 8-bit execution, BiHiwe offers 35.4× and 70.1× higher Performance-per-Watt, respectively. Relative to the mixed-signal RedEye, ISAAC, and PipeLayer, BiHiwe offers 5.5×, 3.6×, and 9.6× improvement in Performance-per-Watt, respectively. The results suggest that BiHiwe is an effective initial step in a road that combines mathematics, circuits, and architecture.
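The arithmetic identity behind bit-partitioning can be checked with a short Python sketch. It is a purely digital illustration restricted to unsigned integers, under assumed parameters (8-bit operands split into 2-bit partitions), and the names `split_bits` and `bit_partitioned_dot` are hypothetical. It does not model the analog, charge-domain accumulation the abstract describes.

```python
def split_bits(value, part_bits, num_parts):
    """Split an unsigned integer into num_parts chunks of part_bits bits,
    least significant chunk first."""
    mask = (1 << part_bits) - 1
    return [(value >> (part_bits * i)) & mask for i in range(num_parts)]


def bit_partitioned_dot(xs, ws, total_bits=8, part_bits=2):
    """Dot product computed only from low-bitwidth partial products.
    For each pair of partition positions (i, j), the narrow products are
    summed across the whole vector first, then shifted once."""
    if total_bits % part_bits != 0:
        raise ValueError("total_bits must be a multiple of part_bits")
    num_parts = total_bits // part_bits
    x_parts = [split_bits(x, part_bits, num_parts) for x in xs]
    w_parts = [split_bits(w, part_bits, num_parts) for w in ws]
    result = 0
    for i in range(num_parts):
        for j in range(num_parts):
            # Wide group of 2-bit x 2-bit multiplies, accumulated together.
            partial = sum(xp[i] * wp[j] for xp, wp in zip(x_parts, w_parts))
            result += partial << (part_bits * (i + j))
    return result


if __name__ == "__main__":
    xs = [200, 17, 255, 3]
    ws = [5, 128, 255, 64]
    print(bit_partitioned_dot(xs, ws), sum(x * w for x, w in zip(xs, ws)))
```

Each `partial` is accumulated across all vector elements before a single shift-and-add, which mirrors why one conversion step can be shared by a whole group of low-bitwidth operations.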
    Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation
    Byung Hoon Ahn
    Prannoy Pilligundla
    Hadi Esmaeilzadeh
    International Conference on Learning Representations (2020) (to appear)
    Achieving faster execution with shorter compilation time can foster further diversity and innovation in neural networks. However, the current paradigm of executing neural networks relies on hand-optimized libraries, traditional compilation heuristics, or, very recently, genetic algorithms and other stochastic methods. These methods suffer from frequent costly hardware measurements, rendering them not only too time-consuming but also suboptimal. As such, we devise a solution that can learn to quickly adapt to a previously unseen design space for code optimization, both accelerating the search and improving the output performance. This solution, dubbed Chameleon, leverages reinforcement learning whose solution takes fewer steps to converge, and develops an adaptive sampling algorithm that not only focuses the costly samples (real hardware measurements) on representative points but also uses domain-knowledge-inspired logic to improve the samples themselves. Experimentation with real hardware shows that Chameleon provides a 4.45x speedup in optimization time over AutoTVM, while also improving the inference time of modern deep networks by 5.6%.
    AxMemo: Hardware-Compiler Co-design for Approximate Code Memoization
    Zhenhong Liu
    Dong Kai Wang
    Hadi Esmaeilzadeh
    Nam Sung Kim
    Proceedings of the 46th International Symposium on Computer Architecture, 2019, IEEE, 685–697
    Historically, continuous improvements in general-purpose processors have fueled the economic success and growth of the IT industry. However, the diminishing benefits from transistor scaling and conventional optimization techniques necessitate moving beyond common practices. Approximate computing is one such unconventional technique that has shown promise in pushing the boundaries of general-purpose processing. This paper sets out to employ approximation for processors that are commonly used in cyber-physical domains and may become building blocks of the Internet of Things. To this end, we propose AxMemo to exploit the computation redundancy that stems from data similarity in the inputs of code blocks. Such input behavior is prevalent in cyber-physical systems as they deal with real-world data that naturally harbors redundancy. Therefore, in contrast to existing memoization techniques that replace costly floating-point arithmetic operations with a limited number of inputs, AxMemo focuses on memoizing blocks of code with potentially many inputs. As such, AxMemo aims to replace long sequences of instructions with a few hash and lookup operations. By reducing the number of dynamic instructions, AxMemo alleviates the von Neumann and execution overheads of passing instructions through the processor pipeline altogether. The challenge AxMemo faces is to provide low-cost hashing mechanisms that can generate a rather unique signature for each multi-input combination. To address this challenge, we develop a novel use of Cyclic Redundancy Checking (CRC) to hash the inputs. To increase the lookup table hit rate, AxMemo employs a two-level memoization lookup, which utilizes a small dedicated SRAM and spare storage in the last-level cache. These solutions enable AxMemo to efficiently memoize relatively large code regions with variable input sizes and types using the same underlying hardware.
Our experiments show that AxMemo offers a 2.64× speedup and a 2.58× energy reduction with a mere 0.2% quality loss, averaged across ten benchmarks. These benefits come with an area overhead of just 2.1%.
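A software analogue of approximate memoization with a CRC signature can be sketched in Python. This is an illustration under assumptions, not the hardware design: inputs are rounded to a fixed number of decimals so that similar inputs collide on purpose (an assumed approximation rule), the signature comes from the standard library's `zlib.crc32`, and a single dictionary stands in for the two-level lookup. The class name `ApproxMemo` and its `decimals` parameter are hypothetical.

```python
import struct
import zlib


class ApproxMemo:
    """Memoize a multi-input function on a CRC32 signature of its
    rounded inputs, so nearby inputs reuse an earlier result."""

    def __init__(self, func, decimals=1):
        self.func = func
        self.decimals = decimals
        self.table = {}  # signature -> cached result
        self.hits = 0

    def __call__(self, *inputs):
        # Assumed approximation: drop precision before hashing.
        rounded = [round(float(x), self.decimals) for x in inputs]
        payload = struct.pack(f"<{len(rounded)}d", *rounded)
        signature = zlib.crc32(payload)
        if signature in self.table:
            self.hits += 1
            return self.table[signature]
        result = self.func(*inputs)
        self.table[signature] = result
        return result


if __name__ == "__main__":
    memo = ApproxMemo(lambda a, b: a * a + b, decimals=1)
    print(memo(1.01, 2.02))  # computed
    print(memo(1.02, 1.98))  # served from the table, slightly inexact
    print(memo.hits)
```

The second call returns the first call's result even though the inputs differ slightly, which is the accuracy-for-work trade the abstract quantifies as quality loss. A 32-bit signature can also collide for unrelated inputs, so a real design has to budget for that error as well.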
    ReLeQ: A Reinforcement Learning Approach for Deep Quantization of Neural Networks
    Ahmed T. Elthakeb
    Prannoy Pilligundla
    FatemehSadat Mireshghallah
    Hadi Esmaeilzadeh
    NeurIPS (2018)
    Despite numerous state-of-the-art applications of Deep Neural Networks (DNNs) in a wide range of real-world tasks, two major challenges hinder further advances in DNNs: hyperparameter optimization and constrained power resources, which are a significant concern in embedded devices. DNNs become increasingly difficult to train and deploy as they grow in size due to both computational intensity and the large memory footprint. Recent efforts show that quantizing weights of deep neural networks to lower bitwidths takes a significant step toward mitigating these issues by reducing memory bandwidth and computational requirements, which is important for deploying DNN models to devices with limited resources. This paper builds upon the algorithmic insight that the bitwidth of operations in DNNs can be reduced without compromising their classification accuracy. Deep quantization (quantizing to bitwidths below eight) while maintaining accuracy requires significant manual effort and hyper-parameter tuning, as well as re-training. This paper tackles the aforementioned problems by designing an end-to-end framework, dubbed ReLeQ, to automate DNN quantization. We formulate DNN quantization as an optimization problem and use a state-of-the-art policy-gradient-based Reinforcement Learning (RL) algorithm, Proximal Policy Optimization (PPO), to efficiently explore the large design space of DNN quantization and solve the defined optimization problem. To show the effectiveness of ReLeQ, we evaluated it across several neural networks, including those for MNIST, CIFAR10, and SVHN. ReLeQ quantizes the weights of these networks to average bitwidths of 2.25, 5, and 4, respectively, while maintaining the final accuracy loss below 0.3%.