Phitchaya Mangpo


Authored Publications
    Neural architecture search (NAS) has become an increasingly important tool within the deep learning community in recent years, yielding many practical advancements in the design of deep neural network architectures. However, most existing approaches operate within highly structured design spaces, and hence (1) explore only a small fraction of the full search space of neural architectures and (2) require significant manual effort from domain experts. In this work, we develop techniques that enable efficient NAS in a significantly larger design space. In particular, we propose to perform NAS in an abstract search space of program properties. Our key insights are as follows: (1) an abstract search space can be significantly smaller than the original search space, and (2) architectures with similar program properties should also have similar performance; thus, we can search more efficiently in the abstract search space. To enable this approach, we also introduce a novel efficient synthesis procedure, which performs the role of concretizing a set of promising program properties into a satisfying neural architecture. We implement our approach, αNAS, within an evolutionary framework, where the mutations are guided by the program properties. Starting with a ResNet-34 model, αNAS produces a model with slightly improved accuracy on CIFAR-10 but 96% fewer parameters. On ImageNet, αNAS is able to improve over Vision Transformer (30% fewer FLOPS and parameters), ResNet-50 (23% fewer FLOPS, 14% fewer parameters), and EfficientNet (7% fewer FLOPS and parameters) without any degradation in accuracy.
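The evolutionary loop described above can be pictured with a small Python sketch. It is not the paper's implementation: the property names, ranges, synthesis step, and fitness function below are invented stand-ins, chosen only to show how mutations over an abstract property vector get concretized and scored.

```python
# A toy property-guided evolutionary loop. The property names, ranges,
# synthesis step, and fitness function are invented stand-ins, not the
# paper's actual abstraction or objective.
import random

PROPERTIES = {"depth": (4, 64), "width": (16, 512), "kernel": (1, 7)}

def mutate(props):
    """Perturb one randomly chosen abstract property within its range."""
    key = random.choice(list(props))
    lo, hi = PROPERTIES[key]
    step = max(1, props[key] // 8)
    child = dict(props)
    child[key] = max(lo, min(hi, props[key] + random.choice([-1, 1]) * step))
    return child

def synthesize(props):
    """Stand-in for the synthesis procedure: concretize the abstract
    properties into an architecture (here, just a list of layer widths)."""
    return [props["width"]] * props["depth"]

def fitness(arch):
    """Placeholder objective that only rewards smaller models; the real
    search also evaluates validation accuracy."""
    return -sum(w * w for w in arch)

population = [{"depth": 34, "width": 64, "kernel": 3} for _ in range(8)]
for _ in range(50):
    parent = max(population, key=lambda p: fitness(synthesize(p)))
    population.append(mutate(parent))
    population.sort(key=lambda p: fitness(synthesize(p)), reverse=True)
    population = population[:8]  # keep the fittest candidates

print(population[0])
```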
    GRANITE: A Graph Neural Network Model for Basic Block Throughput Estimation
    Thirimadura C. Yasendra Mendis
    IEEE International Symposium on Workload Characterization (2022) (to appear)
    Analytical hardware performance models yield swift estimation of desired hardware performance metrics. However, developing these analytical models for modern processors with sophisticated microarchitectures is an extremely laborious task and requires a firm understanding of the target microarchitecture's internal structure. In this paper, we introduce GRANITE, a new machine learning model that estimates the throughput of basic blocks across different microarchitectures. GRANITE uses a graph representation of basic blocks that captures both structural and data dependencies between instructions. This representation is processed using a graph neural network that takes advantage of the relational information captured in the graph and learns a rich neural representation of the basic block that allows more precise throughput estimation. Our results establish a new state of the art for basic block performance estimation, with an average test error of 6.9% across a wide range of basic blocks and microarchitectures for the x86-64 target. Compared to recent work, this reduces the error by 1.7% while improving training and inference throughput by approximately 3.0x. In addition, we propose the use of multi-task learning with independent multi-layer feedforward decoder networks. Our results show that this technique further improves the precision of all learned models while significantly reducing per-microarchitecture training costs. We perform an extensive set of ablation studies and comparisons with prior work, and conclude with a set of methods for achieving high accuracy in basic block performance estimation.
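The graph construction this line of work relies on can be illustrated with a deliberately simplified sketch: each instruction in a basic block becomes a node, and an edge is added whenever an instruction reads a register last written by an earlier one. The three-instruction block and its read/write sets below are made up, and real x86-64 dependence analysis (flags, memory, implicit operands) is far more involved.

```python
# Build a toy dependency graph for a made-up three-instruction block:
# nodes are instructions, and an edge connects an instruction to any later
# instruction that reads a register it wrote.
block = [
    ("mov",  ["rax"],        ["rbx"]),   # (opcode, registers read, registers written)
    ("add",  ["rbx", "rcx"], ["rbx"]),
    ("imul", ["rbx"],        ["rdx"]),
]

edges = []
last_writer = {}
for i, (opcode, reads, writes) in enumerate(block):
    for reg in reads:
        if reg in last_writer:
            edges.append((last_writer[reg], i))  # data-dependency edge
    for reg in writes:
        last_writer[reg] = i                     # most recent writer of reg

print(edges)  # [(0, 1), (1, 2)]
```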
    A Flexible Approach to Autotuning Multi-Pass Machine Learning Compilers
    Berkin Ilbeyi
    Bjarke Roune
    Blake Hechtman
    Emma Wang
    Karthik Srinivasa Murthy
    Mike Burrows
    Nikhil Sarda
    Rezsa Farahani
    Samuel J. Kaufman
    Shen Wang
    Sudip Roy
    Yuanzhong Xu
    PACT (2021)
    Search-based techniques have proven effective in solving complex optimization problems that arise in domain-specific compilers for machine learning (ML). Unfortunately, deploying such techniques in production compilers is impeded by two limitations. First, prior works require factorization of a computation graph into smaller subgraphs over which search is applied. This decomposition is not only non-trivial but also significantly limits the scope of optimization. Second, prior works require search to be applied in a single stage in the compilation flow, which does not fit the multi-stage layered architecture of most production ML compilers. This paper presents XTAT, an autotuner for production ML compilers that can tune both graph-level and subgraph-level optimizations across multiple compilation stages. XTAT applies XTAT-M, a flexible search methodology that defines a search formulation for joint optimizations by accurately modeling the interactions between different compiler passes. XTAT tunes tensor layouts, operator fusion decisions, tile sizes, and code generation parameters in XLA, a production ML compiler, using various search strategies. In an evaluation across 150 ML training and inference models on Tensor Processing Units (TPUs) at Google, XTAT achieves execution time speedups of up to 2.4x, and 5% on average, over the heavily-optimized XLA compiler.
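A hedged sketch of the joint-tuning idea follows: a single configuration simultaneously fixes a graph-level layout choice, a fusion decision, and a subgraph-level tile size, and is scored end to end. The option names and the compile_and_benchmark stub are illustrative only, not XTAT's actual interface to XLA.

```python
# Exhaustively score a tiny joint space of graph-level and subgraph-level
# choices; a real autotuner would use smarter search strategies.
import itertools
import random

search_space = {
    "layout":    ["NHWC", "NCHW"],   # graph-level: tensor layout
    "fuse_bias": [True, False],      # graph-level: a fusion decision
    "tile_size": [64, 128, 256],     # subgraph-level: code generation knob
}

def compile_and_benchmark(config):
    """Stub for 'compile the model with this configuration and time it'."""
    penalty = 1.0 if config["fuse_bias"] else 1.2
    return random.uniform(0.9, 1.1) * penalty * config["tile_size"] ** 0.5

best = min(
    (dict(zip(search_space, values)) for values in itertools.product(*search_space.values())),
    key=compile_and_benchmark,
)
print("best joint configuration:", best)
```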
    One of the major optimizations employed in deep learning frameworks is graph rewriting. Production frameworks rely on heuristics to decide if rewrite rules should be applied and in which order. Prior research has shown that one can discover more optimal tensor computation graphs by searching for a better sequence of substitutions instead of relying on heuristics. However, we observe that existing approaches for tensor graph superoptimization, both in production and research frameworks, apply substitutions in a sequential manner. Such sequential search methods are sensitive to the order in which the substitutions are applied and often explore only a small fragment of the exponential space of equivalent graphs. This paper presents a novel technique for tensor graph superoptimization that employs equality saturation to apply all possible substitutions at once. We show that our approach can find optimized graphs with up to 16% speedup over the state of the art, while spending on average 48x less time optimizing.
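The contrast with sequential rewriting can be shown with a toy saturation loop: rather than committing to one rewrite order, keep expanding a set of equivalent expressions with every applicable rule until nothing new appears, then extract the cheapest form. Real systems use e-graphs to represent the exponential set of equivalents compactly; the string-based rewrites and cost function below are purely illustrative.

```python
# A toy saturation loop over string-level rewrites.
REWRITES = [
    ("(a*b)+(a*c)", "a*(b+c)"),  # factor out the common operand
    ("a*(b+c)", "(a*b)+(a*c)"),  # and the reverse direction
]

def cost(expr):
    # Pretend multiplications are twice as expensive as additions.
    return 2 * expr.count("*") + expr.count("+")

def saturate(expr):
    """Apply every applicable rewrite to every known form until no new
    equivalent expression appears."""
    seen = {expr}
    frontier = {expr}
    while frontier:
        new = set()
        for e in frontier:
            for lhs, rhs in REWRITES:
                if lhs in e:
                    new.add(e.replace(lhs, rhs))
        frontier = new - seen
        seen |= frontier
    return seen

equivalents = saturate("(a*b)+(a*c)")
print(min(equivalents, key=cost))  # a*(b+c)
```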
    Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration for a specific program. However, they are difficult to develop because contemporary processors are complex, and the recent proliferation of deep learning accelerators has increased the development burden. We demonstrate a method of learning performance models from a corpus of tensor computation graph programs for Tensor Processing Unit (TPU) instances. We show that our learned model outperforms a heavily-optimized analytical performance model on two tasks, tile-size selection and operator fusion, and that it helps an autotuner discover faster programs in a setting where access to TPUs is limited or expensive.
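A minimal sketch of how such a learned model slots into an autotuner, with made-up features and timings: fit a tiny cost model on measured (configuration, runtime) pairs, then rank candidate tile sizes with model predictions instead of running each candidate on scarce hardware. A least-squares linear model stands in for the neural model described above.

```python
# Fit a tiny learned cost model and use it to rank candidate tile sizes.
# All numbers are fabricated for illustration.
import numpy as np

# Hypothetical measurements: (tile_size, bytes_moved) -> runtime in ms.
features = np.array([[32.0, 4.0], [64.0, 2.5], [128.0, 2.0], [512.0, 3.5]])
runtimes = np.array([1.9, 1.2, 1.0, 1.6])

X = np.hstack([features, np.ones((len(features), 1))])   # add a bias column
weights, *_ = np.linalg.lstsq(X, runtimes, rcond=None)

def predicted_runtime(tile_size, bytes_moved):
    """Cheap model query, used in place of running on real hardware."""
    return np.array([tile_size, bytes_moved, 1.0]) @ weights

candidates = [(t, 256.0 / t + 1.5) for t in (32, 64, 128, 256, 512)]
best_tile, _ = min(candidates, key=lambda c: predicted_runtime(*c))
print("tile size chosen by the learned model:", best_tile)
```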
    Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration for a specific program. However, they are difficult to develop because contemporary processors are complex, and the recent proliferation of deep learning accelerators has increased the development burden. We demonstrate a method of learning performance models from a corpus of tensor computation graph programs for the Tensor Processing Unit (TPU). We train a neural network over kernel-level sub-graphs from the corpus and find that the learned model is competitive with a heavily-optimized analytical cost model used in the production XLA compiler.
    Graph Transformer: A Generalized Method for Computation Graph Optimizations
    Amirali Abdolrashidi
    Anna Darling Goldie
    Azalia Mirhoseini
    Daniel Wong
    Hanxiao Liu
    Qiumin Xu
    Shen Wang
    Sudip Roy
    (2020)
    Runtime and scalability of neural networks can be significantly affected by computational graph optimization during compilation. Most existing automated graph optimizations are impractical for deployment due to the significant amount of compute required and their inability to generalize to new, previously held-out graphs. To address both limitations, we propose an end-to-end deep reinforcement learning method named Graph Transformer (GTf), based on a scalable sequential attention mechanism over an inductive graph neural network that is transferable to new, unseen graphs. GTf generates decisions on the entire graph in a single-shot fashion, rather than on each individual node progressively, drastically speeding up the search compared to prior methods. Moreover, we propose recurrent attention layers to jointly optimize dependent graph optimization tasks, and demonstrate a 33%-60% speedup on three graph optimization tasks compared to TensorFlow's default optimizations. On a diverse set of representative graphs consisting of 1k-80k nodes, including Inception-v3, Transformer-XL, and WaveNet, GTf achieves an average 21% improvement over human experts and an 18% improvement over the prior art, with 15x faster convergence, on a device placement task evaluated in real systems.
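A loose numpy sketch of the single-shot idea: embed every node of a computation graph, then emit a device assignment for all nodes from one forward pass instead of placing nodes one at a time. The random features, single round of neighbor aggregation, and linear policy head below only stand in for the paper's inductive graph neural network and attention layers.

```python
# One forward pass produces a placement decision for every node at once.
import numpy as np

rng = np.random.default_rng(0)
num_nodes, feat_dim, num_devices = 6, 8, 2

node_features = rng.normal(size=(num_nodes, feat_dim))
adjacency = np.eye(num_nodes, k=1)  # a simple chain of operations

# One round of neighbor aggregation (stand-in for the graph network).
embeddings = node_features + adjacency @ node_features + adjacency.T @ node_features

# A single policy head scores every (node, device) pair in one shot.
policy_weights = rng.normal(size=(feat_dim, num_devices))
placement = (embeddings @ policy_weights).argmax(axis=1)
print(placement)  # one device index per node
```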
    Learned TPU Cost Model for XLA Tensor Programs
    Mike Burrows
    Samuel J. Kaufman
    Workshop on ML for Systems at NeurIPS (2019)
    At Google, we would like to develop a cost model that can accurately estimate the execution time of a machine learning model running on a Tensor Processing Unit (TPU). This cost model can be used by a compiler to make heuristic decisions, by an autotuner to find an optimal configuration for a specific program, and by Neural Architecture Search to co-optimize accuracy and inference time. However, building an accurate analytical cost model is challenging because of the complexity of modern processors. We propose to learn a cost model using a neural network. Our cost model uses a feedforward neural network to predict execution time from a graph embedding based on GraphSAGE. Our model's mean predictions are within 13% of the actual execution time.
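A bare-bones numpy sketch of the described pipeline, with a made-up op graph and random weights: GraphSAGE-style mean aggregation over the graph, mean-pooling into a graph embedding, and a small feedforward head that regresses execution time. Only the structure is meaningful; the printed prediction is not.

```python
# Random-weight skeleton: neighbor aggregation, pooling, feedforward head.
import numpy as np

rng = np.random.default_rng(1)
num_ops, feat_dim, hidden = 5, 6, 16

op_features = rng.normal(size=(num_ops, feat_dim))        # per-op features
adj = np.triu(np.ones((num_ops, num_ops)), k=1)           # toy op graph

# GraphSAGE-style step: concatenate each node with its neighbors' mean.
neighbor_mean = adj @ op_features / np.maximum(adj.sum(1, keepdims=True), 1)
w_embed = rng.normal(size=(2 * feat_dim, hidden))
node_emb = np.maximum(np.concatenate([op_features, neighbor_mean], axis=1) @ w_embed, 0)

graph_emb = node_emb.mean(axis=0)                          # pool to one vector

# Feedforward head predicting a single execution-time value.
w1, w2 = rng.normal(size=(hidden, hidden)), rng.normal(size=(hidden, 1))
predicted_time = np.maximum(graph_emb @ w1, 0) @ w2
print(predicted_time.item())
```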