Penporn Koanantakool

I’m a high-performance computing (HPC) person picking up machine learning (ML). My current research interest is applying HPC techniques to make ML computations faster. I'm also interested in using ML to help optimize programs. I work on improving TensorFlow's performance.

I received my Ph.D. in Computer Science from the University of California, Berkeley in 2017, advised by Professor Kathy Yelick. My dissertation focused on avoiding communication in large-scale scientific applications such as N-body algorithms and matrix computations on supercomputers, to achieve highly scalable and energy-efficient implementations. I received a B.Eng. in Computer Engineering from Kasetsart University in Bangkok, Thailand. I came to the United States for my graduate study on a Fulbright Scholarship.

My most recent project (prior to joining Google) was on massively parallel sparse inverse covariance matrix estimation (ICM). Sparse ICM is a popular tool for capturing the underlying dependency relationships in multivariate data. Unfortunately, most estimators are not scalable enough to handle the sizes of modern high-dimensional datasets. Our parallel proximal gradient method implementation, HP-CONCORD, demonstrates parallel scalability on tens of thousands of cores for problems with millions of dimensions. HP-CONCORD can be used to analyze real datasets, e.g., identifying the functional regions of the human brain from fMRI data. See more details, including the open source code, on HP-CONCORD's webpage.
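The proximal gradient idea behind this line of work can be sketched in a few lines. The following is a single-node NumPy illustration of ISTA on a CONCORD-style objective (a gradient step on the smooth part, then soft-thresholding of the off-diagonal entries), not the distributed HP-CONCORD implementation; the step size `tau` and iteration count are placeholder choices.

```python
import numpy as np

def soft_threshold(x, t):
    """Elementwise soft-thresholding: the proximal operator of the l1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def concord_ista(S, lam, tau=0.05, iters=300):
    """Proximal-gradient (ISTA) sketch for a CONCORD-style objective:
        -sum_i log(omega_ii) + 0.5 * tr(Omega S Omega) + lam * ||offdiag(Omega)||_1
    S is the sample covariance; Omega estimates the sparse inverse covariance.
    """
    p = S.shape[0]
    omega = np.eye(p)
    off_diag = ~np.eye(p, dtype=bool)
    for _ in range(iters):
        # Gradient of the smooth part of the objective.
        grad = -np.diag(1.0 / np.diag(omega)) + 0.5 * (S @ omega + omega @ S)
        step = omega - tau * grad
        # Shrink only off-diagonal entries; the diagonal is unpenalized.
        new = step.copy()
        new[off_diag] = soft_threshold(step[off_diag], tau * lam)
        omega = 0.5 * (new + new.T)  # keep the estimate symmetric
    return omega
```

The distributed version parallelizes the dense matrix products in the gradient, which dominate the cost at millions of dimensions.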

Authored Publications
Google Publications
    Compiler Support for Sparse Tensor Computations in MLIR
    Bixia Zheng
    Fredrik Kjolstad
    Nicolas Vasilache
    Tatiana Shpeisman
    ACM Transactions on Architecture and Code Optimization (2022) (to appear)
    Abstract: Sparse tensors arise in problems in science, engineering, machine learning, and data analytics. Programs that operate on such tensors can exploit sparsity to reduce storage requirements and computational time. Developing and maintaining sparse software by hand, however, is a complex and error-prone task. Therefore, we propose to treat sparsity as a property, not a tedious implementation detail, and let a sparse compiler generate sparse code automatically from a sparsity-agnostic definition of the computation. This paper discusses the integration of this idea into MLIR.
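To make the "sparsity as a property" idea concrete outside MLIR, here is a small Python sketch (an illustration, not the paper's compiler): the computation y = A @ x is written once, and the storage format chosen for A decides whether dense or sparse code runs. scipy.sparse supplies by hand, per format, the kind of kernel a sparse compiler generates automatically.

```python
import numpy as np
import scipy.sparse as sp

def matvec(A, x):
    """One sparsity-agnostic definition of y = A @ x; the storage format
    of A (dense ndarray or CSR) decides how the work is actually done."""
    return A @ x

# A mostly-zero matrix stored two ways.
dense = np.zeros((1000, 1000))
dense[::100, ::100] = 1.0       # only 100 nonzero entries
csr = sp.csr_matrix(dense)      # compressed sparse row storage

x = np.ones(1000)
y_dense = matvec(dense, x)
y_sparse = matvec(csr, x)

# Same result; the CSR version touches only the 100 stored nonzeros
# instead of all 1,000,000 dense slots.
assert np.allclose(y_dense, y_sparse)
```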
    Mesh-TensorFlow: Deep Learning for Supercomputers
    Noam Shazeer
    Youlong Cheng
    Niki J. Parmar
    Dustin Tran
    Ashish Vaswani
    Peter Hawkins
    HyoukJoong Lee
    Mingsheng Hong
    Cliff Young
    Ryan Sepassi
    Blake Hechtman
    NeurIPS (2018)
    Abstract: Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) training strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data (SPMD) programming. However, batch-splitting suffers from problems including the inability to train very large models (due to memory constraints), high latency, and inefficiency at small batch sizes. All of these can be solved by more general distribution strategies (model-parallelism). Unfortunately, efficient model-parallel algorithms tend to be complicated to discover, describe, and implement, particularly on large clusters. We introduce Mesh-TensorFlow, a language for specifying a general class of distributed tensor computations. Where data-parallelism can be viewed as splitting tensors and operations along the "batch" dimension, in Mesh-TensorFlow, the user can specify any tensor dimensions to be split across any dimensions of a multi-dimensional mesh of processors. A Mesh-TensorFlow graph compiles into an SPMD program consisting of parallel operations coupled with collective communication primitives such as Allreduce. We use Mesh-TensorFlow to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model. Using TPU meshes of up to 512 cores, we train Transformer models with up to 5 billion parameters, surpassing SOTA results on the WMT'14 English-to-French translation task and the One Billion Word language modeling benchmark. Mesh-TensorFlow is available at https://github.com/tensorflow/mesh
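The dimension-splitting described in the abstract can be sketched without the Mesh-TensorFlow API. The loop below emulates a one-axis processor mesh in plain NumPy: the "hidden" dimension of both operands is sharded, each "processor" computes a partial matmul, and summing the partials stands in for the Allreduce collective. The names and the mesh size are illustrative, not taken from the library.

```python
import numpy as np

MESH = 2  # number of processors along the "model" mesh axis

def sharded_matmul(x, w):
    """Model-parallel y = x @ w: the hidden dimension of both x and w is
    split across MESH processors; each computes a partial product, and an
    Allreduce (here simply a sum) combines them, mirroring the SPMD
    programs a Mesh-TensorFlow graph compiles into."""
    hidden = x.shape[1]
    chunk = hidden // MESH
    partials = []
    for p in range(MESH):  # each iteration stands in for one processor
        x_shard = x[:, p * chunk:(p + 1) * chunk]
        w_shard = w[p * chunk:(p + 1) * chunk, :]
        partials.append(x_shard @ w_shard)  # local compute on shard p
    return sum(partials)                    # collective: Allreduce

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))  # [batch=4, hidden=8] activations
w = rng.standard_normal((8, 3))  # [hidden=8, out=3] weights
assert np.allclose(sharded_matmul(x, w), x @ w)
```

Pure batch-splitting is the special case where only the batch dimension of x is sharded and no Allreduce over partial products is needed in the forward pass.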