Philip Pham

I am currently a software engineer at Waymo, where I apply machine learning to motion planning. Before that, I worked in Google Research, where my research focused on increasing model capacity and understanding the inductive biases of neural networks, with natural language processing (NLP) as the main application area. Earlier at Google, I worked on internal web applications. I studied mathematics at Duke University (B.S.) and the University of Pennsylvania (M.A.), and statistics at the University of Washington.
Authored Publications
    Long Range Arena: A Benchmark for Efficient Transformers
    Yi Tay
    Samira Abnar
    Yikang Shen
    Jinfeng Rao
    Liu Yang
    Sebastian Ruder
    ICLR 2021 (to appear)
    Transformers do not scale very well to long sequence lengths, largely because of quadratic self-attention complexity. In recent months, a wide spectrum of efficient, fast Transformers has been proposed to tackle this problem, more often than not claiming superior or comparable performance to vanilla Transformer models. To date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide spectrum of tasks and datasets makes it difficult to assess relative performance among many models. This paper proposes a systematic and unified benchmark, Long Range Arena (LRA), specifically focused on evaluating model quality under long-context scenarios. Our benchmark is a suite of tasks consisting of sequences ranging from 1K to 16K tokens, encompassing a wide range of data types and modalities such as text, natural and synthetic images, and mathematical expressions requiring similarity, structural, and visual-spatial reasoning. We systematically evaluate ten well-established long-range Transformer models (Reformers, Linformers, Linear Transformers, Sinkhorn Transformers, Performers, Synthesizers, Sparse Transformers and Longformers) on our newly proposed benchmark suite. LRA paves the way towards better understanding this class of efficient Transformer models, facilitates more research in this direction, and presents new challenging tasks to tackle.
    This paper proposes Omnidirectional Representations from Transformers (OmniNet). In OmniNet, instead of maintaining a strictly horizontal receptive field, each token is allowed to attend to all tokens in the entire network. This process can also be interpreted as a form of extreme or intensive attention mechanism that has the receptive field of the entire width and depth of the network. To this end, the omnidirectional attention is learned via a meta-learner, which is essentially another self-attention based model. To mitigate the computational cost of full receptive field attention, we leverage efficient self-attention models such as kernel-based attention (Choromanski et al., 2020), low-rank attention (Wang et al., 2020) and/or Big Bird (Zaheer et al., 2020) as the meta-learner. We conduct extensive experiments on autoregressive language modeling (LM1B, C4), Machine Translation, Long Range Arena (LRA) and Image Recognition, showing that OmniNet achieves considerable improvements not only when paired with sequence-based (1D) Transformers but also on image recognition (fine-tuning and few-shot learning) tasks. OmniNet also achieves state-of-the-art performance on LM1B, WMT'14 En-De/En-Fr and Long Range Arena.
    ReadTwice: Reading Very Large Documents with Memories
    Yury Zemlyanskiy
    Joshua Ainslie
    Michiel de Jong
    Ilya Eckstein
    Fei Sha
    Proceedings of NAACL (2021) (to appear)
    Knowledge-intensive tasks such as question answering often require assimilating information from different sections of large inputs, like books or collections of articles. We propose ReadTwice, a simple and effective approach that combines the advantages of existing approaches that modify Transformers to model long-range dependencies. The main idea is to read smaller segments of the text and summarize them into a memory table to be used in a second read of the text. We show that the model outperforms models of comparable size on several QA datasets and sets the state of the art on the challenging NarrativeQA dataset, which asks questions about entire books.
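    The two-pass idea described above can be illustrated with a toy sketch. This is not the ReadTwice model: the `summarize` function is a hypothetical stand-in (it simply keeps capitalized tokens) for the learned summaries, and all function and parameter names are assumptions made for illustration.

```python
def split_into_segments(tokens, segment_len):
    """Split a token list into consecutive segments of at most segment_len tokens."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def summarize(segment):
    """Hypothetical stand-in for a learned summary: keep capitalized tokens."""
    return [tok for tok in segment if tok[:1].isupper()]


def read_twice(tokens, segment_len):
    """Toy two-pass read: build a memory table, then pair each segment with it."""
    segments = split_into_segments(tokens, segment_len)

    # First read: each segment is summarized independently into a memory table.
    memory_table = [summarize(seg) for seg in segments]

    # Second read: each segment is presented together with the flattened memory,
    # so information from distant segments is available locally.
    memory = [entry for summary in memory_table for entry in summary]
    return [{"segment": seg, "memory": memory} for seg in segments]


tokens = "Alice met Bob in Paris and later Alice flew to Rome".split()
for item in read_twice(tokens, segment_len=4):
    print(item["segment"], "| memory:", item["memory"])
```

    In this sketch every segment sees the full memory table during the second read; a real model would attend over learned memory vectors rather than raw tokens.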
    We present Neural Structured Learning (NSL) in TensorFlow, a new learning paradigm to train neural networks by leveraging structured signals in addition to feature inputs. Structure can be explicit, as represented by a graph, or implicit, either induced by adversarial perturbation or inferred using techniques like embedding learning. NSL is open-sourced as part of the TensorFlow ecosystem and is widely used within Google across many products and services. In this tutorial, we provide an overview of the NSL framework, including various libraries, tools, and APIs, and demonstrate the practical use of NSL in different applications. The NSL website is hosted at www.tensorflow.org/neural_structured_learning, which includes details about the theoretical foundations of the technology, extensive API documentation, and hands-on tutorials.
    Big Bird: Transformers for Longer Sequences
    Guru Prashanth Guruganesh
    Avinava Dubey
    Joshua Ainslie
    Anirudh Ravula
    Qifan Wang
    Amr Mahmoud El Houssieny Ahmed
    NeurIPS (2020)
    Transformer-based models, such as BERT, have been among the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose BigBird, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis demonstrates the need for O(1) global tokens, such as CLS, that attend to the entire sequence as part of the sparse attention. We show that the proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context, BigBird drastically improves performance on various NLP tasks such as question answering.
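    To make the quadratic-to-linear claim concrete, here is a minimal sketch of a sparse attention pattern, assuming it combines a constant number of global tokens with a local window and a few random links. The abstract above only names the global tokens; the window and random links, along with every function and parameter name below, are assumptions of this sketch and not the paper's implementation.

```python
import random


def sparse_attention_pattern(seq_len, num_global=2, window=3, num_random=2, seed=0):
    """Return, for each query position, the sorted key positions it may attend to."""
    rng = random.Random(seed)
    global_positions = set(range(min(num_global, seq_len)))
    pattern = []
    for q in range(seq_len):
        if q in global_positions:
            # Global tokens attend to the entire sequence.
            keys = set(range(seq_len))
        else:
            # Every other token attends to the global tokens...
            keys = set(global_positions)
            # ...to a local window centered on itself...
            lo = max(0, q - window // 2)
            hi = min(seq_len, q + window // 2 + 1)
            keys.update(range(lo, hi))
            # ...and to a few randomly chosen positions.
            keys.update(rng.sample(range(seq_len), min(num_random, seq_len)))
        pattern.append(sorted(keys))
    return pattern


def count_links(pattern):
    """Total number of (query, key) pairs in the pattern."""
    return sum(len(keys) for keys in pattern)


# Compare the sparse link count with the n * n links of full attention.
for n in (64, 128, 256):
    print(n, count_links(sparse_attention_pattern(n)), n * n)
```

    With these defaults, each non-global row holds at most `num_global + window + num_random` keys, so the total number of links is bounded by a constant multiple of the sequence length, whereas full attention needs `n * n`.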
    As machine learning has become more and more integrated into our businesses and lifestyles, researchers have begun to recognize the necessity of ensuring machine learning systems are fair. Recently, there has been interest in defining a notion of fairness that mitigates over-representation in traditional clustering. In this paper, we extend this notion to hierarchical clustering, where the goal is to recursively partition the data to optimize a certain objective (Dasgupta). For various natural objectives, we obtain simple, efficient algorithms to find a provably good fair hierarchical clustering. Empirically, we show that our algorithms can find a fair hierarchical clustering, surprisingly, with only a negligible loss in the objective.
    ETC: Encoding Long and Structured Inputs in Transformers
    Anirudh Ravula
    Joshua Ainslie
    Qifan Wang
    Vaclav Cvicek
    2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)
    Transformer models have advanced the state of the art in many NLP tasks. In this paper, we present a new Transformer architecture, Extended Transformer Construction (ETC), that addresses two key limitations of existing architectures: scaling input length and ingesting structured inputs. The main innovation is a new global-local attention mechanism between a global memory and the input tokens, which allows scaling attention to longer inputs. We show that combining global-local attention with relative position encodings and a Contrastive Predictive Coding (CPC) pre-training task allows ETC to naturally handle structured data. We achieve new state-of-the-art results on two natural language datasets requiring long and/or structured inputs.