Anna Darling Goldie
Anna joined Google in 2013 and is a Senior Staff Research Scientist at Google DeepMind. At MIT, she earned a Bachelor's and a Master's in Computer Science, as well as a Bachelor's in Linguistics. She is currently a CS PhD candidate in the Stanford NLP Group.
Authored Publications
Graph Transformer: A Generalized Method for Computation Graph Optimizations
Amirali Abdolrashidi
Azalia Mirhoseini
Daniel Wong
Hanxiao Liu
Mangpo Phothilimthana
Qiumin Xu
Shen Wang
Sudip Roy
(2020)
Abstract
Runtime and scalability of neural networks can be significantly affected by computational graph optimization during compilation. Most existing automated graph optimizations are impractical for deployment due to the significant amount of compute required and their inability to generalize to new, previously held-out graphs. To address both limitations, we propose an end-to-end deep reinforcement learning method named Graph Transformer (GTf), based on a scalable sequential attention mechanism over an inductive graph neural network that is transferable to new, unseen graphs. GTf generates decisions on the entire graph in a single-shot fashion, rather than on each individual node progressively, drastically speeding up the search compared to prior methods. Moreover, we propose recurrent attention layers to jointly optimize dependent graph optimization tasks and demonstrate a 33%-60% speedup on three graph optimization tasks compared to TensorFlow's default optimizations. On a diverse set of representative graphs consisting of 1k-80k nodes, including Inception-v3, Transformer-XL, and WaveNet, GTf achieves an average 21% improvement over human experts and 18% improvement over the prior art, with 15x faster convergence, on a device placement task evaluated in real systems.
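The single-shot idea can be illustrated with a small sketch. This is not the paper's implementation: the GraphSAGE-style mean aggregation, the weight shapes, and the toy four-node graph are assumptions made for illustration; it only shows how an inductive embedding step followed by attention can emit a decision for every node in one forward pass.

```python
# Minimal sketch (assumptions throughout, not the GTf model): inductive node
# embeddings via neighbor aggregation, then attention over all nodes and one
# softmax head per node, so the whole graph is decided in a single pass.
import numpy as np

def node_embeddings(features, adjacency, w_self, w_neigh):
    """One round of inductive message passing: each node combines its own
    features with the mean of its neighbors' features."""
    deg = adjacency.sum(axis=1, keepdims=True).clip(min=1.0)
    neighbor_mean = adjacency @ features / deg
    return np.tanh(features @ w_self + neighbor_mean @ w_neigh)

def single_shot_decisions(embeddings, w_query, w_key, w_out):
    """Self-attention over all node embeddings, then one decision per node,
    emitted from a single forward pass rather than node by node."""
    q, k = embeddings @ w_query, embeddings @ w_key
    scores = np.exp(q @ k.T / np.sqrt(q.shape[1]))
    attention = scores / scores.sum(axis=1, keepdims=True)
    context = attention @ embeddings
    return (context @ w_out).argmax(axis=1)  # one choice (e.g. a device) per node

# Toy 4-node computation graph, 8-dim op features, 2 candidate devices.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8))
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]], dtype=float)
emb = node_embeddings(features, adjacency,
                      rng.normal(size=(8, 16)), rng.normal(size=(8, 16)))
print(single_shot_decisions(emb, rng.normal(size=(16, 16)),
                            rng.normal(size=(16, 16)), rng.normal(size=(16, 2))))
```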
Efficient Imitation Learning with Local Trajectory Optimization
Jialin Song
Navdeep Jaitly
Azalia Mirhoseini
ICML 2020 Workshop on Inductive Biases, Invariances and Generalization in RL (2020)
Abstract
Imitation learning is a powerful approach to optimize sequential decision making policies from demonstrations. Most strategies in imitation learning rely on per-step supervision from pre-collected demonstrations as in behavioral cloning or from interactive expert policy queries such as DAgger. In this work, we present a unified view of behavioral cloning and DAgger through the lens of local trajectory optimization, which offers a means of interpolating between them. We provide theoretical justification for the proposed local trajectory optimization algorithm and show empirically that our method, POLISH (Policy Optimization by Local Improvement through Search), is much faster than methods that plan globally, speeding up training by a factor of up to 14 in wall clock time. Furthermore, the resulting policy outperforms strong baselines in both reinforcement learning and imitation learning.
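As a rough illustration of drawing supervision from local search, here is a toy sketch. It is not the POLISH algorithm itself: the chain environment, the depth-limited lookahead scoring, and the count-based policy update are all assumptions, chosen only to show how labels can come from a short local search around the learner's own visited states rather than from a global planner.

```python
# Toy imitation loop (assumptions, not the authors' code): roll out the
# learner, and at each visited state label it with the first action of the
# best LOOKAHEAD-step action sequence found by a small local search.
import numpy as np

N_STATES, GOAL, HORIZON, LOOKAHEAD = 10, 9, 12, 3
ACTIONS = (-1, +1)

def step(s, a):
    return int(np.clip(s + a, 0, N_STATES - 1))

def local_search(s, depth):
    """Return the first action of the best depth-limited action sequence,
    scoring sequences by how close they end to the goal."""
    best_a, best_score = ACTIONS[0], -np.inf
    for a0 in ACTIONS:
        frontier = [step(s, a0)]
        for _ in range(depth - 1):
            frontier = [step(x, a) for x in frontier for a in ACTIONS]
        score = max(-abs(x - GOAL) for x in frontier)
        if score > best_score:
            best_a, best_score = a0, score
    return best_a

policy = np.zeros((N_STATES, len(ACTIONS)))  # tabular action preferences
for _ in range(20):
    s = 0
    for _ in range(HORIZON):
        a_idx = int(np.argmax(policy[s] + 0.1 * np.random.gumbel(size=2)))
        label = ACTIONS.index(local_search(s, LOOKAHEAD))  # local supervision
        policy[s, label] += 1.0                            # simple count update
        s = step(s, ACTIONS[a_idx])
print("greedy action per state:",
      [ACTIONS[int(np.argmax(policy[s]))] for s in range(N_STATES)])
```

With depth 1 the search reduces to greedy per-step labeling on the learner's own states; larger depths plan further ahead, but only locally, which is the knob the sketch exposes.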
Abstract
Graph partitioning is the problem of dividing the nodes of a graph into balanced partitions while minimizing the edge cut across the partitions. Due to its combinatorial nature, many approximate solutions have been developed. We propose GAP, a Generalizable Approximate Partitioning framework that takes a deep learning approach to graph partitioning. We define a differentiable loss function that represents the partitioning objective. Unlike baselines that redo the optimization per graph, GAP is capable of generalization, allowing us to train models that produce performant partitions at inference time, even on unseen graphs. Furthermore, because we learn the representation of the graph while jointly optimizing for the partitioning loss function, GAP can be easily tuned for a variety of graph structures. We evaluate the performance of GAP on graphs of varying sizes and structures, including graphs of widely used machine learning models (e.g., ResNet, VGG, and Inception-V3), scale-free graphs, and random graphs.
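A hedged sketch of what a differentiable partitioning objective can look like, in the spirit of GAP but not its actual code: soft node-to-partition assignments feed an expected-normalized-cut term plus a balance penalty, both smooth in the assignment probabilities. The matrix names and the toy two-triangle graph are assumptions.

```python
# Differentiable partitioning loss sketch: lower is better, and both terms are
# smooth functions of the soft assignment matrix, so they can be minimized by
# gradient descent on whatever model produces the assignments.
import numpy as np

def partition_loss(adj, probs):
    """adj: (n, n) symmetric adjacency; probs: (n, g) soft assignments, rows sum to 1."""
    degrees = adj.sum(axis=1)                      # node degrees
    volume = probs.T @ degrees                     # expected volume per partition
    # Expected cut mass: edges whose endpoints land in different partitions.
    expected_cut = ((probs / volume) * (adj @ (1.0 - probs))).sum()
    n, g = probs.shape
    balance = ((probs.sum(axis=0) - n / g) ** 2).sum()
    return expected_cut + balance

# Toy: two triangles joined by a single edge; a 2-way split should cut only it.
adj = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0
clean = np.array([[1, 0]] * 3 + [[0, 1]] * 3, dtype=float)  # one triangle per side
uniform = np.full((6, 2), 0.5)                               # maximally uncertain
print("clean split loss:", round(partition_loss(adj, clean), 3))
print("uniform loss:    ", round(partition_loss(adj, uniform), 3))
```

The clean split pays only for the single bridging edge, while the uniform assignment pays the full expected cut; that gap is exactly what gradient descent on the soft assignments can exploit.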
Abstract
We introduce a hierarchical model for efficient placement of computational graphs onto hardware devices, especially in heterogeneous environments with a mixture of CPUs, GPUs, and other computational devices. Our method learns to assign graph operations to groups and to allocate those groups to available devices. The grouping and device allocations are learned jointly. The proposed method is trained with policy gradient and requires no human intervention. Experiments with widely-used computer vision and natural language models show that our algorithm can find optimized, non-trivial placements for TensorFlow computational graphs with over 80,000 operations. In addition, our approach outperforms placements by human experts as well as a previous state-of-the-art placement method based on deep reinforcement learning. Our method achieves runtime reductions of up to 60.6% per training step when applied to models such as Neural Machine Translation.
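The two-level structure can be sketched as follows. This is an illustrative toy, not the paper's system: the simulated runtime model, the logit parameterization, and the learning rate are assumptions; it only shows a grouper distribution over operations and a placer distribution over groups being updated jointly with REINFORCE against negative runtime.

```python
# Joint grouper + placer sketch trained with REINFORCE (assumptions throughout).
import numpy as np

rng = np.random.default_rng(0)
NUM_OPS, NUM_GROUPS, NUM_DEVICES = 12, 4, 2
op_cost = rng.uniform(1.0, 3.0, size=NUM_OPS)          # per-op compute cost

grouper_logits = np.zeros((NUM_OPS, NUM_GROUPS))        # op -> group policy
placer_logits = np.zeros((NUM_GROUPS, NUM_DEVICES))     # group -> device policy

def sample(logits):
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    choices = np.array([rng.choice(logits.shape[1], p=p) for p in probs])
    return choices, probs

def simulated_runtime(op_device):
    """Toy cost model: step time equals the load on the busiest device."""
    return max(op_cost[op_device == d].sum() for d in range(NUM_DEVICES))

baseline, lr = None, 0.5
for _ in range(300):
    groups, group_probs = sample(grouper_logits)
    devices, place_probs = sample(placer_logits)
    runtime = simulated_runtime(devices[groups])
    baseline = runtime if baseline is None else 0.9 * baseline + 0.1 * runtime
    advantage = baseline - runtime                       # faster than average => positive
    # REINFORCE: push up log-probability of sampled choices, scaled by advantage.
    for i, g in enumerate(groups):
        grad = -group_probs[i]; grad[g] += 1.0
        grouper_logits[i] += lr * advantage * grad
    for g, d in enumerate(devices):
        grad = -place_probs[g]; grad[d] += 1.0
        placer_logits[g] += lr * advantage * grad
print("final runtime:", round(simulated_runtime(devices[groups]), 2),
      "vs single device:", round(op_cost.sum(), 2))
```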
Abstract
Building general-purpose conversation agents is a very challenging task, but necessary on the road toward intelligent agents that can interact with humans in natural language. Neural conversation models -- purely data-driven systems trained end-to-end on dialogue corpora -- have shown great promise recently, yet they often produce short and generic responses. This work presents new training and decoding methods that improve the quality, coherence, and diversity of long responses generated using sequence-to-sequence models. Our approach adds self-attention to the decoder to maintain coherence in longer responses, and we propose a practical approach, called the glimpse-model, for scaling to large datasets. We introduce a stochastic beam-search algorithm with segment-by-segment reranking which lets us inject diversity earlier in the generation process. We trained on a combined data set of over 2.3B conversation messages mined from the web. In human evaluation studies, our method produces longer responses overall, with a higher proportion rated as acceptable and excellent as length increases, compared to baseline sequence-to-sequence models with explicit length-promotion. A back-off strategy produces better responses overall, in the full spectrum of lengths.
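A small sketch of segment-by-segment stochastic beam search, separate from the paper's decoder: candidates are sampled rather than chosen greedily, extended one fixed-length segment at a time, and reranked only at segment boundaries, which lets diverse continuations survive longer. The toy scoring function and vocabulary are assumptions standing in for a trained sequence-to-sequence model.

```python
# Stochastic beam search with segment-level reranking (illustrative toy).
import math, random

random.seed(0)
VOCAB = list("abcde")

def token_logprob(prefix, token):
    """Stand-in for a seq2seq decoder's next-token log-probability."""
    r = random.Random(hash((prefix, token)))
    return math.log(r.uniform(0.05, 1.0))

def sample_segment(prefix, segment_len):
    """Sample one segment token-by-token; return (text, accumulated log-prob)."""
    text, logp = prefix, 0.0
    for _ in range(segment_len):
        weights = [math.exp(token_logprob(text, t)) for t in VOCAB]
        tok = random.choices(VOCAB, weights=weights, k=1)[0]
        logp += token_logprob(text, tok)
        text += tok
    return text, logp

def stochastic_beam_search(beam_size, num_segments, segment_len, samples_per_beam):
    beams = [("", 0.0)]
    for _ in range(num_segments):
        candidates = []
        for prefix, score in beams:
            for _ in range(samples_per_beam):         # stochastic expansion
                text, logp = sample_segment(prefix, segment_len)
                candidates.append((text, score + logp))
        # Rerank once per segment, not once per token.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

for text, score in stochastic_beam_search(beam_size=3, num_segments=4,
                                          segment_len=2, samples_per_beam=4):
    print(f"{score:7.2f}  {text}")
```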
Abstract
Neural Machine Translation (NMT) has shown remarkable progress over the past few years, with production systems now being deployed to end-users. One major drawback of current architectures is that they are expensive to train, typically requiring days to weeks of GPU time to converge. This makes exhaustive hyperparameter search, as is commonly done with other neural network architectures, prohibitively expensive. In this work, we present the first large-scale analysis of NMT architecture hyperparameters. We report empirical results and variance numbers for several hundred experimental runs, corresponding to over 250,000 GPU hours, on the standard WMT English-to-German translation task. Our experiments lead to novel insights and practical advice for building and extending NMT architectures. As part of this contribution, we release an open-source NMT framework that enables researchers to easily experiment with novel techniques and reproduce state-of-the-art results.
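To make the scale of such a search concrete, here is a hedged sketch of the kind of sweep the abstract describes. The dimension names and values are illustrative assumptions, not the released framework's configuration keys, and train_and_eval is a placeholder for launching a real training run.

```python
# Hypothetical hyperparameter sweep over NMT architecture choices.
import itertools

search_space = {
    "embedding_dim": [128, 512, 2048],
    "encoder_depth": [1, 2, 4],
    "cell": ["lstm", "gru"],
    "attention": ["additive", "multiplicative"],
    "beam_width": [1, 5, 10],
}

def train_and_eval(config):
    """Placeholder: launch a training run and return a BLEU score."""
    return 0.0  # a real sweep would call into the NMT framework here

results = []
for values in itertools.product(*search_space.values()):
    config = dict(zip(search_space.keys(), values))
    results.append((config, train_and_eval(config)))
print(f"{len(results)} configurations evaluated")
```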