Andrew M. Dai
Andrew is a software engineer at Google. Prior to that, he completed a PhD in machine learning at the University of Edinburgh and an MA in computer science at the University of Cambridge.
Authored Publications
MaMMUT: A Simple Vision-Encoder Text-Decoder Architecture for MultiModal Tasks
Wei Li
Abhijit Ogale
Transactions on Machine Learning Research (2023)
The development of language models has moved from encoder-decoder to decoder-only designs. In addition, conventional wisdom holds that the two most popular multimodal tasks, generative and contrastive, tend to conflict with one another, are hard to accommodate in one architecture, and need complex adaptations for downstream tasks. We propose a novel paradigm of training with a decoder-only model for multimodal tasks, which is surprisingly effective for jointly learning these disparate vision-language tasks. This is done with a simple model, called MaMMUT. It consists of a single vision encoder and a text decoder, and accommodates contrastive and generative learning via a novel two-pass approach on the text decoder. We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes weight-sharing across the tasks. Furthermore, the same architecture enables straightforward extensions to open-vocabulary object detection and video-language tasks. The model tackles a diverse range of tasks while being modest in capacity. It achieves the state of the art on image-text and text-image retrieval, video question answering, and open-vocabulary detection, outperforming much larger and more extensively trained foundation models, and shows very competitive results on VQA and video captioning, especially considering its capacity. Ablations confirm the flexibility and advantages of our approach.
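A minimal sketch of the two-pass idea, assuming a generic PyTorch decoder stack: one pass without a causal mask (and without image cross-attention) produces a text embedding for the contrastive loss, and a second causal pass cross-attends to image features for the generative loss. All class and argument names are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoPassTextDecoder(nn.Module):
    def __init__(self, vocab_size, dim, num_layers=2, num_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, tokens, image_feats, contrastive):
        x = self.embed(tokens)
        if contrastive:
            # Pass 1: no causal mask; cross-attending to zeros is a crude
            # stand-in for skipping cross-attention entirely, so the text
            # embedding carries no image signal.
            h = self.decoder(x, torch.zeros_like(image_feats))
            return F.normalize(h.mean(dim=1), dim=-1)  # contrastive feature
        # Pass 2: causal self-attention plus cross-attention to the image,
        # yielding next-token logits for the captioning loss.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.lm_head(self.decoder(x, image_feats, tgt_mask=mask))
```

Both passes run the same weights, which is what lets one modest decoder serve both objectives.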
Mind's Eye: Grounded Language Model Reasoning through Simulation
Jason Wei
Shixiang Shane Gu
Soroush Vosoughi
ICLR 2023 (2022)
Successful and effective communication between humans and AI relies on a shared experience of the world. By training solely on written text, current language models (LMs) miss the grounded experience of humans in the real world; their failure to relate language to the physical world causes knowledge to be misrepresented and leads to obvious mistakes in reasoning. We present Mind's Eye, a paradigm that grounds language model reasoning in the physical world. Given a physical reasoning question, we use a computational physics engine (DeepMind's MuJoCo) to simulate the possible outcomes, and then include the simulation results as part of the input, enabling the language model to reason over them. Experiments on 39 tasks in a physics alignment benchmark demonstrate that Mind's Eye can improve reasoning ability by a large margin (27.9% zero-shot and 46.0% few-shot absolute accuracy improvement on average). Smaller language models armed with Mind's Eye can obtain performance similar to models 100× larger. Finally, we confirm the robustness of Mind's Eye through ablation studies.
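A minimal sketch of the prompting pattern the abstract describes: simulate first, then hand the simulation outcome to the LM as part of its input. `run_simulation` and the example strings are stand-ins, not the paper's actual pipeline.

```python
def grounded_prompt(question, run_simulation):
    """Prepend a simulator's outcome to the LM prompt, so the model reasons
    over observed physics rather than guessing from text alone."""
    outcome = run_simulation(question)   # e.g. a MuJoCo rollout summary
    return (f"Simulation result: {outcome}\n"
            f"Question: {question}\n"
            f"Answer:")

prompt = grounded_prompt(
    "Two balls of different mass are dropped together. Which lands first?",
    lambda q: "Both objects reach the ground at the same time.",
)
```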
Sparsely Activated Language Models are Efficient In-Context Learners
Barret Richard Zoph
Dmitry (Dima) Lepikhin
Emma Wang
Kun Zhang
Liam B. Fedus
Maarten Paul Bosma
Marie Pellat
Maxim Krikun
Nan Du
Simon Tong
Tao Wang
Toju Duke
Yuanzhong Xu
Zongwei Zhou
(2022)
Scaling language models with more data, compute, and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong performance on few-shot learning. However, training these large dense models requires significant amounts of computing resources. In this paper, we develop a family of sparsely activated mixture-of-experts language models named GLaM (Generalist Language Model), which can have many more parameters but require significantly less training cost than dense models. The largest GLaM has 1.2 trillion parameters, approximately 7x larger than GPT-3, but can be trained more efficiently. Using only 1/3 of the energy consumed to train GPT-3, GLaM achieves better overall performance on 29 zero-shot and one-shot NLP tasks. For example, GLaM reaches 75.0% one-shot exact-match accuracy on the TriviaQA test server, a significant improvement over the 68.0% obtained by GPT-3.
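A minimal sketch of the sparsely activated layer at the heart of such models, assuming top-2 gating over a handful of feed-forward experts. GLaM's production implementation (distributed routing, load balancing) is far more elaborate; this shows only the core idea that each token touches a small fraction of the parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Sparsely activated feed-forward layer: each token is routed to its
    top-2 experts, so per-token compute stays small while total parameter
    count grows with the number of experts."""
    def __init__(self, dim, hidden, num_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, dim))
            for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, x):                        # x: (tokens, dim)
        scores = F.softmax(self.gate(x), dim=-1)
        topv, topi = scores.topk(self.k, dim=-1)
        topv = topv / topv.sum(-1, keepdim=True) # renormalize gate weights
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e        # tokens routed to expert e
                if mask.any():
                    out[mask] += topv[mask, slot].unsqueeze(1) * expert(x[mask])
        return out
```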
Finetuned Language Models are Zero-Shot Learners
Jason Wei
Maarten Paul Bosma
Vincent Zhao
Nan Du
International Conference on Learning Representations (2022)
This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning, i.e., finetuning language models on a collection of tasks described via instructions, substantially boosts zero-shot performance on unseen tasks.
We take a 137B-parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves on the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of the 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of tasks and model scale are key components of the success of instruction tuning.
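A minimal sketch of the data side of instruction tuning: one labeled example rendered through several natural-language templates, each rendering becoming a finetuning example. The templates below are illustrative, not FLAN's actual ones.

```python
# Illustrative NLI instruction templates; mixing many tasks verbalized this
# way is what lets the tuned model generalize to unseen instructions.
templates = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? {answer}",
    "Read the following and decide whether \"{hypothesis}\" follows from "
    "\"{premise}\". Answer yes or no: {answer}",
]

example = {"premise": "A dog is running in the park.",
           "hypothesis": "An animal is outdoors.",
           "answer": "yes"}

for t in templates:
    print(t.format(**example))   # each printed string is one training example
```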
PaLM: Scaling Language Modeling with Pathways
Sharan Narang
Jacob Devlin
Maarten Bosma
Hyung Won Chung
Sebastian Gehrmann
Parker Schuh
Sasha Tsvyashchenko
Abhishek Rao
Yi Tay
Noam Shazeer
Nan Du
Reiner Pope
James Bradbury
Guy Gur-Ari
Toju Duke
Henryk Michalewski
Xavier Garcia
Liam Fedus
David Luan
Barret Zoph
Ryan Sepassi
David Dohan
Shivani Agrawal
Mark Omernick
Marie Pellat
Aitor Lewkowycz
Erica Moreira
Rewon Child
Oleksandr Polozov
Zongwei Zhou
Michele Catasta
Jason Wei
arXiv:2204.02311 (2022)
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion-parameter, densely activated Transformer language model, which we call the Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state of the art on a suite of multi-step reasoning tasks and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis of bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
Evaluation of US State-Based Policy Interventions on Social Distancing Using Aggregated Mobility Data during the COVID-19 Pandemic
Gregory Alexander Wellenius
Swapnil Suresh Vispute
Valeria Espinosa
Thomas Tsai
Jonathan Hennessy
Krishna Kumar Gadepalli
Adam Boulanger
Adam Pearce
Chaitanya Kamath
Arran Schlosberg
Catherine Bendebury
Chinmoy Mandayam
Charlotte Stanton
Shailesh Bavadekar
Christopher David Pluntke
Damien Desfontaines
Benjamin H. Jacobson
Zan Armstrong
Katherine Chou
Andrew Nathaniel Oplinger
Ashish K. Jha
Evgeniy Gabrilovich
Nature Communications (2021)
Training independent subnetworks for robust prediction
Marton Havasi
Rodolphe Jenatton
Stanislav Fort
International Conference on Learning Representations (2021)
Recent approaches to efficiently ensemble neural networks have shown that strong robustness and uncertainty performance can be achieved with a negligible gain in parameters over the original network. However, these methods still require multiple forward passes for prediction, leading to a significant runtime cost. In this work, we show a surprising result: the benefits of using multiple predictions can be achieved 'for free' under a single model's forward pass. In particular, we show that, using a multi-input multi-output (MIMO) configuration, one can utilize a single model's capacity to train multiple subnetworks that independently learn the task at hand. By ensembling the predictions made by the subnetworks, we improve model robustness without increasing compute. We observe a significant improvement in negative log-likelihood, accuracy, and calibration error on CIFAR10, CIFAR100, ImageNet, and their out-of-distribution variants compared to previous methods.
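A minimal sketch of the MIMO configuration, with illustrative sizes: during training each of the M input slots carries an independent example and each head predicts its own label; at test time one example is repeated across all slots and the M predictions are averaged as a single-pass ensemble.

```python
import torch
import torch.nn as nn

class MIMONet(nn.Module):
    """Multi-input multi-output sketch: M subnetworks share one forward
    pass through a common body."""
    def __init__(self, in_dim, num_classes, m=3, width=256):
        super().__init__()
        self.m, self.num_classes = m, num_classes
        self.body = nn.Sequential(
            nn.Linear(in_dim * m, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU())
        self.head = nn.Linear(width, num_classes * m)

    def forward(self, xs):                        # xs: (batch, m, in_dim)
        h = self.body(xs.flatten(start_dim=1))    # concatenated inputs
        return self.head(h).view(-1, self.m, self.num_classes)

    @torch.no_grad()
    def predict(self, x):                         # x: (batch, in_dim)
        xs = x.unsqueeze(1).repeat(1, self.m, 1)  # same input in every slot
        return self.forward(xs).softmax(-1).mean(dim=1)  # 'free' ensemble
```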
Abstract
The paradigm of 'pretraining' on a set of relevant auxiliary tasks and then 'finetuning' on a target task has been successfully applied in many different domains. However, when the auxiliary tasks are abundant, with complex relationships to the target task, using domain knowledge or searching over all possible pretraining setups are inefficient strategies. To address this challenge, we propose a method to automatically select, from a large set of auxiliary tasks, those which yield a representation most useful to the target task. In particular, we develop an efficient algorithm that uses automatic auxiliary task selection within a nested-loop meta-learning process. We apply this algorithm to the task of clinical outcome prediction in electronic medical records, learning from a large number of self-supervised tasks related to forecasting patient trajectories. Experiments on a real clinical dataset demonstrate the superior predictive performance of our method compared to direct supervised learning, naive pretraining, and multitask learning, in particular in low-data scenarios when the primary task has very few examples. With detailed ablation analysis, we further show that the selection rules are interpretable and able to generalize to unseen target tasks with new data.
Predicting inpatient medication orders from electronic health record data
Kathryn Rough
Kun Zhang
Atul J. Butte
Alvin Rajkomar
Clinical Pharmacology and Therapeutics (2020)
In a general inpatient population, we predicted patient-specific medication orders based on structured information in the electronic health record (EHR). Data on over three million medication orders from an academic medical center were used to train two machine-learning models: a deep learning sequence model and a logistic regression model. Both were compared with a baseline that ranked the most frequently ordered medications based on a patient's discharge hospital service and amount of time since admission. Models were trained to predict from 990 possible medications at the time of order entry. Fifty-five percent of medications ordered by physicians were ranked in the sequence model's top-10 predictions (logistic model: 49%) and 75% were ranked in the top-25 (logistic model: 69%). Ninety-three percent of the sequence model's top-10 prediction sets contained at least one medication that physicians ordered within the next day. These findings demonstrate that medication orders can be predicted from information present in the EHR.
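A small sketch of the evaluation style the abstract reports: how often the physician's actual order appears in the model's top-k ranked list. The scores below are random stand-ins, not model outputs.

```python
import numpy as np

def top_k_recall(scores, ordered, k):
    """Fraction of truly ordered medications ranked in the model's top-k.
    scores: (n_orders, n_meds) model scores; ordered: index of the
    medication the physician actually ordered."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean([o in row for o, row in zip(ordered, topk)]))

rng = np.random.default_rng(0)
scores = rng.normal(size=(100, 990))      # 990 candidate medications
ordered = rng.integers(0, 990, size=100)  # physician's actual orders
print(top_k_recall(scores, ordered, k=10))
```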
Learning the Graphical Structure of Electronic Health Records with Graph Convolutional Transformer
Edward Choi
Zhen Xu
Yujia Li
Gerardo Flores
Association for the Advancement of Artificial Intelligence (AAAI) (2020)
Effective modeling of electronic health records (EHR) is rapidly becoming an important topic in both academia and industry. A recent study showed that using the graphical structure underlying EHR data (e.g. the relationship between diagnoses and treatments) improves the performance of prediction tasks such as heart failure prediction. However, EHR data do not always contain complete structure information. Moreover, when it comes to claims data, structure information is completely unavailable to begin with. Under such circumstances, can we still do better than just treating EHR data as a flat-structured bag-of-features? In this paper, we study the possibility of jointly learning the hidden structure of EHR while performing supervised prediction tasks on EHR data. Specifically, we argue that the Transformer is a suitable base model to learn the hidden EHR structure, and propose the Graph Convolutional Transformer, which uses data statistics to guide the structure learning process. The proposed model consistently outperformed previous approaches empirically, on both synthetic data and publicly available EHR data, for various prediction tasks such as graph reconstruction and readmission prediction, indicating that it can serve as an effective general-purpose representation learning algorithm for EHR data.
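A minimal sketch of the guiding idea, assuming the data statistic is a conditional co-occurrence matrix between EHR features (e.g. P(treatment | diagnosis)): the prior biases the attention logits so structure learning starts from plausible connections. The exact way the paper injects the prior differs; this conveys only the flavor.

```python
import numpy as np

def guided_attention(q, k, prior, eps=1e-9):
    """Self-attention whose logits are biased by a conditional-probability
    prior over feature pairs. q, k: (n, d) feature projections;
    prior: (n, n) co-occurrence statistics."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + np.log(prior + eps)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)   # rows are attention weights
```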
Abstract
Capturing the inter-dependencies among multiple types of clinically critical events is essential not only for accurate future event prediction, but also for better treatment planning. In this work, we propose a deep latent state-space generative model to capture the interactions among different types of correlated clinical events (e.g., kidney failure, mortality) by explicitly modeling the temporal dynamics of patients' latent states. Based on these learned patient states, we further develop a new general discrete-time formulation of the hazard rate function to estimate the survival distribution of patients with significantly improved accuracy. Extensive evaluations over real EMR data show that our proposed model compares favorably to various state-of-the-art baselines. Furthermore, our method also uncovers meaningful insights about the latent correlations among mortality and different types of organ failures.
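The discrete-time hazard formulation has a compact generic form: if h_t is the probability the event occurs at step t given survival so far, the survival curve is the running product of (1 - h_t). A minimal sketch of that relationship (not the paper's full model):

```python
import numpy as np

def survival_from_hazards(hazards):
    """Discrete-time survival: S(t) = prod_{s<=t} (1 - h_s), where h_s is
    the conditional probability of the event at step s."""
    return np.cumprod(1.0 - np.asarray(hazards))

print(survival_from_hazards([0.01, 0.02, 0.05, 0.10]))
```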
Analyzing the Role of Model Uncertainty for Electronic Health Records
Edward Choi
Jeremy Nixon
Ghassen Jerfel
ACM Conference on Health, Inference, and Learning (ACM CHIL) (2020)
In medicine, both ethical and monetary costs of incorrect predictions can be significant, and the complexity of the problems often necessitates increasingly complex models. Recent work has shown that changing just the random seed is enough for otherwise well-tuned deep neural networks to vary in their individual predicted probabilities. In light of this, we investigate the role of model uncertainty methods in the medical domain. Using RNN ensembles and various Bayesian RNNs, we show that population-level metrics, such as AUC-PR, AUC-ROC, log-likelihood, and calibration error, do not capture model uncertainty. Meanwhile, the presence of significant variability in patient-specific predictions and optimal decisions motivates the need for capturing model uncertainty. Understanding the uncertainty for individual patients is an area with clear clinical impact, such as determining when a model decision is likely to be brittle. We further show that RNNs with only Bayesian embeddings can be a more efficient way to capture model uncertainty compared to ensembles, and we analyze how model uncertainty is impacted across individual input features and patient subgroups.
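A minimal sketch of the per-patient view of model uncertainty: population metrics average over patients, whereas the spread of an ensemble's (or a Bayesian model's sampled) predictions can be examined patient by patient. `models` is a stand-in for any collection of trained predictors returning per-patient logits.

```python
import torch

@torch.no_grad()
def prediction_spread(models, x):
    """Return the mean and standard deviation of each patient's predicted
    probability across ensemble members; a large spread flags patients for
    whom the model's decision is brittle even if aggregate metrics agree."""
    probs = torch.stack([m(x).sigmoid() for m in models])  # (M, batch)
    return probs.mean(dim=0), probs.std(dim=0)
```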
Learnability and Complexity of Quantum Samples
Li Li
Augustus Odena
Zhengli Zhao
Vadim Smelyanskiy
arXiv (2020)
Given a quantum circuit, a quantum computer can sample the output distribution exponentially faster in the number of bits than classical computers. A similar exponential separation has yet to be established in generative models through quantum sample learning: given samples from an n-qubit computation, can we learn the underlying quantum distribution using models with training parameters that scale polynomially in n under a fixed training time? We study four kinds of generative models: the deep Boltzmann machine (DBM), generative adversarial networks (GANs), long short-term memory (LSTM), and autoregressive GAN, on learning quantum datasets generated by deep random circuits. We demonstrate the leading performance of the LSTM in learning quantum samples, and thus the autoregressive structure present in the underlying quantum distribution from random quantum circuits. Both numerical experiments and a theoretical proof in the case of the DBM show exponentially growing complexity of learning-agent parameters required for achieving a fixed accuracy as n increases. Finally, we establish a connection between learnability and the complexity of generative models by benchmarking learnability against different sets of samples drawn from probability distributions of varying degrees of complexity in their quantum and classical representations.
Google COVID-19 Search Trends Symptoms Dataset: Anonymization Process Description
Akim Kumok
Chaitanya Kamath
Charlotte Stanton
Damien Desfontaines
Evgeniy Gabrilovich
Gerardo Flores
Gregory Alexander Wellenius
Ilya Eckstein
John S. Davis
Katie Everett
Krishna Kumar Gadepalli
Rayman Huang
Shailesh Bavadekar
Thomas Ludwig Roessler
Venky Ramachandran
Yael Mayer
arXiv (2020)
This report describes the aggregation and anonymization process applied to the initial version of the COVID-19 Search Trends symptoms dataset, a publicly available dataset that shows aggregated, anonymized trends in Google searches for symptoms (and some related topics). The anonymization process is designed to protect the daily search activity of every user with ε-differential privacy for ε = 1.68.
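For intuition about what ε = 1.68 means, here is the textbook Laplace mechanism for an ε-differentially-private count query; the dataset's actual per-user aggregation pipeline is considerably more involved than this sketch.

```python
import numpy as np

def laplace_count(true_count, sensitivity=1.0, epsilon=1.68):
    """Release a count with epsilon-DP by adding Laplace noise scaled to
    sensitivity/epsilon; smaller epsilon means more noise and more privacy."""
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

print(laplace_count(1200))
```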
Natural Questions: a Benchmark for Question Answering Research
Olivia Redfield
Danielle Epstein
Illia Polosukhin
Matthew Kelcey
Jacob Devlin
Llion Jones
Ming-Wei Chang
Jakob Uszkoreit
Transactions of the Association for Computational Linguistics (2019)
We present the Natural Questions corpus, a question answering dataset. Questions consist of real, anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations, 7,830 examples with 5-way annotations for development data, and a further 7,842 examples 5-way annotated and sequestered as test data. We present experiments validating the quality of the data. We also describe analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.
Abstract
Clinical notes in electronic health records contain highly heterogeneous writing styles, including non-standard terminology or abbreviations. Using these notes in predictive modeling has traditionally required preprocessing (e.g. taking frequent terms or topic modeling) that removes much of the richness of the source data. We propose a pretrained hierarchical recurrent neural network model that parses minimally processed clinical notes in an intuitive fashion, and show that it improves performance for discharge diagnosis classification tasks on the Medical Information Mart for Intensive Care III (MIMIC-III) dataset, compared to models that treat the notes as an unordered collection of terms or that conduct no pretraining. We also apply an attribution technique to examples to identify the words that the model uses to make its prediction, and show the importance of the words' nearby context.
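A minimal sketch of a hierarchical encoder over one note, assuming GRUs at both levels: a word-level RNN summarizes each sentence and a sentence-level RNN summarizes the note. All dimensions and the classification head are illustrative.

```python
import torch
import torch.nn as nn

class HierarchicalNotesEncoder(nn.Module):
    def __init__(self, vocab, dim=128, num_classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.word_rnn = nn.GRU(dim, dim, batch_first=True)
        self.sent_rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, num_classes)

    def forward(self, note):                  # note: (sents, words) token ids
        # Word level: treat sentences as a batch; final hidden state of each
        # sentence is its summary vector, shape (1, sents, dim).
        sent_vecs = self.word_rnn(self.embed(note))[1]
        # Sentence level: read the sentence vectors as one sequence
        # (batch of 1) and keep the final note-level state.
        _, note_vec = self.sent_rnn(sent_vecs)
        return self.out(note_vec.squeeze())   # diagnosis logits
```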
Gmail Smart Compose: Real-Time Assisted Writing
Jackie Tsay
Justin Lu
Shuyuan Zhang
Tim Sohn
Yinan Wang
KDD 2019 (2019)
In this paper, we present Smart Compose, a novel system for generating interactive, real-time suggestions in Gmail that assists users in writing emails by reducing repetitive typing. In the design and deployment of such a large-scale and complex system, we faced several challenges, including model selection, performance evaluation, serving, and other practical issues. At the core of Smart Compose is a large-scale neural language model. We leveraged state-of-the-art machine learning techniques for language model training, which enabled high-quality suggestion prediction, and constructed novel serving infrastructure for high-throughput and real-time inference. Experimental results show the effectiveness of our proposed system design and deployment approach. This system is currently being served in Gmail.
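A minimal sketch of the inference loop an assisted-writing system needs: greedily extend the user's prefix with a language model and only surface the suggestion while the model stays confident. `lm` is a stand-in for any next-token distribution, not Smart Compose's actual serving stack.

```python
def suggest(lm, prefix_ids, max_len=10, min_prob=0.6):
    """Greedy completion with a confidence cutoff: a bad suggestion costs
    the user more than no suggestion, so stop when certainty drops."""
    out = []
    for _ in range(max_len):
        probs = lm(prefix_ids + out)     # distribution over the next token
        token, p = max(enumerate(probs), key=lambda kv: kv[1])
        if p < min_prob:
            break
        out.append(token)
    return out
```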
Music Transformer: Generating Music with Long-Term Structure
Ashish Vaswani
Jakob Uszkoreit
Noam Shazeer
Curtis Hawthorne
Monica Dinculescu
ICLR (2019)
Music relies heavily on repetition to build structure and meaning. Self-reference occurs on multiple timescales, from motifs to phrases to the reuse of entire sections of music, such as in pieces with ABA structure. The Transformer (Vaswani et al., 2017), a sequence model based on self-attention, has achieved compelling results in many generation tasks that require maintaining long-range coherence. This suggests that self-attention might also be well-suited to modeling music. In musical composition and performance, however, relative timing is critically important. Existing approaches for representing relative positional information in the Transformer modulate attention based on pairwise distance (Shaw et al., 2018). This is impractical for long sequences such as musical compositions, since their memory complexity for intermediate relative information is quadratic in the sequence length. We propose an algorithm that reduces their intermediate memory requirement to linear in the sequence length. This enables us to demonstrate that a Transformer with our modified relative attention mechanism can generate minute-long compositions (thousands of steps, four times the length modeled in Oore et al., 2018) with compelling structure, generate continuations that coherently elaborate on a given motif, and in a seq2seq setup generate accompaniments conditioned on melodies. We evaluate the Transformer with our relative attention mechanism on two datasets, JSB Chorales and Piano-e-Competition, and obtain state-of-the-art results on the latter.
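The memory reduction rests on a "skewing" trick: compute the query-by-relative-embedding product once and rearrange it, instead of materializing a per-pair (L, L, d) tensor. A minimal numpy sketch for the causal case, with an indexing convention of my choosing (entry (i, j) ends up holding q_i · E[j - i + L - 1]; only positions j ≤ i are used under the causal mask):

```python
import numpy as np

def relative_logits(q, rel_emb):
    """q: (L, d) queries; rel_emb: (L, d), where row r embeds relative
    position r - (L - 1), so the last row is distance zero. Returns the
    (L, L) relative-position attention logits without an (L, L, d) buffer."""
    L = q.shape[0]
    qe = q @ rel_emb.T                      # (L, L), one matmul
    padded = np.pad(qe, ((0, 0), (1, 0)))   # prepend a zero column
    return padded.reshape(L + 1, L)[1:]     # reshape + slice = skew
```

The pad/reshape/slice shifts row i left by i positions, which is exactly the realignment the relative indexing requires.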
Abstract
Data are often labelled by many different experts with each expert only labeling a small fraction of the data and each data point being labelled by several experts. This reduces the workload on individual experts and also gives a better estimate of the unobserved ground truth. When experts disagree, the standard approaches are to treat the majority opinion as the correct label or to model the correct label as a distribution. These approaches, however, do not make any use of potentially valuable information about which expert produced which label. To make use of this extra information, we propose modeling the experts individually and then learning mixing proportions for combining them in sample-specific ways. This allows us to give more weight to more reliable experts and makes it possible to take advantage of the unique strengths of individual experts at classifying certain types of data. Here we show that our approach leads to improved computer-aided diagnosis of diabetic retinopathy, where the experts are human doctors and the data are retinal images. We compare our method against those of Welinder and Perona, and Mnih and Hinton. Our work offers an innovative approach for dealing with the myriad real-world settings that lack ground truth labels.
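A minimal sketch of the modeling idea: one prediction head per labeler plus a gating network that produces sample-specific mixing weights, so experts who are reliable on a given kind of input dominate its combined prediction. Names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class ExpertMixture(nn.Module):
    def __init__(self, feat_dim, num_experts, num_classes):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Linear(feat_dim, num_classes) for _ in range(num_experts))
        self.gate = nn.Linear(feat_dim, num_experts)

    def forward(self, feats):                      # feats: (batch, feat_dim)
        # Per-expert class distributions, shape (batch, experts, classes).
        preds = torch.stack([e(feats).softmax(-1) for e in self.experts], 1)
        # Sample-specific mixing weights over experts.
        w = self.gate(feats).softmax(-1)           # (batch, experts)
        return (w.unsqueeze(-1) * preds).sum(dim=1)
```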
AirDialogue: An Environment for Goal-Oriented Dialogue Research
Wei Wei
Jia Li
Empirical Methods in Natural Language Processing (EMNLP) (2018)
Recent progress in dialogue generation has inspired a number of studies on dialogue systems that are capable of accomplishing tasks through natural language interactions. A promising direction among these studies is the use of reinforcement learning techniques, such as self-play, for training dialogue agents. However, current datasets are limited in size, and the environment for training agents and evaluating progress is relatively unsophisticated. We present AirDialogue, a large dataset that contains 402,038 goal-oriented conversations. To collect this dataset, we create a context generator which provides travel and flight restrictions. We then ask human annotators to play the role of a customer or an agent and interact with the goal of successfully booking a trip given the restrictions. Key to our environment is the ease of evaluating the success of the dialogue, which is achieved by using ground-truth states (e.g., the flight being booked) generated by the restrictions. Any dialogue agent that does not generate the correct states is considered to fail. Our experimental results indicate that state-of-the-art dialogue models can only achieve a scaled score of 0.22 and an exact match score of 0.1 on the test dataset, while humans reach 0.94 and 0.93 respectively, which suggests significant opportunities for future improvement.
Abstract
Natural language text exhibits implicit hierarchical structure in a variety of respects. Ideally we could incorporate our prior knowledge of the existence of some sort of hierarchy into unsupervised learning algorithms that work on text data. Recent work by Nickel and Kiela (2017) proposed using hyperbolic instead of Euclidean embedding spaces to represent hierarchical data and demonstrated encouraging results on supervised embedding tasks. In this work, we apply their approach to unsupervised learning of word and sentence embeddings. Although we obtain mildly positive results, we describe the challenges we faced in using the hyperbolic metric for these problems, both in terms of improving performance in downstream tasks and in understanding the learned hierarchical structures.
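The hyperbolic geometry enters through the distance function. In the Poincaré ball model (all points of norm < 1), d(u, v) = arcosh(1 + 2‖u−v‖² / ((1−‖u‖²)(1−‖v‖²))); hierarchies embed naturally because volume grows exponentially with radius. A minimal sketch:

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Distance between two points inside the unit (Poincare) ball."""
    uu, vv = np.sum(u * u), np.sum(v * v)
    duv = np.sum((u - v) ** 2)
    x = 1.0 + 2.0 * duv / ((1.0 - uu) * (1.0 - vv) + eps)
    return np.arccosh(x)

print(poincare_distance(np.array([0.1, 0.0]), np.array([0.0, 0.9])))
```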
Abstract
Despite recent advances in training recurrent neural networks (RNNs), capturing long-term dependencies in sequences remains a fundamental challenge. Most approaches use backpropagation through time (BPTT), which is difficult to scale to very long sequences. This paper proposes a simple method that improves the ability to capture long-term dependencies in RNNs by adding an unsupervised auxiliary loss to the original objective. This auxiliary loss forces RNNs to either reconstruct previous events or predict next events in a sequence, making truncated backpropagation feasible for long sequences and also improving full BPTT. We evaluate our method on a variety of settings, including pixel-by-pixel image classification with sequence lengths up to 16,000, and a real document classification benchmark. Our results highlight the good performance and resource efficiency of this approach over competitive baselines, including other recurrent models and a comparably sized Transformer. Further analyses reveal beneficial effects of the auxiliary loss on optimization and regularization, as well as extreme cases where there is little to no backpropagation.
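A minimal sketch of the auxiliary-loss idea in a simplified per-step form: alongside the main classifier, each step must also predict the token seen `lag` steps earlier, which pressures the hidden state to carry long-range information. (The paper instead reconstructs randomly sampled segments with a decoder RNN; this variant only conveys the mechanism.)

```python
import torch
import torch.nn as nn

class AuxLossRNN(nn.Module):
    def __init__(self, vocab, dim=128, num_classes=2, lag=50):
        super().__init__()
        self.lag = lag
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.classify = nn.Linear(dim, num_classes)
        self.recall = nn.Linear(dim, vocab)

    def forward(self, tokens):                     # tokens: (B, T)
        out, _ = self.rnn(self.embed(tokens))      # (B, T, dim)
        main_logits = self.classify(out[:, -1])    # supervised objective
        # Unsupervised auxiliary objective: at step t, predict token t - lag.
        aux_logits = self.recall(out[:, self.lag:])
        aux_target = tokens[:, :-self.lag]
        return main_logits, aux_logits, aux_target
```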
Scalable and accurate deep learning for electronic health records
Alvin Rishi Rajkomar
Eyal Oren
Nissan Hajaj
Mila Hardt
Xiaobing Liu
Jake Marcus
Patrik Per Sundberg
Kun Zhang
Yi Zhang
Gerardo Flores
Gavin Duggan
Jamie Irvine
Kurt Litsch
Alex Mossin
Justin Jesada Tansuwan
De Wang
Dana Ludwig
Samuel Volchenboum
Kat Chou
Michael Pearson
Srinivasan Madabushi
Nigam Shah
Atul Butte
npj Digital Medicine (2018)
Predictive modeling with electronic health record (EHR) data is anticipated to drive personalized medicine and improve healthcare quality. Constructing predictive statistical models typically requires extraction of curated predictor variables from normalized EHR data, a labor-intensive process that discards the vast majority of information in each patient's record. We propose a representation of patients' entire raw EHR records based on the Fast Healthcare Interoperability Resources (FHIR) format. We demonstrate that deep learning methods using this representation are capable of accurately predicting multiple medical events from multiple centers without site-specific data harmonization. We validated our approach using de-identified EHR data from two U.S. academic medical centers with 216,221 adult patients hospitalized for at least 24 hours. In the sequential format we propose, this volume of EHR data unrolled into a total of 46,864,534,945 data points, including clinical notes. Deep learning models achieved high accuracy for tasks such as predicting in-hospital mortality (AUROC across sites 0.93-0.94), 30-day unplanned readmission (AUROC 0.75-0.76), prolonged length of stay (AUROC 0.85-0.86), and all of a patient's final discharge diagnoses (frequency-weighted AUROC 0.90). These models outperformed state-of-the-art traditional predictive models in all cases. We also present a case study of a neural-network attribution system, which illustrates how clinicians can gain some transparency into the predictions. We believe that this approach can be used to create accurate and scalable predictions for a variety of clinical scenarios, complete with explanations that directly highlight evidence in the patient's chart.
Many Paths to Equilibrium: GANs do not need to decrease a divergence at every step
William Fedus
Mihaela Rosca
Shakir Mohamed
Ian Goodfellow
ICLR (2018)
Generative adversarial networks (GANs) are a family of generative models that do not minimize a single training criterion. Unlike other generative models, the data distribution is learned via a game between a generator (the generative model) and a discriminator (a teacher providing training signal) that each minimize their own cost. GANs are designed to reach a Nash equilibrium at which each player cannot reduce their cost without changing the other players' parameters. One useful approach for the theory of GANs is to show that a divergence between the training distribution and the model distribution obtains its minimum value at equilibrium. Several recent research directions have been motivated by the idea that this divergence is the primary guide for the learning process and that every step of learning should decrease the divergence. We show that this view is overly restrictive. During GAN training, the discriminator provides learning signal in situations where the gradients of the divergences between distributions would not be useful. We provide empirical counterexamples to the view of GAN training as divergence minimization. Specifically, we demonstrate that GANs are able to learn distributions in situations where the divergence minimization point of view predicts they would fail. We also show that gradient penalties motivated from the divergence minimization perspective are equally helpful when applied in other contexts in which the divergence minimization perspective does not predict they would be helpful. This contributes to a growing body of evidence that GAN training may be more usefully viewed as approaching Nash equilibria via trajectories that do not necessarily minimize a specific divergence at each step.
Peptide-Spectra Matching with Weak Supervision
Sam Schoenholz
Sean Hackett
Laura Deming
Eugene Melamud
Navdeep Jaitly
Fiona McAllister
Jonathon O'Brien
Bryson Bennett
Daphne Koller
arXiv (2018)
As in many other scientific domains, we face a fundamental problem when using machine learning to identify proteins from mass spectrometry data: large ground truth datasets mapping inputs to correct outputs are extremely difficult to obtain. Instead, we have access to imperfect hand-coded models crafted by domain experts. In this paper, we apply deep neural networks to an important step of the protein identification problem, the pairing of mass spectra with short sequences of amino acids called peptides. We train our model to differentiate between top scoring results from a state-of-the-art classical system and hard-negative second and third place results. Our resulting model is much better at identifying peptides with spectra than the model used to generate its training data. In particular, we achieve a 43% improvement over standard matching methods and a 10% improvement over a combination of the matching method and an industry standard cross-spectra reranking tool. Importantly, in a more difficult experimental regime that reflects current challenges facing biologists, our advantage over the previous state-of-the-art grows to 15% even after reranking. We believe this approach will generalize to other challenging scientific problems.
Abstract
Recurrent neural networks (RNNs) are a common method of generating text token by token. These models are typically trained via maximum likelihood (known in this context as teacher forcing). However, this approach frequently suffers from problems when a trained model is used to generate new text, since when generating words later in the sequence the model often conditions on a sequence of words that was never observed at training time. We explore methods for using generative adversarial networks (GANs) as an alternative to teacher forcing for generating discrete sequences. In particular, we consider a conditional GAN that fills in missing text conditioned on the surrounding context. We show qualitative and quantitative evidence that this produces more realistic text samples compared to a maximum likelihood trained model. We also propose a new task that quantitatively measures the quality of RNN-produced samples.
Abstract
Adversarial training provides a means of regularizing supervised learning algorithms, while virtual adversarial training is able to extend supervised learning algorithms to the semi-supervised setting. However, both methods require making small perturbations to numerous entries of the input vector, which is inappropriate for sparse high-dimensional inputs such as one-hot word representations. We extend adversarial and virtual adversarial training to the text domain by applying perturbations to the word embeddings in a recurrent neural network rather than to the original input itself. The proposed method achieves state-of-the-art results on multiple benchmark semi-supervised and purely supervised tasks. We provide visualizations and analysis showing that the learned word embeddings have improved in quality and that, during training, the model is less prone to overfitting.
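A minimal sketch of the embedding-space perturbation: take the gradient of the loss with respect to the (continuous) word embeddings, move by a bounded step in that direction, and train on the perturbed batch. `model` is a stand-in mapping embeddings to logits; shapes assume (batch, seq, dim) embeddings.

```python
import torch
import torch.nn.functional as F

def adversarial_embedding_loss(model, embeds, labels, epsilon=1.0):
    """Loss on an adversarially perturbed batch of word embeddings."""
    embeds = embeds.detach().requires_grad_(True)
    loss = F.cross_entropy(model(embeds), labels)
    grad, = torch.autograd.grad(loss, embeds)
    # Worst-case perturbation of bounded L2 norm, per example.
    norm = grad.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1, 1)
    r_adv = epsilon * grad / norm
    return F.cross_entropy(model(embeds + r_adv.detach()), labels)
```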
Abstract
Data are often labeled by many different experts with each expert only labeling a small fraction of the data and each data point being labeled by several experts. This reduces the workload on individual experts and also gives a better estimate of the unobserved ground truth. When experts disagree, the standard approaches are to treat the majority opinion as the correct label or to model the correct label as a distribution. These approaches, however, do not make any use of potentially valuable information about which expert produced which label. To make use of this extra information, we propose modeling the experts individually and then learning averaging weights for combining them, possibly in sample-specific ways. This allows us to give more weight to more reliable experts and take advantage of the unique strengths of individual experts at classifying certain types of data. Here we show that our approach leads to improvements in computer-aided diagnosis of diabetic retinopathy. We also show that our method performs better than competing algorithms by Welinder and Perona, and by Mnih and Hinton. Our work offers an innovative approach for dealing with the myriad real-world settings that use expert opinions to define labels for training.
Abstract
This work explores hypernetworks: an approach of using one network, also known as a hypernetwork, to generate the weights for another network. Hypernetworks provide an abstraction that is similar to what is found in nature: the relationship between a genotype (the hypernetwork) and a phenotype (the main network). Though they are also reminiscent of HyperNEAT in evolution, our hypernetworks are trained end-to-end with backpropagation and thus are usually faster. The focus of this work is to make hypernetworks useful for deep convolutional networks and long recurrent networks, where hypernetworks can be viewed as a relaxed form of weight-sharing across layers. Our main result is that hypernetworks can generate non-shared weights for LSTMs and achieve near state-of-the-art results on a variety of sequence modelling tasks, including character-level language modelling, handwriting generation and neural machine translation, challenging the weight-sharing paradigm for recurrent networks. Our results also show that hypernetworks applied to convolutional networks still achieve respectable results for image recognition tasks compared to state-of-the-art baseline models while requiring fewer learnable parameters.
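A minimal sketch of the genotype/phenotype split for a single linear layer: a small generator maps a learned layer embedding z to the layer's weights, so many layers could share the generator while receiving non-identical weights. Sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class HyperLinear(nn.Module):
    def __init__(self, in_dim, out_dim, z_dim=8):
        super().__init__()
        self.z = nn.Parameter(torch.randn(z_dim))   # layer embedding
        self.make_w = nn.Linear(z_dim, in_dim * out_dim)
        self.make_b = nn.Linear(z_dim, out_dim)
        self.in_dim, self.out_dim = in_dim, out_dim

    def forward(self, x):
        # The hypernetwork generates this layer's weights on the fly.
        w = self.make_w(self.z).view(self.out_dim, self.in_dim)
        return x @ w.T + self.make_b(self.z)
```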
Abstract
The standard recurrent neural network language model (RNNLM) generates sentences one word at a time and does not work from an explicit global sentence representation. In this work, we introduce and study an RNN-based variational autoencoder generative model that incorporates distributed latent representations of entire sentences. This factorization allows it to explicitly model holistic properties of sentences such as style, topic, and high-level syntactic features. Samples from the prior over these sentence representations remarkably produce diverse and well-formed sentences through simple deterministic decoding. By examining paths through this latent space, we are able to generate coherent novel sentences that interpolate between known sentences. We present techniques for solving the difficult learning problem presented by this model, demonstrate its effectiveness in imputing missing words, explore many interesting properties of the model's latent sentence space, and present negative results on the use of the model in language modeling.
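A minimal sketch of the model family: an encoder RNN produces a Gaussian posterior over a global sentence code z, the decoder RNN conditions on z, and training adds the KL term of the variational bound. Interpolating between two codes and decoding is what yields the coherent in-between sentences. All sizes are illustrative.

```python
import torch
import torch.nn as nn

class SentenceVAE(nn.Module):
    def __init__(self, vocab, dim=256, z_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.enc = nn.GRU(dim, dim, batch_first=True)
        self.to_mu = nn.Linear(dim, z_dim)
        self.to_logvar = nn.Linear(dim, z_dim)
        self.z_to_h = nn.Linear(z_dim, dim)
        self.dec = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens):
        _, h = self.enc(self.embed(tokens))               # (1, B, dim)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparam.
        h0 = torch.tanh(self.z_to_h(z)).unsqueeze(0)      # decoder init
        # Teacher-forced reconstruction (inputs would be shifted right in a
        # full implementation).
        dec_out, _ = self.dec(self.embed(tokens), h0)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return self.out(dec_out), kl
```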
Abstract
This work explores hypernetworks: an approach of using one network, also known as a hypernetwork, to generate the weights for another network. We apply hypernetworks to generate adaptive weights for recurrent networks. In this case, hypernetworks can be viewed as a relaxed form of weight-sharing across layers. In our implementation, hypernetworks are trained jointly with the main network in an end-to-end fashion. Our main result is that hypernetworks can generate non-shared weights for LSTMs and achieve state-of-the-art results on a variety of sequence modelling tasks, including character-level language modelling, handwriting generation and neural machine translation, challenging the weight-sharing paradigm for recurrent networks.
Abstract
We present two approaches that use unlabeled data to improve sequence learning with recurrent networks. The first approach is to predict what comes next in a sequence, which is a conventional language model in natural language processing. The second approach is to use a sequence autoencoder, which reads the input sequence into a vector and predicts the input sequence again. These two algorithms can be used as a “pretraining” step for a later supervised sequence learning algorithm. In other words, the parameters obtained from the unsupervised step can be used as a starting point for other supervised training models. In our experiments, we find that long short-term memory recurrent networks, after being pretrained with the two approaches, are more stable and generalize better. With pretraining, we are able to train long short-term memory recurrent networks for up to a few hundred timesteps, thereby achieving strong performance in many text classification tasks, such as IMDB, DBpedia and 20 Newsgroups.
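A minimal sketch of the sequence-autoencoder recipe: pretrain an LSTM to read the input into a vector and reproduce the input, then reuse the pretrained encoder weights to initialize a supervised classifier's LSTM. Names are illustrative.

```python
import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    def __init__(self, vocab, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens):
        _, state = self.encoder(self.embed(tokens))   # read into a vector
        dec, _ = self.decoder(self.embed(tokens), state)
        return self.out(dec)                          # predict the input again

def init_classifier_from(autoencoder, classifier_lstm):
    # The unsupervised weights become the supervised starting point
    # (assumes the classifier's LSTM has matching shapes).
    classifier_lstm.load_state_dict(autoencoder.encoder.state_dict())
```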
Abstract
Paragraph Vectors has recently been proposed as an unsupervised method for learning distributed representations for pieces of text. In that work, the authors showed that the method can learn an embedding of movie review texts which can be leveraged for sentiment analysis. That proof of concept, while encouraging, was rather narrow. Here we consider tasks other than sentiment analysis, provide a more thorough comparison of Paragraph Vectors to other document modelling algorithms such as Latent Dirichlet Allocation, and evaluate performance of the method as we vary the dimensionality of the learned representation. We benchmarked the models on two document similarity datasets, one from Wikipedia and one from arXiv. We observe that the Paragraph Vector method performs significantly better than other methods, and propose a simple improvement to enhance embedding quality. Somewhat surprisingly, we also show that, much like word embeddings, vector operations on Paragraph Vectors can produce useful semantic results.
Abstract
Translating compounds is an important problem in machine translation. Since many compounds have not been observed during training, they pose a challenge for translation systems. Previous decompounding methods have often been restricted to a small set of languages, as they cannot deal with more complex compound forming processes. We present a novel and unsupervised method to learn the compound parts and morphological operations needed to split compounds into their compound parts. The method uses a bilingual corpus to learn the morphological operations required to split a compound into its parts. Furthermore, monolingual corpora are used to learn and filter the set of compound part candidates. We evaluate our method within a machine translation task and show significant improvements for various languages to show the versatility of the approach.