Andrew M. Dai

Andrew is a software engineer at Google. Prior to that, he completed a PhD in machine learning at the University of Edinburgh and an MA in computer science at the University of Cambridge.
Authored Publications
    Abstract: The development of language models has moved from encoder-decoder to decoder-only designs. In addition, common knowledge has it that the two most popular multimodal tasks, the generative and contrastive tasks, tend to conflict with one another, are hard to accommodate in one architecture, and further need complex adaptations for downstream tasks. We propose a novel paradigm of training with a decoder-only model for multimodal tasks, which is surprisingly effective in jointly learning these disparate vision-language tasks. This is done with a simple model, called MaMMUT. It consists of a single vision encoder and a text decoder, and is able to accommodate contrastive and generative learning by a novel two-pass approach on the text decoder. We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks. Furthermore, the same architecture enables straightforward extensions to open-vocabulary object detection and video-language tasks. The model tackles a diverse range of tasks, while being modest in capacity. Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models. It shows very competitive results on VQA and Video Captioning, especially considering its capacity. Ablations confirm the flexibility and advantages of our approach.
    Sparsely Activated Language Models are Efficient In-Context Learners
    Barret Richard Zoph
    Dmitry (Dima) Lepikhin
    Emma Wang
    Kun Zhang
    Liam B. Fedus
    Maarten Paul Bosma
    Marie Pellat
    Maxim Krikun
    Nan Du
    Simon Tong
    Tao Wang
    Toju Duke
    Yuanzhong Xu
    Zongwei Zhou
    (2022)
    Abstract: Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong performance on few-shot learning. However, training these large dense models requires significant amounts of computing resources. In this paper, we develop a family of sparsely activated mixture-of-experts language models named GLaM (Generalist Language Model), which can have many more parameters but require significantly less training cost than dense models. The largest GLaM has 1.2 trillion parameters, which is approximately 7x larger than GPT-3 but can be trained more efficiently. With only 1/3 of the energy consumption used to train GPT-3, GLaM achieves better overall performance on 29 zero-shot and one-shot NLP tasks. For example, GLaM gets 75.0% one-shot exact match accuracy on the TriviaQA test server, a significant improvement over the 68.0% obtained by GPT-3.
    Abstract: Successful and effective communication between humans and AI relies on a shared experience of the world. By training solely on written text, current language models (LMs) miss the grounded experience of humans in the real world; their failure to relate language to the physical world leads to misrepresented knowledge and obvious mistakes in their reasoning. We present Mind's Eye, a paradigm to ground language model reasoning in the physical world. Given a physical reasoning question, we use a computational physics engine (DeepMind’s MuJoCo) to simulate the possible outcomes, and then use the simulation results as part of the input, which enables language models to perform reasoning. Experiments on 39 tasks in a physics alignment benchmark demonstrate that Mind's Eye can improve reasoning ability by a large margin (27.9% zero-shot, and 46.0% few-shot absolute accuracy improvement on average). Smaller language models armed with Mind's Eye can obtain similar performance to models that are 100× larger. Finally, we confirm the robustness of Mind's Eye through ablation studies.
    Finetuned Language Models are Zero-Shot Learners
    Jason Wei
    Maarten Paul Bosma
    Vincent Zhao
    Nan Du
    International Conference on Learning Representations (2022)
    Abstract: This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning (finetuning language models on a collection of tasks described via instructions) substantially boosts zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of tasks and model scale are key components to the success of instruction tuning.
    PaLM: Scaling Language Modeling with Pathways
    Sharan Narang
    Jacob Devlin
    Maarten Bosma
    Hyung Won Chung
    Sebastian Gehrmann
    Parker Schuh
    Sasha Tsvyashchenko
    Abhishek Rao
    Yi Tay
    Noam Shazeer
    Nan Du
    Reiner Pope
    James Bradbury
    Guy Gur-Ari
    Toju Duke
    Henryk Michalewski
    Xavier Garcia
    Liam Fedus
    David Luan
    Barret Zoph
    Ryan Sepassi
    David Dohan
    Shivani Agrawal
    Mark Omernick
    Marie Pellat
    Aitor Lewkowycz
    Erica Moreira
    Rewon Child
    Oleksandr Polozov
    Zongwei Zhou
    Michele Catasta
    Jason Wei
    arXiv:2204.02311 (2022)
    Abstract: Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
    Evaluation of US State-Based Policy Interventions on Social Distancing Using Aggregated Mobility Data during the COVID-19 Pandemic
    Gregory Alexander Wellenius
    Swapnil Suresh Vispute
    Valeria Espinosa
    Thomas Tsai
    Jonathan Hennessy
    Krishna Kumar Gadepalli
    Adam Boulanger
    Adam Pearce
    Chaitanya Kamath
    Arran Schlosberg
    Catherine Bendebury
    Chinmoy Mandayam
    Charlotte Stanton
    Shailesh Bavadekar
    Christopher David Pluntke
    Damien Desfontaines
    Benjamin H. Jacobson
    Zan Armstrong
    Katherine Chou
    Andrew Nathaniel Oplinger
    Ashish K. Jha
    Evgeniy Gabrilovich
    Nature Communications (2021)
    Abstract: Social distancing has emerged as the primary mitigation strategy to combat the COVID-19 pandemic in the United States. However, large-scale evaluation of the effectiveness of social distancing policies is lacking. We used aggregated mobility data to quantify the impact of social distancing policies on observed changes in mobility. Declarations of states of emergency resulted in approximately a 10% reduction in time spent outside places of residence and an increase in visits to grocery stores and pharmacies. Subsequent implementation of ≥1 social distancing policies resulted in an additional 25% reduction in mobility in the following week. The seven states that subsequently ordered residents to shelter in place on or before March 23, 2020 observed an additional 29% reduction in time spent outside the residence. Our findings suggest that state-wide mandates are highly effective in achieving the goals of social distancing to minimize the transmission of COVID-19.
    Training independent subnetworks for robust prediction
    Marton Havasi
    Rodolphe Jenatton
    Stanislav Fort
    International Conference on Learning Representations (2021)
    Abstract: Recent approaches to efficiently ensemble neural networks have shown that strong robustness and uncertainty performance can be achieved with a negligible gain in parameters over the original network. However, these methods still require multiple forward passes for prediction, leading to a significant runtime cost. In this work, we show a surprising result: the benefits of using multiple predictions can be achieved 'for free' under a single model's forward pass. In particular, we show that, using a multi-input multi-output (MIMO) configuration, one can utilize a single model's capacity to train multiple subnetworks that independently learn the task at hand. By ensembling the predictions made by the subnetworks, we improve model robustness without increasing compute. We observe a significant improvement in negative log-likelihood, accuracy, and calibration error on CIFAR10, CIFAR100, ImageNet, and their out-of-distribution variants compared to previous methods.
    Abstract: Capturing the inter-dependencies among multiple types of clinically critical events is critical not only to accurate future event prediction, but also to better treatment planning. In this work, we propose a deep latent state-space generative model to capture the interactions among different types of correlated clinical events (e.g., kidney failure, mortality) by explicitly modeling the temporal dynamics of patients' latent states. Based on these learned patient states, we further develop a new general discrete-time formulation of the hazard rate function to estimate the survival distribution of patients with significantly improved accuracy. Extensive evaluations over real EMR data show that our proposed model compares favorably to various state-of-the-art baselines. Furthermore, our method also uncovers meaningful insights about the latent correlations among mortality and different types of organ failures.
    Abstract: In a general inpatient population, we predicted patient-specific medication orders based on structured information in the electronic health record (EHR). Data on over three million medication orders from an academic medical center were used to train two machine-learning models: a deep learning sequence model and a logistic regression model. Both were compared with a baseline that ranked the most frequently ordered medications based on a patient’s discharge hospital service and amount of time since admission. Models were trained to predict from 990 possible medications at the time of order entry. Fifty-five percent of medications ordered by physicians were ranked in the sequence model’s top-10 predictions (logistic model: 49%) and 75% ranked in the top-25 (logistic model: 69%). Ninety-three percent of the sequence model’s top-10 prediction sets contained at least one medication that physicians ordered within the next day. These findings demonstrate that medication orders can be predicted from information present in the EHR.
    Abstract: The paradigm of 'pretraining' from a set of relevant auxiliary tasks and then 'finetuning' on a target task has been successfully applied in many different domains. However, when the auxiliary tasks are abundant, with complex relationships to the target task, using domain knowledge or searching over all possible pretraining setups are inefficient strategies. To address this challenge, we propose a method to automatically select, from a large set of auxiliary tasks, those that yield a representation most useful to the target task. In particular, we develop an efficient algorithm that uses automatic auxiliary task selection within a nested-loop meta-learning process. We have applied this algorithm to the task of clinical outcome predictions in electronic medical records, learning from a large number of self-supervised tasks related to forecasting patient trajectories. Experiments on a real clinical dataset demonstrate the superior predictive performance of our method compared to direct supervised learning, naive pretraining and multitask learning, in particular in low-data scenarios when the primary task has very few examples. With detailed ablation analysis, we further show that the selection rules are interpretable and able to generalize to unseen target tasks with new data.
    Analyzing the Role of Model Uncertainty for Electronic Health Records
    Edward Choi
    Jeremy Nixon
    Ghassen Jerfel
    ACM Conference on Health, Inference, and Learning (ACM CHIL) (2020)
    Abstract: In medicine, both ethical and monetary costs of incorrect predictions can be significant, and the complexity of the problems often necessitates increasingly complex models. Recent work has shown that changing just the random seed is enough for otherwise well-tuned deep neural networks to vary in their individual predicted probabilities. In light of this, we investigate the role of model uncertainty methods in the medical domain. Using RNN ensembles and various Bayesian RNNs, we show that population-level metrics, such as AUC-PR, AUC-ROC, log-likelihood, and calibration error, do not capture model uncertainty. Meanwhile, the presence of significant variability in patient-specific predictions and optimal decisions motivates the need for capturing model uncertainty. Understanding the uncertainty for individual patients is an area with clear clinical impact, such as determining when a model decision is likely to be brittle. We further show that RNNs with only Bayesian embeddings can be a more efficient way to capture model uncertainty compared to ensembles, and we analyze how model uncertainty is impacted across individual input features and patient subgroups.
    Abstract: Given a quantum circuit, a quantum computer can sample the output distribution exponentially faster in the number of bits than classical computers. A similar exponential separation has yet to be established in generative models through quantum sample learning: given samples from an n-qubit computation, can we learn the underlying quantum distribution using models with training parameters that scale polynomially in n under a fixed training time? We study four kinds of generative models: Deep Boltzmann machine (DBM), Generative Adversarial Networks (GANs), Long Short-Term Memory (LSTM) and Autoregressive GAN, on learning quantum data sets generated by deep random circuits. We demonstrate the leading performance of LSTM in learning quantum samples, and thus the autoregressive structure present in the underlying quantum distribution from random quantum circuits. Both numerical experiments and a theoretical proof in the case of the DBM show exponentially growing complexity of learning-agent parameters required for achieving a fixed accuracy as n increases. Finally, we establish a connection between learnability and the complexity of generative models by benchmarking learnability against different sets of samples drawn from probability distributions of varying degrees of complexity in their quantum and classical representations.
    Learning the Graphical Structure of Electronic Health Records with Graph Convolutional Transformer
    Edward Choi
    Zhen Xu
    Yujia Li
    Gerardo Flores
    Association for the Advancement of Artificial Intelligence (AAAI) (2020)
    Abstract: Effective modeling of electronic health records (EHR) is rapidly becoming an important topic in both academia and industry. A recent study showed that using the graphical structure underlying EHR data (e.g., the relationship between diagnoses and treatments) improves the performance of prediction tasks such as heart failure prediction. However, EHR data do not always contain complete structure information. Moreover, when it comes to claims data, structure information is completely unavailable to begin with. Under such circumstances, can we still do better than just treating EHR data as a flat-structured bag-of-features? In this paper, we study the possibility of jointly learning the hidden structure of EHR while performing supervised prediction tasks on EHR data. Specifically, we argue that the Transformer is a suitable base model to learn the hidden EHR structure, and propose Graph Convolutional Transformer, which uses data statistics to guide the structure learning process. The proposed model consistently outperformed previous approaches empirically, on both synthetic data and publicly available EHR data, for various prediction tasks such as graph reconstruction and readmission prediction, indicating that it can serve as an effective general-purpose representation learning algorithm for EHR data.
    Google COVID-19 Search Trends Symptoms Dataset: Anonymization Process Description
    Akim Kumok
    Chaitanya Kamath
    Charlotte Stanton
    Damien Desfontaines
    Evgeniy Gabrilovich
    Gerardo Flores
    Gregory Alexander Wellenius
    Ilya Eckstein
    John S. Davis
    Katie Everett
    Krishna Kumar Gadepalli
    Rayman Huang
    Shailesh Bavadekar
    Thomas Ludwig Roessler
    Venky Ramachandran
    Yael Mayer
    arXiv (2020)
    Abstract: This report describes the aggregation and anonymization process applied to the initial version of the COVID-19 Search Trends symptoms dataset, a publicly available dataset that shows aggregated, anonymized trends in Google searches for symptoms (and some related topics). The anonymization process is designed to protect the daily search activity of every user with ε-differential privacy for ε = 1.68.
    Natural Questions: a Benchmark for Question Answering Research
    Olivia Redfield
    Danielle Epstein
    Illia Polosukhin
    Matthew Kelcey
    Jacob Devlin
    Llion Jones
    Ming-Wei Chang
    Jakob Uszkoreit
    Transactions of the Association for Computational Linguistics (2019) (to appear)
    Abstract: We present the Natural Questions corpus, a question answering dataset. Questions consist of real anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations, 7,830 examples with 5-way annotations for development data, and a further 7,842 examples 5-way annotated sequestered as test data. We present experiments validating the quality of the data. We also describe an analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.
    Abstract: Music relies heavily on repetition to build structure and meaning. Self-reference occurs on multiple timescales, from motifs to phrases to reuse of entire sections of music, such as in pieces with ABA structure. The Transformer (Vaswani et al., 2017), a sequence model based on self-attention, has achieved compelling results in many generation tasks that require maintaining long-range coherence. This suggests that self-attention might also be well-suited to modeling music. In musical composition and performance, however, relative timing is critically important. Existing approaches for representing relative positional information in the Transformer modulate attention based on pairwise distance (Shaw et al., 2018). This is impractical for long sequences such as musical compositions since their memory complexity for intermediate relative information is quadratic in the sequence length. We propose an algorithm that reduces their intermediate memory requirement to linear in the sequence length. This enables us to demonstrate that a Transformer with our modified relative attention mechanism can generate minute-long compositions (thousands of steps, four times the length modeled in Oore et al., 2018) with compelling structure, generate continuations that coherently elaborate on a given motif, and in a seq2seq setup generate accompaniments conditioned on melodies. We evaluate the Transformer with our relative attention mechanism on two datasets, JSB Chorales and Piano-e-Competition, and obtain state-of-the-art results on the latter.
    Abstract: Clinical notes in electronic health records contain highly heterogeneous writing styles, including non-standard terminology or abbreviations. Using these notes in predictive modeling has traditionally required preprocessing (e.g., taking frequent terms or topic modeling) that removes much of the richness of the source data. We propose a pretrained hierarchical recurrent neural network model that parses minimally processed clinical notes in an intuitive fashion, and show that it improves performance for discharge diagnosis classification tasks on the Medical Information Mart for Intensive Care III (MIMIC-III) dataset, compared to models that treat the notes as an unordered collection of terms or that conduct no pretraining. We also apply an attribution technique to examples to identify the words that the model uses to make its prediction, and show the importance of the words' nearby context.
    Abstract: In this paper, we present Smart Compose, a novel system for generating interactive, real-time suggestions in Gmail that assists users in writing emails by reducing repetitive typing. In the design and deployment of such a large-scale and complicated system, we faced several challenges including model selection, performance evaluation, serving and other practical issues. At the core of Smart Compose is a large-scale neural language model. We leveraged state-of-the-art machine learning techniques for language model training which enabled high-quality suggestion prediction, and constructed novel serving infrastructure for high-throughput and real-time inference. Experimental results show the effectiveness of our proposed system design and deployment approach. This system is currently being served in Gmail.
    Abstract: Recurrent neural networks (RNNs) are a common method of generating text token by token. These models are typically trained via maximum likelihood (known in this context as teacher forcing). However, this approach frequently suffers from problems when using a trained model to generate new text, since when generating words later in the sequence, the model often conditions on a sequence of words that was never observed at training time. We explore methods for using Generative Adversarial Networks (GANs) as an alternative to teacher forcing to generate discrete sequences. In particular, we consider a conditional GAN that fills in missing text conditioned on the surrounding context. We show qualitative and quantitative evidence that this produces more realistic text samples compared to a maximum likelihood trained model. We also propose a new task that quantitatively measures the quality of RNN produced samples.
    Abstract: Generative adversarial networks (GANs) are a family of generative models that do not minimize a single training criterion. Unlike other generative models, the data distribution is learned via a game between a generator (the generative model) and a discriminator (a teacher providing training signal) that each minimize their own cost. GANs are designed to reach a Nash equilibrium at which each player cannot reduce their cost without changing the other players’ parameters. One useful approach for the theory of GANs is to show that a divergence between the training distribution and the model distribution obtains its minimum value at equilibrium. Several recent research directions have been motivated by the idea that this divergence is the primary guide for the learning process and that every step of learning should decrease the divergence. We show that this view is overly restrictive. During GAN training, the discriminator provides learning signal in situations where the gradients of the divergences between distributions would not be useful. We provide empirical counterexamples to the view of GAN training as divergence minimization. Specifically, we demonstrate that GANs are able to learn distributions in situations where the divergence minimization point of view predicts they would fail. We also show that gradient penalties motivated from the divergence minimization perspective are equally helpful when applied in other contexts in which the divergence minimization perspective does not predict they would be helpful. This contributes to a growing body of evidence that GAN training may be more usefully viewed as approaching Nash equilibria via trajectories that do not necessarily minimize a specific divergence at each step.
    Embedding Text in Hyperbolic Spaces
    Bhuwan Dhingra
    Chris Shallue
    Mohammad Norouzi
    NAACL Workshop (2018)
    Abstract: Natural language text exhibits implicit hierarchical structure in a variety of respects. Ideally we could incorporate our prior knowledge of the existence of some sort of hierarchy into unsupervised learning algorithms that work on text data. Recent work by Nickel and Kiela (2017) proposed using hyperbolic instead of Euclidean embedding spaces to represent hierarchical data and demonstrated encouraging results on supervised embedding tasks. In this work, we apply their approach to unsupervised learning of word and sentence embeddings. Although we obtain mildly positive results, we describe the challenges we faced in using the hyperbolic metric for these problems both in terms of improving performance in downstream tasks and in understanding the learned hierarchical structures.
    Peptide-Spectra Matching with Weak Supervision
    Sam Schoenholz
    Sean Hackett
    Laura Deming
    Eugene Melamud
    Navdeep Jaitly
    Fiona McAllister
    Jonathon O'Brien
    Bryson Bennett
    Daphne Koller
    arXiv (2018)
    Abstract: As in many other scientific domains, we face a fundamental problem when using machine learning to identify proteins from mass spectrometry data: large ground truth datasets mapping inputs to correct outputs are extremely difficult to obtain. Instead, we have access to imperfect hand-coded models crafted by domain experts. In this paper, we apply deep neural networks to an important step of the protein identification problem, the pairing of mass spectra with short sequences of amino acids called peptides. We train our model to differentiate between top scoring results from a state-of-the-art classical system and hard-negative second and third place results. Our resulting model is much better at identifying peptides with spectra than the model used to generate its training data. In particular, we achieve a 43% improvement over standard matching methods and a 10% improvement over a combination of the matching method and an industry standard cross-spectra reranking tool. Importantly, in a more difficult experimental regime that reflects current challenges facing biologists, our advantage over the previous state of the art grows to 15% even after reranking. We believe this approach will generalize to other challenging scientific problems.
    AirDialogue: An Environment for Goal-Oriented Dialogue Research
    Wei Wei
    Jia Li
    Empirical Methods in Natural Language Processing (EMNLP) (2018)
    Abstract: Recent progress in dialogue generation has inspired a number of studies on dialogue systems that are capable of accomplishing tasks through natural language interactions. A promising direction among these studies is the use of reinforcement learning techniques, such as self-play, for training dialogue agents. However, current datasets are limited in size, and the environment for training and evaluating agents is relatively unsophisticated. We present AirDialogue, a large dataset that contains 402,038 goal-oriented conversations. To collect this dataset, we create a context generator which provides travel and flight restrictions. We then ask human annotators to play the role of a customer or an agent and interact with the goal of successfully booking a trip given the restrictions. Key to our environment is the ease of evaluating the success of the dialogue, which is achieved by using ground-truth states (e.g., the flight being booked) generated by the restrictions. Any dialogue agent that does not generate the correct states is considered to fail. Our experimental results indicate that state-of-the-art dialogue models on the test dataset can only achieve a scaled score of 0.22 and an exact match score of 0.1 while humans can reach a score of 0.94 and 0.93 respectively, which suggests significant opportunities for future improvement.
    Abstract: Despite recent advances in training recurrent neural networks (RNNs), capturing long-term dependencies in sequences remains a fundamental challenge. Most approaches use backpropagation through time (BPTT), which is difficult to scale to very long sequences. This paper proposes a simple method that improves the ability to capture long-term dependencies in RNNs by adding an unsupervised auxiliary loss to the original objective. This auxiliary loss forces RNNs to either reconstruct previous events or predict next events in a sequence, making truncated backpropagation feasible for long sequences and also improving full BPTT. We evaluate our method on a variety of settings, including pixel-by-pixel image classification with sequence lengths up to 16,000, and a real document classification benchmark. Our results highlight good performance and resource efficiency of this approach over competitive baselines, including other recurrent models and a comparably sized Transformer. Further analyses reveal beneficial effects of the auxiliary loss on optimization and regularization, as well as extreme cases where there is little to no backpropagation.
    Scalable and accurate deep learning for electronic health records
    Alvin Rishi Rajkomar
    Eyal Oren
    Nissan Hajaj
    Mila Hardt
    Xiaobing Liu
    Jake Marcus
    Patrik Per Sundberg
    Kun Zhang
    Yi Zhang
    Gerardo Flores
    Gavin Duggan
    Jamie Irvine
    Kurt Litsch
    Alex Mossin
    Justin Jesada Tansuwan
    De Wang
    Dana Ludwig
    Samuel Volchenboum
    Kat Chou
    Michael Pearson
    Srinivasan Madabushi
    Nigam Shah
    Atul Butte
    npj Digital Medicine (2018)
    Preview abstract Predictive modeling with electronic health record (EHR) data is anticipated to drive personalized medicine and improve healthcare quality. Constructing predictive statistical models typically requires extraction of curated predictor variables from normalized EHR data, a labor-intensive process that discards the vast majority of information in each patient’s record. We propose a representation of patients’ entire raw EHR records based on the Fast Healthcare Interoperability Resources (FHIR) format. We demonstrate that deep learning methods using this representation are capable of accurately predicting multiple medical events from multiple centers without site-specific data harmonization. We validated our approach using de-identified EHR data from two U.S. academic medical centers with 216,221 adult patients hospitalized for at least 24 hours. In the sequential format we propose, this volume of EHR data unrolled into a total of 46,864,534,945 data points, including clinical notes. Deep learning models achieved high accuracy for tasks such as predicting: in-hospital mortality (AUROC across sites 0.93-0.94), 30-day unplanned readmission (AUROC 0.75-0.76), prolonged length of stay (AUROC 0.85-0.86), and all of a patient’s final discharge diagnoses (frequency-weighted AUROC 0.90). These models outperformed state-of-the-art traditional predictive models in all cases. We also present a case-study of a neural-network attribution system, which illustrates how clinicians can gain some transparency into the predictions. We believe that this approach can be used to create accurate and scalable predictions for a variety of clinical scenarios, complete with explanations that directly highlight evidence in the patient’s chart. View details
    Preview abstract Data are often labelled by many different experts with each expert only labeling a small fraction of the data and each data point being labelled by several experts. This reduces the workload on individual experts and also gives a better estimate of the unobserved ground truth. When experts disagree, the standard approaches are to treat the majority opinion as the correct label or to model the correct label as a distribution. These approaches, however, do not make any use of potentially valuable information about which expert produced which label. To make use of this extra information, we propose modeling the experts individually and then learning mixing proportions for combining them in sample-specific ways. This allows us to give more weight to more reliable experts and makes it possible to take advantage of the unique strengths of individual experts at classifying certain types of data. Here we show that our approach leads to improved computer-aided diagnosis of diabetic retinopathy, where the experts are human doctors and the data are retinal images. We compare our method against those of Welinder and Perona, and Mnih and Hinton. Our work offers an innovative approach for dealing with the myriad real-world settings that lack ground truth labels. View details
    Preview abstract Data are often labeled by many different experts with each expert only labeling a small fraction of the data and each data point being labeled by several experts. This reduces the workload on individual experts and also gives a better estimate of the unobserved ground truth. When experts disagree, the standard approaches are to treat the majority opinion as the correct label or to model the correct label as a distribution. These approaches, however, do not make any use of potentially valuable information about which expert produced which label. To make use of this extra information, we propose modeling the experts individually and then learning averaging weights for combining them, possibly in sample-specific ways. This allows us to give more weight to more reliable experts and take advantage of the unique strengths of individual experts at classifying certain types of data. Here we show that our approach leads to improvements in computer-aided diagnosis of diabetic retinopathy. We also show that our method performs better than competing algorithms by Welinder and Perona, and by Mnih and Hinton. Our work offers an innovative approach for dealing with the myriad real-world settings that use expert opinions to define labels for training. View details
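The idea of modeling each expert separately and then combining them with learned averaging weights can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of a softmax to turn weight logits into averaging weights, and the toy inputs are all assumptions made for illustration, and the weights here are fixed rather than learned or sample-specific.

```python
import math


def softmax(logits):
    """Turn arbitrary real-valued logits into weights that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_experts(expert_probs, weight_logits):
    """Weighted average of per-expert class distributions.

    expert_probs: one probability distribution per modeled expert
                  (hypothetical outputs of per-expert classifiers).
    weight_logits: one unnormalized weight per expert (assumed to be
                   learned elsewhere; fixed here for illustration).
    """
    weights = softmax(weight_logits)
    n_classes = len(expert_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, expert_probs))
        for c in range(n_classes)
    ]


if __name__ == "__main__":
    # Two modeled experts who disagree; the second is weighted more heavily.
    print(combine_experts([[0.9, 0.1], [0.2, 0.8]], [0.0, 1.0]))
```

Giving a larger logit to an expert pulls the combined distribution toward that expert's prediction, which is the sense in which more reliable experts receive more weight.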
    Preview abstract Adversarial training provides a means of regularizing supervised learning algorithms while virtual adversarial training is able to extend supervised learning algorithms to the semi-supervised setting. However, both methods require making small perturbations to numerous entries of the input vector, which is inappropriate for sparse high-dimensional inputs such as one-hot word representations. We extend adversarial and virtual adversarial training to the text domain by applying perturbations to the word embeddings in a recurrent neural network rather than to the original input itself. The proposed method achieves state of the art results on multiple benchmark semi-supervised and purely supervised tasks. We provide visualizations and analysis showing that the learned word embeddings have improved in quality and that while training, the model is less prone to overfitting. View details
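As a rough illustration of perturbing word embeddings rather than one-hot inputs, the sketch below scales a loss gradient to a fixed L2 norm and adds it to the embeddings. The abstract does not give the exact formula, so the normalization over the whole sequence, the parameter name `epsilon`, and the assumption that the gradients are supplied by some training framework are all illustrative choices, not the paper's stated method.

```python
import math


def adversarial_perturbation(grads, epsilon):
    """Scale per-token embedding gradients to have overall L2 norm epsilon.

    grads: list of gradient vectors, one per token in the sequence
           (assumed to come from backpropagating the loss elsewhere).
    """
    norm = math.sqrt(sum(g * g for row in grads for g in row))
    if norm == 0.0:
        # No gradient signal: return a zero perturbation.
        return [[0.0 for _ in row] for row in grads]
    return [[epsilon * g / norm for g in row] for row in grads]


def perturb_embeddings(embeddings, grads, epsilon):
    """Add the norm-bounded perturbation to each token embedding."""
    delta = adversarial_perturbation(grads, epsilon)
    return [
        [e + d for e, d in zip(emb_row, delta_row)]
        for emb_row, delta_row in zip(embeddings, delta)
    ]


if __name__ == "__main__":
    embeddings = [[1.0, 1.0], [1.0, 1.0]]
    grads = [[3.0, 0.0], [0.0, 4.0]]
    print(perturb_embeddings(embeddings, grads, epsilon=1.0))
```

Because the perturbation lives in the dense embedding space, it can be small in every coordinate, which is not possible for a sparse one-hot input.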
    Preview abstract This work explores hypernetworks: an approach of using one network, also known as a hypernetwork, to generate the weights for another network. Hypernetworks provide an abstraction that is similar to what is found in nature: the relationship between a genotype - the hypernetwork - and a phenotype - the main network. Though they are also reminiscent of HyperNEAT in evolution, our hypernetworks are trained end-to-end with backpropagation and thus are usually faster. The focus of this work is to make hypernetworks useful for deep convolutional networks and long recurrent networks, where hypernetworks can be viewed as a relaxed form of weight-sharing across layers. Our main result is that hypernetworks can generate non-shared weights for LSTM and achieve near state-of-the-art results on a variety of sequence modelling tasks including character-level language modelling, handwriting generation and neural machine translation, challenging the weight-sharing paradigm for recurrent networks. Our results also show that hypernetworks applied to convolutional networks still achieve respectable results for image recognition tasks compared to state-of-the-art baseline models while requiring fewer learnable parameters. View details
    Preview abstract This work explores hypernetworks: an approach of using one network, also known as a hypernetwork, to generate the weights for another network. We apply hypernetworks to generate adaptive weights for recurrent networks. In this case, hypernetworks can be viewed as a relaxed form of weight-sharing across layers. In our implementation, hypernetworks are trained jointly with the main network in an end-to-end fashion. Our main result is that hypernetworks can generate non-shared weights for LSTM and achieve state-of-the-art results on a variety of sequence modelling tasks including character-level language modelling, handwriting generation and neural machine translation, challenging the weight-sharing paradigm for recurrent networks. View details
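The core idea, one network producing the weights used by another, can be shown with a deliberately tiny sketch. Everything here is an assumption for illustration: the hypernetwork is a single linear map, the layer embedding `z` and all function names are hypothetical, and no training is performed. The actual models in the paper are considerably more elaborate.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def generate_weights(hyper_matrix, z, rows, cols):
    """Hypernetwork: map a small embedding z to a rows x cols weight matrix.

    hyper_matrix has rows * cols rows and len(z) columns; its output is
    reshaped into the main layer's weight matrix.
    """
    flat = matvec(hyper_matrix, z)
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def main_layer(x, hyper_matrix, z, rows, cols):
    """Main network layer whose weights are generated, not stored directly."""
    weights = generate_weights(hyper_matrix, z, rows, cols)
    return matvec(weights, x)


if __name__ == "__main__":
    hyper_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    # Different embeddings z yield different weights from the same hypernetwork.
    print(main_layer([1.0, 1.0], hyper_matrix, [2.0, 3.0], rows=2, cols=2))
    print(main_layer([1.0, 1.0], hyper_matrix, [1.0, 0.0], rows=2, cols=2))
```

In this toy setup the only stored parameters are `hyper_matrix` and the embeddings, so layers or timesteps with different `z` get different weights while still sharing the generator, which is one way to read "a relaxed form of weight-sharing".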
    Preview abstract The standard recurrent neural network language model (RNNLM) generates sentences one word at a time and does not work from an explicit global sentence representation. In this work, we introduce and study an RNN-based variational autoencoder generative model that incorporates distributed latent representations of entire sentences. This factorization allows it to explicitly model holistic properties of sentences such as style, topic, and high-level syntactic features. Samples from the prior over these sentence representations remarkably produce diverse and well-formed sentences through simple deterministic decoding. By examining paths through this latent space, we are able to generate coherent novel sentences that interpolate between known sentences. We present techniques for solving the difficult learning problem presented by this model, demonstrate its effectiveness in imputing missing words, explore many interesting properties of the model's latent sentence space, and present negative results on the use of the model in language modeling. View details
    Semi-supervised sequence learning
    Advances in Neural Information Processing Systems, NIPS (2015)
    Preview abstract We present two approaches that use unlabeled data to improve sequence learning with recurrent networks. The first approach is to predict what comes next in a sequence, which is a conventional language model in natural language processing. The second approach is to use a sequence autoencoder, which reads the input sequence into a vector and predicts the input sequence again. These two algorithms can be used as a “pretraining” step for a later supervised sequence learning algorithm. In other words, the parameters obtained from the unsupervised step can be used as a starting point for other supervised training models. In our experiments, we find that long short term memory recurrent networks after being pretrained with the two approaches are more stable and generalize better. With pretraining, we are able to train long short term recurrent networks up to a few hundred timesteps, thereby achieving strong performance in many text classification tasks, such as IMDB, DBpedia and 20 Newsgroups. View details
    Document embedding with paragraph vectors
    Christopher Olah
    NIPS Deep Learning Workshop (2014)
    Preview abstract Paragraph Vectors has been recently proposed as an unsupervised method for learning distributed representations for pieces of text. In their work, the authors showed that the method can learn an embedding of movie review texts which can be leveraged for sentiment analysis. That proof of concept, while encouraging, was rather narrow. Here we consider tasks other than sentiment analysis, provide a more thorough comparison of Paragraph Vectors to other document modelling algorithms such as Latent Dirichlet Allocation, and evaluate performance of the method as we vary the dimensionality of the learned representation. We benchmarked the models on two document similarity data sets, one from Wikipedia, one from arXiv. We observe that the Paragraph Vector method performs significantly better than other methods, and propose a simple improvement to enhance embedding quality. Somewhat surprisingly, we also show that much like word embeddings, vector operations on Paragraph Vectors can produce useful semantic results. View details
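To make "vector operations on Paragraph Vectors" concrete, the sketch below adds and subtracts document vectors and looks up the nearest remaining document by cosine similarity. The vectors, document names, and helper functions are toy placeholders invented for this example; they are not learned embeddings and not results from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def combine(vectors, positive, negative):
    """Sum the 'positive' document vectors and subtract the 'negative' ones."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for name in positive:
        query = [q + x for q, x in zip(query, vectors[name])]
    for name in negative:
        query = [q - x for q, x in zip(query, vectors[name])]
    return query


def nearest(query, vectors, exclude):
    """Name of the document most similar to the query, skipping 'exclude'."""
    candidates = [name for name in vectors if name not in exclude]
    return max(candidates, key=lambda name: cosine(query, vectors[name]))


if __name__ == "__main__":
    # Hypothetical 2-dimensional document vectors.
    vectors = {
        "doc_a": [1.0, 0.0],
        "doc_b": [0.0, 1.0],
        "doc_c": [2.0, 2.0],
        "doc_d": [-1.0, 0.0],
    }
    query = combine(vectors, positive=["doc_a", "doc_b"], negative=[])
    print(nearest(query, vectors, exclude={"doc_a", "doc_b"}))
```

The input documents are excluded from the lookup so that the query does not simply return one of its own operands.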
    Preview abstract Translating compounds is an important problem in machine translation. Since many compounds have not been observed during training, they pose a challenge for translation systems. Previous decompounding methods have often been restricted to a small set of languages as they cannot deal with more complex compound forming processes. We present a novel and unsupervised method to learn the compound parts and morphological operations needed to split compounds into their compound parts. The method uses a bilingual corpus to learn the morphological operations required to split a compound into its parts. Furthermore, monolingual corpora are used to learn and filter the set of compound part candidates. We evaluate our method within a machine translation task and show significant improvements for various languages to show the versatility of the approach. View details