George Dahl

George Dahl

George Dahl received his Ph.D. from the University of Toronto under the supervision of Geoff Hinton, where he worked on deep learning approaches to problems in speech recognition, computational chemistry, and natural language text processing, including some of the first successful deep acoustic models. He has been a research scientist at Google on the Brain team since 2015. His research focuses on highly flexible models that learn their own features, end-to-end, and make efficient use of data and computation for supervised, unsupervised, and reinforcement learning. In particular, he is interested in applications to linguistic and perceptual data as well as chemical, biological, and medical data. Google Scholar profile
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract In this work, we study the evolution of the loss Hessian across many classification tasks in order to understand the effect the curvature of the loss has on the training dynamics. Whereas prior work has focused on how different learning rates affect the loss Hessian observed during training, we also analyze the effects of model initialization, architectural choices, and common training heuristics such as gradient clipping and learning rate warmup. Our results demonstrate that successful model and hyperparameter choices allow the early optimization trajectory to either avoid---or navigate out of---regions of high curvature and into flatter regions that tolerate a higher learning rate. Our results suggest a unifying perspective on how disparate mitigation strategies for training instability ultimately address the same underlying failure mode of neural network optimization, namely poor conditioning. Inspired by the conditioning perspective, we show that learning rate warmup can improve training stability just as much as batch normalization, layer normalization, MetaInit, GradInit, and Fixup initialization. View details
    A mobile-optimized artificial intelligence system for gestational age and fetal malpresentation assessment
    Ryan Gomes
    Bellington Vwalika
    Chace Lee
    Angelica Willis
    Joan T. Price
    Christina Chen
    Margaret P. Kasaro
    James A. Taylor
    Elizabeth M. Stringer
    Scott Mayer McKinney
    Ntazana Sindano
    William Goodnight, III
    Justin Gilmer
    Benjamin H. Chi
    Charles Lau
    Terry Spitz
    Kris Liu
    Jonny Wong
    Rory Pilgrim
    Akib Uddin
    Lily Hao Yi Peng
    Kat Chou
    Jeffrey S. A. Stringer
    Shravya Ramesh Shetty
    Communications Medicine(2022)
    Preview abstract Background Fetal ultrasound is an important component of antenatal care, but shortage of adequately trained healthcare workers has limited its adoption in low-to-middle-income countries. This study investigated the use of artificial intelligence for fetal ultrasound in under-resourced settings. Methods Blind sweep ultrasounds, consisting of six freehand ultrasound sweeps, were collected by sonographers in the USA and Zambia, and novice operators in Zambia. We developed artificial intelligence (AI) models that used blind sweeps to predict gestational age (GA) and fetal malpresentation. AI GA estimates and standard fetal biometry estimates were compared to a previously established ground truth, and evaluated for difference in absolute error. Fetal malpresentation (non-cephalic vs cephalic) was compared to sonographer assessment. On-device AI model run-times were benchmarked on Android mobile phones. Results Here we show that GA estimation accuracy of the AI model is non-inferior to standard fetal biometry estimates (error difference -1.4 ± 4.5 days, 95% CI -1.8, -0.9, n=406). Non-inferiority is maintained when blind sweeps are acquired by novice operators performing only two of six sweep motion types. Fetal malpresentation AUC-ROC is 0.977 (95% CI, 0.949, 1.00, n=613), sonographers and novices have similar AUC-ROC. Software run-times on mobile phones for both diagnostic models are less than 3 seconds after completion of a sweep. Conclusions The gestational age model is non-inferior to the clinical standard and the fetal malpresentation model has high AUC-ROCs across operators and devices. Our AI models are able to run on-device, without internet connectivity, and provide feedback scores to assist in upleveling the capabilities of lightly trained ultrasound operators in low resource settings. View details
    Adaptive Gradient Methods at the Edge of Stability
    Behrooz Ghorbani
    David Cardoze
    Jeremy Cohen
    Justin Gilmer
    Shankar Krishnan
    NeuRIPS 2022(2022) (to appear)
    Preview abstract Little is known about the training dynamics of adaptive gradient methods like Adam in deep learning. In this paper, we shed light on the behavior of these algorithms in the full-batch and sufficiently large batch settings. Specifically, we show that during full-batch training, the maximum eigenvalue of the \emph{preconditioned} Hessian typically equilibrates at the stability threshold of a related non-adaptive algorithm. For Adam with step size $\eta$ and $\beta_1 = 0.9$, this stability threshold is $38/\eta$. Similar effects occur during minibatch training, especially as the batch size grows. Yet, even though adaptive methods train at the “Edge of Stability,” their behavior in this regime differs in a crucial way from that of their non-adaptive counterparts. Whereas non-adaptive algorithms are forced to remain in low-curvature regions of the loss landscape, we demonstrate that adaptive gradient methods often advance into high-curvature regions, while adapting the preconditioner to compensate. We believe that our findings will serve as a foundation for the community’s future understanding of adaptive gradient methods in deep learning. View details
    Machine learning guided aptamer discovery
    Ali Bashir
    Geoff Davis
    Michelle Therese Dimon
    Qin Yang
    Scott Ferguson
    Zan Armstrong
    Nature Communications(2021)
    Preview abstract Aptamers are discovered by searching a large library for sequences with desirable binding properties. These libraries, however, are physically constrained to a fraction of the theoretical sequence space and limited to sampling strategies that are easy to scale. Integrating machine learning could enable identification of high-performing aptamers across this unexplored fitness landscape. We employed particle display (PD) to partition aptamers by affinity and trained neural network models to improve physically-derived aptamers and predict affinity in silico. These predictions were used to locally improve physically derived aptamers as well as identify completely novel, high-affinity aptamers de novo. We experimentally validated the predictions, improving aptamer candidate designs at a rate 10-fold higher than random perturbation, and generating novel aptamers at a rate 448-fold higher than PD alone. We characterized the explanatory power of the models globally and locally and showed successful sequence truncation while maintaining affinity. This work combines machine learning and physical discovery, uses principles that are widely applicable to other display technologies, and provides a path forward for better diagnostic and therapeutic agents. View details
    Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model
    Guodong Zhang
    James Martens
    Sushant Sachdeva
    Chris Shallue
    Roger Grosse
    2019 Conference on Neural Information Processing Systems(2019)
    Preview abstract Increasing the batch size is a popular way to speed up neural network training, but beyond some critical batch size, larger batch sizes yield diminishing returns. In this work, we study how the critical batch size changes based on properties of the optimization algorithm, including acceleration and preconditioning, through two different lenses: large scale experiments, and analysis of a simple noisy quadratic model (NQM). We experimentally demonstrate that optimization algorithms that employ preconditioning, specifically Adam and K-FAC, result in much larger critical batch sizes than stochastic gradient descent with momentum. We also demonstrate that the NQM captures many of the essential features of real neural network training, despite being drastically simpler to work with. The NQM predicts our results with preconditioned optimizers, previous results with accelerated gradient descent, and other results around optimal learning rates and large batch training, making it a useful tool to generate testable predictions about neural network optimization. View details
    Relational inductive biases, deep learning, and graph networks
    Peter Battaglia
    Jessica Blake Chandler Hamrick
    Victor Bapst
    Alvaro Sanchez
    Vinicius Zambaldi
    Mateusz Malinowski
    Andrea Tacchetti
    David Raposo
    Adam Santoro
    Ryan Faulkner
    Caglar Gulcehre
    Francis Song
    Andy Ballard
    Justin Gilmer
    Ashish Vaswani
    Kelsey Allen
    Charles Nash
    Victoria Jayne Langston
    Chris Dyer
    Nicolas Heess
    Daan Wierstra
    Matt Botvinick
    Yujia Li
    Razvan Pascanu
    arXiv(2018)
    Preview abstract The purpose of this paper is to explore relational inductive biases in modern AI, especially deep learning, describing a rough taxonomy of existing approaches, and introducing a common mathematical framework for expressing and unifying various approaches. The key theme running through this work is structure—how the world is structured, and how the structure of different computational strategies determines their strengths and weaknesses. View details
    Preview abstract Sequence-to-sequence alignment is a widely-used analysis method in bioinformatics. One common use of sequence alignment is to infer information about an unknown query sequence from the annotations of similar sequences in a database, such as predicting the function of a novel protein sequence by aligning to a database of protein families or predicting the presence/absence of species in a metagenomics sample by aligning reads to a database of reference genomes. In this work we describe a deep learning approach to solve such problems in a single step by training a deep neural network (DNN) to predict the database-derived labels directly from the query sequence. We demonstrate the value of this DNN approach on a hard problem of practical importance: determining the species of origin of next-generation sequencing reads from 16s ribosomal DNA. In particular, we show that when trained on 16s sequences from more than 13,000 distinct species, our DNN can predict the species of origin of individual reads more accurately than existing machine learning baselines and alignment-based methods like BWA or BLAST, achieving absolute performance within 2.0% of perfect memorization of the training inputs. Moreover, the DNN remains accurate and outperforms read alignment approaches when the query sequences are especially noisy or ambiguous. Finally, these DNN models can be used to assess metagenomic community composition on a variety of experimental 16s read datasets. Our results are a first step towards our long-term goal of developing a general-purpose deep learning model that can learn to predict any type of label from short biological sequences. View details
    Preview abstract Neural language models are a critical component of state-of-the-art systems for machine translation, summarization, audio transcription, and other tasks. These language models are almost universally autoregressive in nature, generating sentences one token at a time from left to right. This paper studies the influence of token generation order on model quality via a novel two-pass language model that produces partially-filled sentence “templates” and then fills in missing tokens. We compare various strategies for structuring these two passes and observe a surprisingly large variation in model quality. We find the most effective strategy generates function words in the first pass followed by content words in the second. We believe these experimental results justify a more extensive investigation of generation order for neural language models. View details
    Peptide-Spectra Matching with Weak Supervision
    Sam Schoenholz
    Sean Hackett
    Laura Deming
    Eugene Melamud
    Navdeep Jaitly
    Fiona McAllister
    Jonathon O'Brien
    Bryson Bennett
    Daphne Koller
    arXiv(2018)
    Preview abstract As in many other scientific domains, we face a fundamental problem when using machine learning to identify proteins from mass spectrometry data: large ground truth datasets mapping inputs to correct outputs are extremely difficult to obtain. Instead, we have access to imperfect hand-coded models crafted by domain experts. In this paper, we apply deep neural networks to an important step of the protein identification problem, the pairing of mass spectra with short sequences of amino acids called peptides. We train our model to differentiate between top scoring results from a state-of-the art classical system and hard-negative second and third place results. Our resulting model is much better at identifying peptides with spectra than the model used to generate its training data. In particular, we achieve a 43% improvement over standard matching methods and a 10% improvement over a combination of the matching method and an industry standard cross-spectra reranking tool. Importantly, in a more difficult experimental regime that reflects current challenges facing biologists, our advantage over the previous state-of-theart grows to 15% even after reranking. We believe this approach will generalize to other challenging scientific problems. View details
    Artificial Intelligence Based Breast Cancer Nodal Metastasis Detection: Insights into the Black Box for Pathologists
    Timo Kohlberger
    Mohammad Norouzi
    Jenny Smith
    Arash Mohtashamian
    Niels Olson
    Lily Peng
    Jason Hipp
    Martin Stumpe
    Archives of Pathology & Laboratory Medicine(2018)
    Preview abstract Context - Nodal metastasis of a primary tumor influences therapy decisions for a variety of cancers. Histologic identification of tumor cells in lymph nodes can be laborious and error-prone, especially for small tumor foci. Objective - To evaluate the application and clinical implementation of a state-of-the-art deep learning–based artificial intelligence algorithm (LYmph Node Assistant or LYNA) for detection of metastatic breast cancer in sentinel lymph node biopsies. Design - Whole slide images were obtained from hematoxylin-eosin–stained lymph nodes from 399 patients (publicly available Camelyon16 challenge dataset). LYNA was developed by using 270 slides and evaluated on the remaining 129 slides. We compared the findings to those obtained from an independent laboratory (108 slides from 20 patients/86 blocks) using a different scanner to measure reproducibility. Results - LYNA achieved a slide-level area under the receiver operating characteristic (AUC) of 99% and a tumor-level sensitivity of 91% at 1 false positive per patient on the Camelyon16 evaluation dataset. We also identified 2 “normal” slides that contained micrometastases. When applied to our second dataset, LYNA achieved an AUC of 99.6%. LYNA was not affected by common histology artifacts such as overfixation, poor staining, and air bubbles. Conclusions - Artificial intelligence algorithms can exhaustively evaluate every tissue patch on a slide, achieving higher tumor-level sensitivity than, and comparable slide-level performance to, pathologists. These techniques may improve the pathologist's productivity and reduce the number of false negatives associated with morphologic detection of tumor cells. We provide a framework to aid practicing pathologists in assessing such algorithms for adoption into their workflow (akin to how a pathologist assesses immunohistochemistry results). View details