George Dahl
George Dahl received his Ph.D. from the University of Toronto under the supervision of Geoff Hinton, where he worked on deep learning approaches to problems in speech recognition, computational chemistry, and natural language text processing, including some of the first successful deep acoustic models. He has been a research scientist at Google on the Brain team since 2015. His research focuses on highly flexible models that learn their own features, end-to-end, and make efficient use of data and computation for supervised, unsupervised, and reinforcement learning. In particular, he is interested in applications to linguistic and perceptual data as well as chemical, biological, and medical data.
Authored Publications
Adaptive Gradient Methods at the Edge of Stability
Behrooz Ghorbani
David Cardoze
Jeremy Cohen
Justin Gilmer
Naman Agarwal
Shankar Krishnan
NeurIPS 2022 (2022) (to appear)
Little is known about the training dynamics of adaptive gradient methods like Adam in deep learning. In this paper, we shed light on the behavior of these algorithms in the full-batch and sufficiently large batch settings. Specifically, we show that during full-batch training, the maximum eigenvalue of the preconditioned Hessian typically equilibrates at the stability threshold of a related non-adaptive algorithm. For Adam with step size $\eta$ and $\beta_1 = 0.9$, this stability threshold is $38/\eta$. Similar effects occur during minibatch training, especially as the batch size grows. Yet, even though adaptive methods train at the “Edge of Stability,” their behavior in this regime differs in a crucial way from that of their non-adaptive counterparts. Whereas non-adaptive algorithms are forced to remain in low-curvature regions of the loss landscape, we demonstrate that adaptive gradient methods often advance into high-curvature regions, while adapting the preconditioner to compensate. We believe that our findings will serve as a foundation for the community’s future understanding of adaptive gradient methods in deep learning.
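The $38/\eta$ figure quoted in this abstract can be reproduced with a short sketch. It assumes that the "related non-adaptive algorithm" is gradient descent with exponential-moving-average momentum, and it uses the standard quadratic stability bound $(2 + 2\beta_1) / ((1 - \beta_1)\eta)$ for that method. The function names and the one-dimensional quadratic are illustrative and are not taken from the paper.

```python
def ema_momentum_threshold(eta, beta1):
    """Curvature above which EMA-momentum gradient descent diverges on a quadratic.

    Assumed update: m <- beta1 * m + (1 - beta1) * grad, x <- x - eta * m.
    """
    return (2 + 2 * beta1) / ((1 - beta1) * eta)


def final_magnitude(curvature, eta, beta1, steps=200):
    """Run the assumed update on f(x) = 0.5 * curvature * x**2 and return |x|."""
    x, m = 1.0, 0.0
    for _ in range(steps):
        grad = curvature * x
        m = beta1 * m + (1 - beta1) * grad
        x = x - eta * m
    return abs(x)


eta, beta1 = 0.01, 0.9
threshold = ema_momentum_threshold(eta, beta1)  # equals 38 / eta up to rounding
below = final_magnitude(0.9 * threshold, eta, beta1)  # iterates shrink
above = final_magnitude(1.1 * threshold, eta, beta1)  # iterates blow up
print(threshold * eta, below, above)
```

Running the iteration slightly below and slightly above the threshold shows the switch from decay to divergence, which is the sense in which the threshold marks an edge of stability for the non-adaptive method.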
A Loss Curvature Perspective On Training Instability in Deep Learning
Justin Gilmer
Behrooz Ghorbani
Ankush Garg
Behnam Neyshabur
David Cardoze
ICLR (2022)
In this work, we study the evolution of the loss Hessian across many classification tasks in order to understand the effect the curvature of the loss has on the training dynamics. Whereas prior work has focused on how different learning rates affect the loss Hessian observed during training, we also analyze the effects of model initialization, architectural choices, and common training heuristics such as gradient clipping and learning rate warmup. Our results demonstrate that successful model and hyperparameter choices allow the early optimization trajectory to either avoid, or navigate out of, regions of high curvature and into flatter regions that tolerate a higher learning rate. Our results suggest a unifying perspective on how disparate mitigation strategies for training instability ultimately address the same underlying failure mode of neural network optimization, namely poor conditioning. Inspired by the conditioning perspective, we show that learning rate warmup can improve training stability just as much as batch normalization, layer normalization, MetaInit, GradInit, and Fixup initialization.
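For readers unfamiliar with learning rate warmup, a minimal sketch follows. The abstract does not say which schedule the authors used, so the linear ramp, the function name, and the parameter values below are all assumptions chosen for illustration.

```python
def warmup_learning_rate(step, base_lr, warmup_steps):
    """Linear warmup (assumed form): ramp up to base_lr, then hold it constant.

    `step` is the zero-based index of the optimizer step.
    """
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, (step + 1) / warmup_steps)


# Hypothetical settings: ramp over the first 10 steps toward a base rate of 0.1.
schedule = [warmup_learning_rate(s, base_lr=0.1, warmup_steps=10) for s in range(15)]
print(schedule)
```

Small early steps give the optimizer time to leave high-curvature regions before the full learning rate is applied, which matches the conditioning argument made in the abstract.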
A mobile-optimized artificial intelligence system for gestational age and fetal malpresentation assessment
Ryan Gomes
Bellington Vwalika
Chace Lee
Angelica Willis
Joan T. Price
Christina Chen
Margaret P. Kasaro
James A. Taylor
Elizabeth M. Stringer
Scott Mayer McKinney
Ntazana Sindano
William Goodnight, III
Justin Gilmer
Benjamin H. Chi
Charles Lau
Terry Spitz
Kris Liu
Jonny Wong
Rory Pilgrim
Akib Uddin
Lily Hao Yi Peng
Kat Chou
Jeffrey S. A. Stringer
Shravya Ramesh Shetty
Communications Medicine (2022)
Background
Fetal ultrasound is an important component of antenatal care, but shortage of adequately trained healthcare workers has limited its adoption in low-to-middle-income countries. This study investigated the use of artificial intelligence for fetal ultrasound in under-resourced settings.
Methods
Blind sweep ultrasounds, consisting of six freehand ultrasound sweeps, were collected by sonographers in the USA and Zambia, and novice operators in Zambia. We developed artificial intelligence (AI) models that used blind sweeps to predict gestational age (GA) and fetal malpresentation. AI GA estimates and standard fetal biometry estimates were compared to a previously established ground truth, and evaluated for difference in absolute error. Fetal malpresentation (non-cephalic vs cephalic) was compared to sonographer assessment. On-device AI model run-times were benchmarked on Android mobile phones.
Results
Here we show that GA estimation accuracy of the AI model is non-inferior to standard fetal biometry estimates (error difference -1.4 ± 4.5 days, 95% CI -1.8, -0.9, n=406). Non-inferiority is maintained when blind sweeps are acquired by novice operators performing only two of six sweep motion types. Fetal malpresentation AUC-ROC is 0.977 (95% CI, 0.949, 1.00, n=613); sonographers and novices have similar AUC-ROC. Software run-times on mobile phones for both diagnostic models are less than 3 seconds after completion of a sweep.
Conclusions
The gestational age model is non-inferior to the clinical standard and the fetal malpresentation model has high AUC-ROCs across operators and devices. Our AI models are able to run on-device, without internet connectivity, and provide feedback scores to assist in upleveling the capabilities of lightly trained ultrasound operators in low resource settings.
Machine learning guided aptamer discovery
Ali Bashir
Geoff Davis
Michelle Therese Dimon
Qin Yang
Scott Ferguson
Zan Armstrong
Nature Communications (2021)
Aptamers are discovered by searching a large library for sequences with desirable binding properties. These libraries, however, are physically constrained to a fraction of the theoretical sequence space and limited to sampling strategies that are easy to scale. Integrating machine learning could enable identification of high-performing aptamers across this unexplored fitness landscape. We employed particle display (PD) to partition aptamers by affinity and trained neural network models to improve physically-derived aptamers and predict affinity in silico. These predictions were used to locally improve physically derived aptamers as well as identify completely novel, high-affinity aptamers de novo. We experimentally validated the predictions, improving aptamer candidate designs at a rate 10-fold higher than random perturbation, and generating novel aptamers at a rate 448-fold higher than PD alone. We characterized the explanatory power of the models globally and locally and showed successful sequence truncation while maintaining affinity. This work combines machine learning and physical discovery, uses principles that are widely applicable to other display technologies, and provides a path forward for better diagnostic and therapeutic agents.
Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model
Guodong Zhang
James Martens
Sushant Sachdeva
Chris Shallue
Roger Grosse
2019 Conference on Neural Information Processing Systems (2019)
Increasing the batch size is a popular way to speed up neural network training, but beyond some critical batch size, larger batch sizes yield diminishing returns. In this work, we study how the critical batch size changes based on properties of the optimization algorithm, including acceleration and preconditioning, through two different lenses: large scale experiments, and analysis of a simple noisy quadratic model (NQM). We experimentally demonstrate that optimization algorithms that employ preconditioning, specifically Adam and K-FAC, result in much larger critical batch sizes than stochastic gradient descent with momentum. We also demonstrate that the NQM captures many of the essential features of real neural network training, despite being drastically simpler to work with. The NQM predicts our results with preconditioned optimizers, previous results with accelerated gradient descent, and other results around optimal learning rates and large batch training, making it a useful tool to generate testable predictions about neural network optimization.
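The diminishing returns described in this abstract can be illustrated with a toy noisy quadratic model. This is a simplified sketch and not the paper's experimental setup: it covers plain SGD only (no momentum or preconditioning), assumes a gradient noise variance equal to each curvature divided by the batch size, and uses a hypothetical two-dimensional problem, target loss, and learning rate grid.

```python
def steps_to_target(batch_size, lr, curvatures, target, max_steps=20_000):
    """Steps of plain SGD until the expected loss of the toy model reaches `target`.

    Per dimension i, the expected squared parameter follows
        E[x_i^2] <- (1 - lr * h_i)**2 * E[x_i^2] + lr**2 * h_i / batch_size,
    where h_i / batch_size is the assumed gradient noise variance.
    """
    second_moments = [1.0 for _ in curvatures]  # assumed initialization
    for step in range(1, max_steps + 1):
        second_moments = [
            (1 - lr * h) ** 2 * m + lr ** 2 * h / batch_size
            for h, m in zip(curvatures, second_moments)
        ]
        loss = 0.5 * sum(h * m for h, m in zip(curvatures, second_moments))
        if loss <= target:
            return step
    return float("inf")  # target not reached within max_steps


def best_steps(batch_size, curvatures, target, lr_grid):
    """Pick the best learning rate from a small grid for each batch size."""
    return min(steps_to_target(batch_size, lr, curvatures, target) for lr in lr_grid)


CURVATURES = [1.0, 0.1]  # hypothetical curvature spectrum
LR_GRID = [0.01, 0.03, 0.1, 0.3, 1.0]  # hypothetical search grid
BATCH_SIZES = [1, 4, 16, 64, 256, 1024]
steps_by_batch = {b: best_steps(b, CURVATURES, 0.01, LR_GRID) for b in BATCH_SIZES}
for b in BATCH_SIZES:
    print(b, steps_by_batch[b])
```

With perfect scaling, multiplying the batch size by 1024 would divide the step count by 1024. In this toy model the reduction is far smaller once the noise term stops dominating, which is the critical batch size effect in miniature.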
Peptide-Spectra Matching with Weak Supervision
Sam Schoenholz
Sean Hackett
Laura Deming
Eugene Melamud
Navdeep Jaitly
Fiona McAllister
Jonathon O'Brien
Bryson Bennett
Daphne Koller
arXiv (2018)
As in many other scientific domains, we face a fundamental problem when using machine learning to identify proteins from mass spectrometry data: large ground truth datasets mapping inputs to correct outputs are extremely difficult to obtain. Instead, we have access to imperfect hand-coded models crafted by domain experts. In this paper, we apply deep neural networks to an important step of the protein identification problem, the pairing of mass spectra with short sequences of amino acids called peptides. We train our model to differentiate between top scoring results from a state-of-the-art classical system and hard-negative second and third place results. Our resulting model is much better at identifying peptides with spectra than the model used to generate its training data. In particular, we achieve a 43% improvement over standard matching methods and a 10% improvement over a combination of the matching method and an industry standard cross-spectra reranking tool. Importantly, in a more difficult experimental regime that reflects current challenges facing biologists, our advantage over the previous state-of-the-art grows to 15% even after reranking. We believe this approach will generalize to other challenging scientific problems.
Artificial Intelligence Based Breast Cancer Nodal Metastasis Detection: Insights into the Black Box for Pathologists
Timo Kohlberger
Mohammad Norouzi
Jenny Smith
Arash Mohtashamian
Niels Olson
Lily Peng
Jason Hipp
Martin Stumpe
Archives of Pathology & Laboratory Medicine (2018)
Context - Nodal metastasis of a primary tumor influences therapy decisions for a variety of cancers. Histologic identification of tumor cells in lymph nodes can be laborious and error-prone, especially for small tumor foci.
Objective - To evaluate the application and clinical implementation of a state-of-the-art deep learning–based artificial intelligence algorithm (LYmph Node Assistant or LYNA) for detection of metastatic breast cancer in sentinel lymph node biopsies.
Design - Whole slide images were obtained from hematoxylin-eosin–stained lymph nodes from 399 patients (publicly available Camelyon16 challenge dataset). LYNA was developed by using 270 slides and evaluated on the remaining 129 slides. We compared the findings to those obtained from an independent laboratory (108 slides from 20 patients/86 blocks) using a different scanner to measure reproducibility.
Results - LYNA achieved a slide-level area under the receiver operating characteristic (AUC) of 99% and a tumor-level sensitivity of 91% at 1 false positive per patient on the Camelyon16 evaluation dataset. We also identified 2 “normal” slides that contained micrometastases. When applied to our second dataset, LYNA achieved an AUC of 99.6%. LYNA was not affected by common histology artifacts such as overfixation, poor staining, and air bubbles.
Conclusions - Artificial intelligence algorithms can exhaustively evaluate every tissue patch on a slide, achieving higher tumor-level sensitivity than, and comparable slide-level performance to, pathologists. These techniques may improve the pathologist's productivity and reduce the number of false negatives associated with morphologic detection of tumor cells. We provide a framework to aid practicing pathologists in assessing such algorithms for adoption into their workflow (akin to how a pathologist assesses immunohistochemistry results).
Preview abstract
Advances in machine learning have led to broad deployment of systems with impressive performance on important problems. Nonetheless, these systems can be induced to make errors on data that are surprisingly similar to examples the learned system handles correctly. The existence of these errors raises a variety of questions about out-of-sample generalization and whether bad actors might use such examples to abuse deployed systems. As a result of these security concerns, there has been a flurry of recent papers proposing algorithms to defend against such malicious perturbations of correctly handled examples. It is unclear how such misclassifications represent a different kind of security problem than other errors, or even other attacker-produced examples that have no specific relationship to an uncorrupted input. In this paper, we argue that adversarial example defense papers have, to date, mostly considered abstract, toy games that do not relate to any specific security concern. Furthermore, defense papers have not yet precisely described all the abilities and limitations of attackers that would be relevant in practical security. Towards this end, we establish a taxonomy of motivations, constraints, and abilities for more plausible adversaries. Finally, we provide a series of recommendations outlining a path forward for future work to more clearly articulate the threat model and perform more meaningful evaluation.
Large scale distributed neural network training through online distillation
Rohan Anil
Gabriel Pereyra
Alexandre Tachard Passos
Robert Ormandi
Geoffrey Hinton
ICLR (2018)
While techniques such as ensembling and distillation promise model quality improvements when paired with almost any base model, they are seldom used, as the multi-stage training setups they require are cumbersome and the extra hyperparameters introduced make the process of tuning even more expensive. In this paper we explore a variant of distillation which is relatively straightforward to use as it does not require a complicated multi-stage setup. We also show that distillation can be used as a meaningful distributed learning algorithm: instead of independent workers exchanging gradients, which requires worrying about delays and synchronization, independent workers can exchange full model checkpoints. This can be done far less frequently than exchanging gradients, breaking one of the scalability barriers of stochastic gradient descent. We run experiments on Criteo clickthrough rate and on the largest dataset used to date for neural language modeling, which is based on Common Crawl and contains $6\times 10^{11}$ tokens. In these experiments we show we can scale at least $2\times$ as well as the maximum limit of distributed stochastic gradient descent. Finally, we also show that online distillation can dramatically reduce the churn in the predictions between different versions of a model.
Relational inductive biases, deep learning, and graph networks
Peter Battaglia
Jessica Blake Chandler Hamrick
Victor Bapst
Alvaro Sanchez
Vinicius Zambaldi
Mateusz Malinowski
Andrea Tacchetti
David Raposo
Adam Santoro
Ryan Faulkner
Caglar Gulcehre
Francis Song
Andy Ballard
Justin Gilmer
Ashish Vaswani
Kelsey Allen
Charles Nash
Victoria Jayne Langston
Chris Dyer
Nicolas Heess
Daan Wierstra
Matt Botvinick
Yujia Li
Razvan Pascanu
arXiv (2018)
The purpose of this paper is to explore relational inductive biases in modern AI, especially deep learning, describing a rough taxonomy of existing approaches, and introducing a common mathematical framework for expressing and unifying various approaches. The key theme running through this work is structure: how the world is structured, and how the structure of different computational strategies determines their strengths and weaknesses.