George Dahl
George Dahl received his Ph.D. from the University of Toronto under the supervision of Geoff Hinton, where he worked on deep learning approaches to problems in speech recognition, computational chemistry, and natural language text processing, including some of the first successful deep acoustic models. He has been a research scientist at Google on the Brain team since 2015. His research focuses on highly flexible models that learn their own features, end-to-end, and make efficient use of data and computation for supervised, unsupervised, and reinforcement learning. In particular, he is interested in applications to linguistic and perceptual data as well as chemical, biological, and medical data.
Google Scholar profile
Authored Publications
A Loss Curvature Perspective On Training Instability in Deep Learning
Justin Gilmer
Behrooz Ghorbani
Ankush Garg
David Cardoze
ICLR (2022)
In this work, we study the evolution of the loss Hessian across many classification tasks in order to understand the effect the curvature of the loss has on the training dynamics. Whereas prior work has focused on how different learning rates affect the loss Hessian observed during training, we also analyze the effects of model initialization, architectural choices, and common training heuristics such as gradient clipping and learning rate warmup. Our results demonstrate that successful model and hyperparameter choices allow the early optimization trajectory to either avoid regions of high curvature or navigate out of them, into flatter regions that tolerate a higher learning rate. Our results suggest a unifying perspective on how disparate mitigation strategies for training instability ultimately address the same underlying failure mode of neural network optimization, namely poor conditioning. Inspired by the conditioning perspective, we show that learning rate warmup can improve training stability just as much as batch normalization, layer normalization, MetaInit, GradInit, and Fixup initialization.
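As a rough illustration of the curvature measurement this analysis relies on, here is a minimal sketch (ours, not the paper's code; the toy least-squares problem and all names are illustrative) that tracks the largest Hessian eigenvalue via power iteration on Hessian-vector products:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(50, 10))
    b = rng.normal(size=50)

    def grad(w):
        # Gradient of the least-squares loss 0.5 * ||A @ w - b||^2.
        return A.T @ (A @ w - b)

    def hvp(w, v, eps=1e-4):
        # Finite-difference approximation of the Hessian-vector product H @ v.
        return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

    def max_eigenvalue(w, iters=50):
        # Power iteration: repeated Hessian-vector products converge to the
        # top eigenvector, whose Rayleigh quotient gives lambda_max.
        v = rng.normal(size=w.shape)
        for _ in range(iters):
            hv = hvp(w, v)
            v = hv / np.linalg.norm(hv)
        return v @ hvp(w, v)

    w = rng.normal(size=10)
    eta = 0.01
    for _ in range(100):
        w -= eta * grad(w)  # plain gradient descent
    print(f"lambda_max = {max_eigenvalue(w):.2f}, stability bound 2/eta = {2 / eta:.1f}")

Gradient descent on a quadratic is stable only while this eigenvalue stays below 2/eta, which is why trajectories that escape high-curvature regions can tolerate a higher learning rate.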
Adaptive Gradient Methods at the Edge of Stability
Behrooz Ghorbani
David Cardoze
Jeremy Cohen
Justin Gilmer
Shankar Krishnan
NeurIPS 2022 (to appear)
Little is known about the training dynamics of adaptive gradient methods like Adam in deep learning. In this paper, we shed light on the behavior of these algorithms in the full-batch and sufficiently large batch settings. Specifically, we show that during full-batch training, the maximum eigenvalue of the preconditioned Hessian typically equilibrates at the stability threshold of a related non-adaptive algorithm. For Adam with step size η and β₁ = 0.9, this stability threshold is 38/η. Similar effects occur during minibatch training, especially as the batch size grows. Yet, even though adaptive methods train at the “Edge of Stability,” their behavior in this regime differs in a crucial way from that of their non-adaptive counterparts. Whereas non-adaptive algorithms are forced to remain in low-curvature regions of the loss landscape, we demonstrate that adaptive gradient methods often advance into high-curvature regions, while adapting the preconditioner to compensate. We believe that our findings will serve as a foundation for the community’s future understanding of adaptive gradient methods in deep learning.
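The 38/η figure is consistent with the classical quadratic stability bound extended to Adam's EMA-style momentum; as a worked check (the general form below is our reading of that analysis, not quoted from the paper):

    % stability bound with EMA momentum (our reading), and its value at beta_1 = 0.9
    \lambda_{\max}\left(P^{-1} H\right) \;\le\; \frac{2 + 2\beta_1}{(1 - \beta_1)\,\eta},
    \qquad
    \frac{2 + 2(0.9)}{(1 - 0.9)\,\eta} \;=\; \frac{38}{\eta}.

Setting β₁ = 0 recovers the familiar 2/η threshold of plain gradient descent.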
A mobile-optimized artificial intelligence system for gestational age and fetal malpresentation assessment
Ryan Gomes
Bellington Vwalika
Chace Lee
Angelica Willis
Joan T. Price
Christina Chen
Margaret P. Kasaro
James A. Taylor
Elizabeth M. Stringer
Scott Mayer McKinney
Ntazana Sindano
William Goodnight, III
Justin Gilmer
Benjamin H. Chi
Charles Lau
Terry Spitz
Kris Liu
Jonny Wong
Rory Pilgrim
Akib Uddin
Lily Hao Yi Peng
Kat Chou
Jeffrey S. A. Stringer
Shravya Ramesh Shetty
Communications Medicine (2022)
Background
Fetal ultrasound is an important component of antenatal care, but a shortage of adequately trained healthcare workers has limited its adoption in low-to-middle-income countries. This study investigated the use of artificial intelligence for fetal ultrasound in under-resourced settings.
Methods
Blind sweep ultrasounds, consisting of six freehand ultrasound sweeps, were collected by sonographers in the USA and Zambia, and novice operators in Zambia. We developed artificial intelligence (AI) models that used blind sweeps to predict gestational age (GA) and fetal malpresentation. AI GA estimates and standard fetal biometry estimates were compared to a previously established ground truth, and evaluated for difference in absolute error. Fetal malpresentation (non-cephalic vs cephalic) was compared to sonographer assessment. On-device AI model run-times were benchmarked on Android mobile phones.
Results
Here we show that the GA estimation accuracy of the AI model is non-inferior to standard fetal biometry estimates (error difference -1.4 ± 4.5 days, 95% CI -1.8 to -0.9, n=406). Non-inferiority is maintained when blind sweeps are acquired by novice operators performing only two of six sweep motion types. Fetal malpresentation AUC-ROC is 0.977 (95% CI 0.949 to 1.00, n=613); sonographers and novices have similar AUC-ROC. Software run-times on mobile phones for both diagnostic models are less than 3 seconds after completion of a sweep.
Conclusions
The gestational age model is non-inferior to the clinical standard, and the fetal malpresentation model has high AUC-ROCs across operators and devices. Our AI models are able to run on-device, without internet connectivity, and provide feedback scores to assist lightly trained ultrasound operators in low-resource settings.
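The non-inferiority claim reduces to a paired difference in absolute errors; a minimal sketch with synthetic numbers (not study data; the error distributions below are invented):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 406
    # Invented absolute GA errors in days; the study compared each method
    # against a previously established ground truth.
    ai_error = np.abs(rng.normal(0.0, 5.0, n))
    biometry_error = np.abs(rng.normal(0.0, 6.5, n))

    diff = ai_error - biometry_error  # negative values favor the AI model
    mean = diff.mean()
    se = diff.std(ddof=1) / np.sqrt(n)
    print(f"error difference {mean:+.1f} days, "
          f"95% CI ({mean - 1.96 * se:.1f}, {mean + 1.96 * se:.1f})")

Non-inferiority holds when the upper confidence bound stays below the prespecified margin.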
Machine learning guided aptamer discovery
Ali Bashir
Geoff Davis
Michelle Therese Dimon
Qin Yang
Scott Ferguson
Zan Armstrong
Nature Communications (2021)
Aptamers are discovered by searching a large library for sequences with desirable binding properties. These libraries, however, are physically constrained to a fraction of the theoretical sequence space and limited to sampling strategies that are easy to scale. Integrating machine learning could enable identification of high-performing aptamers across this unexplored fitness landscape. We employed particle display (PD) to partition aptamers by affinity and trained neural network models to improve physically derived aptamers and predict affinity in silico. These predictions were used to locally improve physically derived aptamers as well as to identify completely novel, high-affinity aptamers de novo. We experimentally validated the predictions, improving aptamer candidate designs at a rate 10-fold higher than random perturbation and generating novel aptamers at a rate 448-fold higher than PD alone. We characterized the explanatory power of the models globally and locally and showed successful sequence truncation while maintaining affinity. This work combines machine learning and physical discovery, uses principles that are widely applicable to other display technologies, and provides a path forward for better diagnostic and therapeutic agents.
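A minimal sketch of the model-guided local improvement loop (illustrative only: the affinity function below is a toy stand-in for the trained neural network, and the greedy acceptance rule is our assumption):

    import random

    BASES = "ACGT"
    random.seed(0)

    def predicted_affinity(seq):
        # Toy stand-in for the trained affinity model.
        return seq.count("GC") + 0.1 * seq.count("A")

    def improve(seq, rounds=200):
        # Hill-climb: propose single-base mutations, keep predicted improvements.
        best, best_score = seq, predicted_affinity(seq)
        for _ in range(rounds):
            i = random.randrange(len(best))
            candidate = best[:i] + random.choice(BASES) + best[i + 1:]
            score = predicted_affinity(candidate)
            if score > best_score:
                best, best_score = candidate, score
        return best, best_score

    print(improve("ACGTACGTACGTACGTACGT"))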
Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model
Guodong Zhang
James Martens
Sushant Sachdeva
Chris Shallue
Roger Grosse
Conference on Neural Information Processing Systems (NeurIPS) (2019)
Increasing the batch size is a popular way to speed up neural network training, but beyond some critical batch size, larger batch sizes yield diminishing returns. In this work, we study how the critical batch size changes based on properties of the optimization algorithm, including acceleration and preconditioning, through two different lenses: large-scale experiments and analysis of a simple noisy quadratic model (NQM). We experimentally demonstrate that optimization algorithms that employ preconditioning, specifically Adam and K-FAC, result in much larger critical batch sizes than stochastic gradient descent with momentum. We also demonstrate that the NQM captures many of the essential features of real neural network training, despite being drastically simpler to work with. The NQM predicts our results with preconditioned optimizers, previous results with accelerated gradient descent, and other results around optimal learning rates and large batch training, making it a useful tool to generate testable predictions about neural network optimization.
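A minimal sketch of an NQM-style experiment (all parameter choices are ours; a small learning-rate grid stands in for careful tuning): SGD on a diagonal quadratic with gradient noise that shrinks as the batch grows, measuring steps to a target loss.

    import numpy as np

    rng = np.random.default_rng(0)
    h = np.logspace(0, -1, 10)  # eigenvalue spectrum of the diagonal quadratic

    def steps_to_target(batch, eta, target=0.02, max_steps=50000):
        # SGD on loss 0.5 * sum(h * w^2) with gradient noise scaled by
        # 1/sqrt(batch); returns the step count to first reach the target loss.
        w = np.ones_like(h)
        for step in range(1, max_steps + 1):
            g = h * w + rng.normal(size=h.shape) / np.sqrt(batch)
            w = w - eta * g
            if 0.5 * np.sum(h * w ** 2) < target:
                return step
        return max_steps

    for batch in [1, 4, 16, 64, 256, 1024]:
        best = min(steps_to_target(batch, eta)
                   for eta in (0.003, 0.01, 0.03, 0.1, 0.3, 1.0))
        print(f"batch {batch:5d}: {best:6d} steps (best over the learning-rate grid)")

The printed step counts shrink roughly in proportion to the batch size at first and then flatten, which is the critical-batch-size behavior the abstract describes.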
Relational inductive biases, deep learning, and graph networks
Peter Battaglia
Jessica Blake Chandler Hamrick
Victor Bapst
Alvaro Sanchez
Vinicius Zambaldi
Mateusz Malinowski
Andrea Tacchetti
David Raposo
Adam Santoro
Ryan Faulkner
Caglar Gulcehre
Francis Song
Andy Ballard
Justin Gilmer
Ashish Vaswani
Kelsey Allen
Charles Nash
Victoria Jayne Langston
Chris Dyer
Nicolas Heess
Daan Wierstra
Matt Botvinick
Yujia Li
Razvan Pascanu
arXiv (2018)
The purpose of this paper is to explore relational inductive biases in modern AI, especially deep learning, describe a rough taxonomy of existing approaches, and introduce a common mathematical framework for expressing and unifying various approaches. The key theme running through this work is structure: how the world is structured, and how the structure of different computational strategies determines their strengths and weaknesses.
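A minimal sketch of the graph network (GN) block at the center of that framework (our simplification: single linear-plus-tanh update functions in place of the paper's learned MLPs, and sum aggregations):

    import numpy as np

    rng = np.random.default_rng(0)

    def gn_block(nodes, edges, senders, receivers, globals_, params):
        We, Wn, Wg = params
        # 1. Edge update: each edge sees its own state, both endpoints,
        #    and the global state.
        edge_in = np.concatenate(
            [edges, nodes[senders], nodes[receivers],
             np.repeat(globals_[None, :], len(edges), axis=0)], axis=1)
        new_edges = np.tanh(edge_in @ We)
        # 2. Node update: sum-aggregate incoming edges at each receiver.
        agg = np.zeros((len(nodes), new_edges.shape[1]))
        np.add.at(agg, receivers, new_edges)
        node_in = np.concatenate(
            [nodes, agg, np.repeat(globals_[None, :], len(nodes), axis=0)], axis=1)
        new_nodes = np.tanh(node_in @ Wn)
        # 3. Global update: aggregate all nodes and edges.
        glob_in = np.concatenate([globals_, new_nodes.sum(0), new_edges.sum(0)])
        new_globals = np.tanh(glob_in @ Wg)
        return new_nodes, new_edges, new_globals

    # Tiny example graph: 4 nodes, 5 directed edges.
    dn, de, dg = 3, 2, 2
    nodes = rng.normal(size=(4, dn))
    edges = rng.normal(size=(5, de))
    senders = np.array([0, 1, 2, 3, 0])
    receivers = np.array([1, 2, 3, 0, 2])
    globals_ = rng.normal(size=dg)
    params = (rng.normal(size=(de + 2 * dn + dg, de)),
              rng.normal(size=(dn + de + dg, dn)),
              rng.normal(size=(dg + dn + de, dg)))
    print(gn_block(nodes, edges, senders, receivers, globals_, params)[2])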
Peptide-Spectra Matching with Weak Supervision
Sam Schoenholz
Sean Hackett
Laura Deming
Eugene Melamud
Navdeep Jaitly
Fiona McAllister
Jonathon O'Brien
Bryson Bennett
Daphne Koller
arXiv (2018)
As in many other scientific domains, we face a fundamental problem when using machine learning to identify proteins from mass spectrometry data: large ground truth datasets mapping inputs to correct outputs are extremely difficult to obtain. Instead, we have access to imperfect hand-coded models crafted by domain experts. In this paper, we apply deep neural networks to an important step of the protein identification problem, the pairing of mass spectra with short sequences of amino acids called peptides. We train our model to differentiate between top-scoring results from a state-of-the-art classical system and hard-negative second and third place results. Our resulting model is much better at identifying peptides with spectra than the model used to generate its training data. In particular, we achieve a 43% improvement over standard matching methods and a 10% improvement over a combination of the matching method and an industry-standard cross-spectra reranking tool. Importantly, in a more difficult experimental regime that reflects current challenges facing biologists, our advantage over the previous state of the art grows to 15% even after reranking. We believe this approach will generalize to other challenging scientific problems.
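A minimal sketch of this weak-supervision training signal (our framing; the feature vectors and logistic model below are synthetic stand-ins for the spectra representations and network):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 16
    signal = rng.normal(size=d)
    # Synthetic stand-ins: feature vectors for the classical engine's
    # top-ranked matches vs. its hard-negative runners-up.
    top_hits = rng.normal(size=(500, d)) + 0.5 * signal
    hard_negatives = rng.normal(size=(500, d))

    X = np.vstack([top_hits, hard_negatives])
    y = np.concatenate([np.ones(500), np.zeros(500)])

    w = np.zeros(d)
    for _ in range(500):  # logistic regression by gradient descent
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= 0.1 * X.T @ (p - y) / len(y)

    accuracy = (((1.0 / (1.0 + np.exp(-X @ w))) > 0.5) == y).mean()
    print(f"accuracy separating top hits from hard negatives: {accuracy:.2f}")

The labels here come from an imperfect classical scorer rather than ground truth, which is what makes the supervision weak.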
Measuring the Effects of Data Parallelism on Neural Network Training
Chris Shallue
Jaehoon Lee
Jascha Sohl-dickstein
Journal of Machine Learning Research (JMLR) (2018)
Recent hardware developments have made unprecedented amounts of data parallelism available for accelerating neural network training. Among the simplest ways to harness next-generation accelerators is to increase the batch size in standard mini-batch neural network training algorithms. In this work, we aim to experimentally characterize the effects of increasing the batch size on training time, as measured by the number of steps necessary to reach a goal out-of-sample error. Eventually, increasing the batch size will no longer reduce the number of training steps required, but the exact relationship between the batch size and how many training steps are necessary is of critical importance to practitioners, researchers, and hardware designers alike. We study how this relationship varies with the training algorithm, model, and data set and find extremely large variation between workloads. Along the way, we reconcile disagreements in the literature on whether batch size affects model quality. Finally, we discuss the implications of our results for efforts to train neural networks much faster in the future.
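A minimal sketch of how such steps-to-goal measurements reveal a critical batch size (the numbers below are hypothetical, not the paper's data):

    batches = [32, 64, 128, 256, 512, 1024, 2048]
    steps = [8000, 4000, 2000, 1100, 700, 600, 580]  # hypothetical measurements

    for b, s in zip(batches, steps):
        # Under perfect scaling, steps halve as the batch size doubles, so the
        # examples-processed column stays flat until returns diminish.
        print(f"batch {b:5d}: {s:5d} steps, {b * s:9d} examples")

The batch size where examples-processed starts climbing (around 256 in this made-up table) marks the end of the perfect-scaling regime.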
Embedding Text in Hyperbolic Spaces
arXiv (2018)
Natural language text exhibits implicit hierarchical structure in a variety of respects. Ideally we could incorporate our prior knowledge of the existence of some sort of hierarchy into unsupervised learning algorithms that work on text data. Recent work by Nickel and Kiela (2017) proposed using hyperbolic instead of Euclidean embedding spaces to represent hierarchical data and demonstrated encouraging results on supervised embedding tasks. In this work, we apply their approach to unsupervised learning of word and sentence embeddings. Although we obtain mildly positive results, we describe the challenges we faced in using the hyperbolic metric for these problems both in terms of improving performance in downstream tasks and in understanding the learned hierarchical structures.
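For reference, the hyperbolic metric in question is the distance in the Poincaré ball model used by Nickel and Kiela (2017), for points u, v with norm less than 1:

    % Poincare ball distance (Nickel and Kiela, 2017)
    d(u, v) = \operatorname{arcosh}\!\left(1 + 2\,\frac{\lVert u - v \rVert^{2}}{\left(1 - \lVert u \rVert^{2}\right)\left(1 - \lVert v \rVert^{2}\right)}\right)

Distances blow up as points approach the boundary, which is what lets trees embed with low distortion.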
Motivating the Rules of the Game for Adversarial Example Research
arXiv (2018)
Advances in machine learning have led to broad deployment of systems with impressive performance on important problems. Nonetheless, these systems can be induced to make errors on data that are surprisingly similar to examples the learned system handles correctly. The existence of these errors raises a variety of questions about out-of-sample generalization and whether bad actors might use such examples to abuse deployed systems. As a result of these security concerns, there has been a flurry of recent papers proposing algorithms to defend against such malicious perturbations of correctly handled examples. It is unclear how such misclassifications represent a different kind of security problem than other errors, or even other attacker-produced examples that have no specific relationship to an uncorrupted input. In this paper, we argue that adversarial example defense papers have, to date, mostly considered abstract, toy games that do not relate to any specific security concern. Furthermore, defense papers have not yet precisely described all the abilities and limitations of attackers that would be relevant in practical security. Towards this end, we establish a taxonomy of motivations, constraints, and abilities for more plausible adversaries. Finally, we provide a series of recommendations outlining a path forward for future work to more clearly articulate the threat model and perform more meaningful evaluation.