D. Sculley
I'm currently interested in massive-scale machine learning problems for online advertising. My work includes both novel research and applied engineering.
For more details, see my home page.
Authored Publications
Adversarial Nibbler: A DataPerf Challenge for Text-to-Image Models
Hannah Kirk
Jessica Quaye
Charvi Rastogi
Max Bartolo
Oana Inel
Meg Risdal
Will Cukierski
Vijay Reddy
Online (2023)
Abstract
Machine learning progress has been strongly influenced by the data used for model training and evaluation. Only recently, however, have development teams shifted their focus more to the data. This shift has been triggered by numerous reports of biases and errors discovered in AI datasets. The data-centric AI movement thus introduced the notion of iterating on the data used in AI systems, as opposed to the traditional model-centric AI approach, which typically treats the data as a given, static artifact in model development. With the recent advancement of generative AI, the role of data is even more crucial for successfully developing more factual and safe models. DataPerf challenges follow up on recent successful data-centric challenges that draw attention to the data used for training and evaluating machine learning models. Specifically, Adversarial Nibbler focuses on data used for the safety evaluation of generative text-to-image models. A typical bottleneck in safety evaluation is achieving representative diversity and coverage of different types of examples in the evaluation set. Our competition aims to gather a wide range of long-tail and unexpected failure modes for text-to-image models in order to identify as many new problems as possible, and to use various automated approaches to expand the dataset so that it is useful for training, fine-tuning, and evaluation.
Plex: Towards Reliability using Pretrained Large Model Extensions
Du Phan
Mark Patrick Collier
Zi Wang
Zelda Mariet
Clara Huiyi Hu
Neil Band
Tim G. J. Rudner
Karan Singhal
Joost van Amersfoort
Andreas Christian Kirsch
Rodolphe Jenatton
Honglin Yuan
Kelly Buchanan
Yarin Gal
ICML 2022 Pre-training Workshop (2022)
Abstract
A recent trend in artificial intelligence (AI) is the use of pretrained models for language and vision tasks, which has achieved extraordinary performance but also puzzling failures. Examining tasks that probe the model’s abilities in diverse ways is therefore critical to the field. In this paper, we explore the reliability of models, where we define a reliable model as one that not only achieves strong predictive performance but also performs well consistently over many decision-making tasks such as uncertainty (e.g., selective prediction, open set recognition), robust generalization (e.g., accuracy and scoring rules such as log-likelihood on in- and out-of-distribution datasets), and adaptation (e.g., active learning, few-shot learning). We devise 11 types of tasks over 36 datasets in order to evaluate different aspects of reliability on both vision and language domains. To improve reliability, we developed ViT-Plex and T5-Plex, pretrained large-model extensions (henceforth abbreviated as Plex) for vision and language modalities. Plex greatly improves the state-of-the-art across tasks, and as a pretrained model Plex unifies the traditional protocol of designing and tuning one model for each reliability task. We demonstrate scaling effects over model sizes and pretraining dataset sizes up to 4 billion examples. We also demonstrate Plex’s capabilities on new tasks including zero-shot open set recognition, few-shot uncertainty, and uncertainty in conversational language understanding.
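Selective prediction, one of the uncertainty tasks named in the abstract, can be sketched in a few lines: the model abstains whenever its confidence falls below a threshold, trading coverage for accuracy on the examples it does answer. The confidence scores and correctness labels below are hypothetical, not taken from the paper.

```python
def selective_accuracy(confidences, correct, threshold):
    """Accuracy over only the examples where confidence meets the
    threshold; returns (accuracy, coverage)."""
    kept = [c for conf, c in zip(confidences, correct) if conf >= threshold]
    coverage = len(kept) / len(correct)
    accuracy = sum(kept) / len(kept) if kept else 0.0
    return accuracy, coverage

# Hypothetical model outputs: confidence scores and whether each
# prediction was right (1) or wrong (0).
confs = [0.95, 0.90, 0.55, 0.80, 0.40, 0.99]
right = [1, 1, 0, 1, 0, 1]

# Raising the threshold trades coverage for accuracy.
print(selective_accuracy(confs, right, 0.0))   # full coverage
print(selective_accuracy(confs, right, 0.75))  # abstain on low-confidence cases
```

A reliable model in the paper's sense is one whose confidence ordering makes this trade-off favorable: the errors are concentrated among the low-confidence predictions it abstains on.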
Chapter 1B, "Data Management Principles," in _Reliable Machine Learning: Applying SRE Principles to ML in Production_
Cathy Chen
Kranti Parisa
Niall Richard Murphy
Todd Underwood
Reliable Machine Learning: Applying SRE Principles to ML in Production, O'Reilly (2022)
Abstract
Machine learning is rapidly becoming a vital tool for many organizations today. It’s used to increase revenue, optimize decision making, understand customer behavior (and influence it), and solve problems across a very wide set of domains, in some cases at performance levels significantly superior to human ones. Machine learning touches billions of people multiple times a day.
Yet, industry-wide, the state of how organizations implement ML is, frankly, very poor. There isn’t even a framework describing how best to do it; most people are just making it up as they go along. There are many consequences to this, including poorer-quality outcomes for both user and organization, lost revenue opportunities, legal exposure, and so on. Even worse, data, key to the success of ML, has become both a vitally important asset and a critical liability: organizations have not internalized how to manage it.
For all these reasons and more, the industry needs a framework: a way of understanding the issues around running actual, reliable, production-quality ML systems, and a collection of practical and conceptual approaches to “reliable ML for everyone.” That makes it natural to reach for the conceptual framework provided by the Site Reliability Engineering discipline to provide that understanding. Bringing SRE approaches to running production systems helps them be reliable, scale well, and be well monitored, well managed, and useful for customers; analogously, SRE approaches (including the Dickerson hierarchy, SLOs and SLIs, effective data handling, and so on) help machine learning accomplish the same ends.
Yet SRE approaches are not the totality of the story. We provide guidance for model developers, data scientists, and business owners to perform the nuts and bolts of their day-to-day jobs, while also keeping the bigger picture in mind. In other words, this book applies an SRE mindset to machine learning, and shows how to run an effective, efficient, and reliable ML system, whether you are a small startup or a planet-spanning megacorp. It describes what to do whether you are starting from a completely blank slate or already have significant scale. It describes operational approaches, data-centric ways of thinking about production systems, and ethical guidelines, increasingly important in today’s world.
Large-scale machine learning-based phenotyping significantly improves genomic discovery for optic nerve head morphology
Babak Alipanahi
Babak Behsaz
Zachary Ryan Mccaw
Emanuel Schorsch
Lizzie Dorfman
Sonia Phene
Andrew Walker Carroll
Anthony Khawaja
American Journal of Human Genetics (2021)
Abstract
Genome-wide association studies (GWAS) require accurate cohort phenotyping, but expert labeling can be costly, time-intensive, and variable. Here we develop a machine learning (ML) model to predict glaucomatous features from color fundus photographs. We used the model to predict vertical cup-to-disc ratio (VCDR), a diagnostic parameter and cardinal endophenotype for glaucoma, in 65,680 Europeans in the UK Biobank (UKB). A GWAS of ML-based VCDR identified 299 independent genome-wide significant (GWS; P ≤ 5×10⁻⁸) hits in 156 loci. The ML-based GWAS replicated 62 of 65 GWS loci from a recent VCDR GWAS in the UKB for which two ophthalmologists manually labeled images for 67,040 Europeans. The ML-based GWAS also identified 93 novel loci, significantly expanding our understanding of the genetic etiologies of glaucoma and VCDR. Pathway analyses support the biological significance of the novel hits to VCDR, with select loci near genes involved in neuronal and synaptic biology or known to cause severe Mendelian ophthalmic disease. Finally, the ML-based GWAS results significantly improve polygenic prediction of VCDR in independent datasets.
A Voice-Activated Switch for Persons with Motor and Speech Impairments: Isolated-Vowel Spotting Using Neural Networks
Lisie Lillianfeld
Katie Seaver
Jordan R. Green
InterSpeech 2021 (2021)
Abstract
Severe speech impairments limit the precision and range of producible speech sounds. As a result, generic automatic speech recognition (ASR) and keyword spotting (KWS) systems are unable to accurately recognize the utterances produced by individuals with severe speech impairments. This paper describes an approach in which simple speech sounds, namely isolated open vowels (e.g., /a/), are used in lieu of more motorically demanding keywords. A neural network (NN) is trained to detect these isolated open vowels uttered by individuals with speech impairments against background noise. The NN is trained with a two-phase approach. The pre-training phase uses samples from unimpaired speakers along with samples of background noises and unrelated speech; the fine-tuning phase then uses vowel samples collected from individuals with speech impairments. This model can be built into an experimental mobile app that allows users to activate preconfigured actions such as alerting caregivers. Preliminary user testing indicates the model has the potential to be a useful and flexible emergency communication channel for motor- and speech-impaired individuals.
Underspecification Presents Challenges for Credibility in Modern Machine Learning
Dan Moldovan
Ben Adlam
Babak Alipanahi
Alex Beutel
Christina Chen
Jon Deaton
Matthew D. Hoffman
Shaobo Hou
Neil Houlsby
Ghassen Jerfel
Yian Ma
Diana Mincu
Akinori Mitani
Andrea Montanari
Christopher Nielsen
Thomas Osborne
Rajiv Raman
Kim Ramasamy
Martin Gamunu Seneviratne
Shannon Sequeira
Harini Suresh
Victor Veitch
Steve Yadlowsky
Journal of Machine Learning Research (2020)
Abstract
ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.
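The core phenomenon can be illustrated with a toy example (mine, not the paper's): when two features are perfectly correlated in the training domain, two different predictors fit it equally well, and training performance alone cannot distinguish them.

```python
# Toy illustration of underspecification: feature 0 and feature 1 are
# perfectly correlated in the training data, so a pipeline could return
# either predictor below with identical held-out performance.
train = [((0, 0), 0), ((1, 1), 1), ((0, 0), 0), ((1, 1), 1)]

predictor_a = lambda x: x[0]  # relies on feature 0
predictor_b = lambda x: x[1]  # relies on feature 1 (spuriously correlated)

def accuracy(predictor, data):
    return sum(predictor(x) == y for x, y in data) / len(data)

# Indistinguishable on data from the training domain...
assert accuracy(predictor_a, train) == accuracy(predictor_b, train) == 1.0

# ...but they diverge once deployment breaks the correlation.
shifted = [((0, 1), 0), ((1, 0), 1)]
print(accuracy(predictor_a, shifted))  # 1.0
print(accuracy(predictor_b, shifted))  # 0.0
```

Deep learning pipelines face the same ambiguity at vastly larger scale: many weight settings achieve equivalent held-out accuracy while encoding very different functions.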
Population Based Optimization for Biological Sequence Design
Zelda Mariet
David Martin Dohan
ICML 2020 (2020)
Abstract
The use of black-box optimization for the design of new biological sequences is an emerging research area with potentially revolutionary impact. The cost and latency of wet-lab experiments require methods that find good sequences in few experimental rounds of large batches of sequences, a setting that off-the-shelf black-box optimization methods are ill-equipped to handle. We find that the performance of existing methods varies drastically across optimization tasks, posing a significant obstacle to real-world applications. To improve robustness, we propose population-based optimization (PBO), which generates batches of sequences by sampling from an ensemble of methods. The number of sequences sampled from any method is proportional to the quality of sequences it previously proposed, allowing PBO to combine the strengths of individual methods while hedging against their innate brittleness. Adapting the population of methods online using evolutionary optimization further improves performance. Through extensive experiments on in silico optimization tasks, we show that PBO outperforms any single method in its population, proposing both higher-quality single sequences as well as more diverse batches. By its robustness and ability to design diverse, high-quality sequences, PBO is shown to be a new state-of-the-art approach to the batched black-box optimization of biological sequences.
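The sampling rule described above can be sketched as follows. The two proposer methods and their quality scores are hypothetical stand-ins, not the paper's implementation; the point is only the allocation rule, where each method's share of the batch is proportional to the quality of its past proposals.

```python
import random

def pbo_batch(proposers, quality, batch_size, rng):
    """Draw a batch of sequences from an ensemble of proposer methods,
    allocating slots to each method in proportion to its quality score."""
    total = sum(quality.values())
    batch = []
    for name, propose in proposers.items():
        n = round(batch_size * quality[name] / total)
        batch.extend(propose(rng) for _ in range(n))
    return batch

# Hypothetical proposers over length-8 DNA sequences.
proposers = {
    "random": lambda rng: "".join(rng.choice("ACGT") for _ in range(8)),
    "mutate": lambda rng: "ACGTACGT"[: rng.randint(4, 8)].ljust(8, "A"),
}
quality = {"random": 1.0, "mutate": 3.0}  # "mutate" earned 3x the reward so far

rng = random.Random(0)
batch = pbo_batch(proposers, quality, batch_size=8, rng=rng)
print(len(batch))  # 8: 2 slots for "random", 6 for "mutate"
```

In general the per-method rounding needs care to keep the batch exactly on size; the sketch ignores that detail.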
Abstract
Traditionally there has been a perceived trade-off between machine learning code that is easy to write and machine learning code that is fast, scalable, or easy to distribute, with platforms like TensorFlow, Theano, PyTorch, and Autograd inhabiting different points along this trade-off curve. PyTorch and Autograd offer the coding benefits of imperative programming style and accept the computational trade-off of interpretive overhead. TensorFlow and Theano give the benefit of whole-program optimization based on defined computation graphs, with the trade-off of potentially cumbersome graph-based semantics and associated developer overhead, which become especially apparent for more complex model types that depend on control flow operators. We propose to capture the benefits of both paradigms, using imperative programming style while enabling high-performance program optimization, by using staged programming via source code transformation to essentially compile native Python into a lower-level IR like TensorFlow graphs. A key insight is to delay all type-dependent decisions until runtime, via dynamic dispatch. We instantiate these principles in AutoGraph, a piece of software that improves the programming experience of the TensorFlow machine learning library, and demonstrate the strong usability improvements with no loss in performance compared to native TensorFlow graphs.
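The staged-programming idea can be illustrated with a minimal sketch (mine, not AutoGraph itself): the same imperative Python function either executes eagerly on plain numbers or, via dynamic dispatch on operator overloads, is traced into a graph-like IR, here just an expression string.

```python
# Minimal sketch of staging via dynamic dispatch: symbolic values record
# the operations applied to them instead of computing eagerly.
class Sym:
    def __init__(self, expr):
        self.expr = expr

    def __add__(self, other):
        return Sym(f"({self.expr} + {fmt(other)})")

    def __mul__(self, other):
        return Sym(f"({self.expr} * {fmt(other)})")

def fmt(v):
    return v.expr if isinstance(v, Sym) else repr(v)

def model(x, w, b):
    return x * w + b  # ordinary imperative Python

# Eager execution: type-dependent decisions resolved at runtime on numbers.
print(model(2.0, 3.0, 1.0))  # 7.0

# Staged execution: the same unmodified code emits an IR for later optimization.
print(model(Sym("x"), Sym("w"), Sym("b")).expr)  # ((x * w) + b)
```

AutoGraph goes much further, rewriting source code so that Python control flow (if/while/for) also lowers into graph operators, but the delayed, type-dependent dispatch shown here is the underlying principle.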
Deep Learning Classifies the Protein Universe
Theo Sanderson
Brandon Carter
Mark DePristo
Nature Biotechnology (2019)
Abstract
Understanding the relationship between amino acid sequence and protein function is a long-standing problem in molecular biology with far-reaching scientific implications. Despite six decades of progress, state-of-the-art techniques cannot annotate roughly one-third of microbial protein sequences, hampering our ability to exploit sequences collected from diverse organisms. To address this, we report a deep learning model that learns the relationship between unaligned amino acid sequences and their functional classification across all 17,929 families of the Pfam database. Using the Pfam seed sequences, we establish a rigorous benchmark assessment and find that a dilated convolutional model reduces the error of state-of-the-art BLASTp and pHMM models by a factor of nine. With 80% of the full Pfam database we train a protein family predictor that is more accurate and over 200 times faster than BLASTp, while learning sequence features such as structural disorder and transmembrane helices. Our model co-locates sequences from unseen families in embedding space far from existing families, allowing sequences from novel families to be classified. We anticipate that deep learning models will be a core component of future general-purpose protein function prediction tools.
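The dilated convolution at the core of the model above can be sketched in one dimension (illustrative filter weights, not the trained network): dilation inserts gaps between kernel taps, so the receptive field grows without adding weights, which is how stacked layers come to span long protein sequences.

```python
def dilated_conv1d(seq, kernel, dilation):
    """Valid-mode 1-D convolution whose kernel taps are spaced
    `dilation` positions apart."""
    span = (len(kernel) - 1) * dilation  # receptive field minus one
    return [
        sum(kernel[k] * seq[i + k * dilation] for k in range(len(kernel)))
        for i in range(len(seq) - span)
    ]

seq = [1, 2, 3, 4, 5, 6, 7, 8]
kernel = [1, 0, -1]  # a simple difference filter

print(dilated_conv1d(seq, kernel, dilation=1))  # differences 2 apart
print(dilated_conv1d(seq, kernel, dilation=3))  # same 3 weights, wider span
```

With dilation 1 the three taps cover 3 positions; with dilation 3 the same three weights cover 7, and doubling the dilation at each layer makes the receptive field grow exponentially with depth.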
Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift
Yaniv Ovadia
Sebastian Nowozin
Josh Dillon
Advances in Neural Information Processing Systems (2019)
Abstract
Modern machine learning methods including deep learning have achieved great success in predictive accuracy for supervised learning tasks, but may still fall short in giving useful estimates of their predictive uncertainty. Quantifying uncertainty is especially critical in real-world settings, which often involve distributions that are skewed from the training distribution due to a variety of factors including sample bias and non-stationarity. In such settings, well-calibrated uncertainty estimates convey information about when a model's output should (or should not) be trusted. Many probabilistic deep learning methods, including Bayesian and non-Bayesian methods, have been proposed in the literature for quantifying predictive uncertainty, but to our knowledge there has not previously been a rigorous large-scale empirical comparison of these methods under conditions of distributional skew. We present a large-scale benchmark of existing state-of-the-art methods on classification problems and investigate the effect of distributional skew on accuracy and calibration. We find that traditional post-hoc calibration falls short and some Bayesian methods are intractable for very large data. However, methods that marginalize over models give surprisingly strong results across a broad spectrum.
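One common metric behind calibration comparisons like those above is expected calibration error (ECE), sketched here on hypothetical data: bin predictions by confidence, then compare each bin's accuracy to its average confidence, weighting by bin size.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: size-weighted average gap between a confidence bin's mean
    confidence and its empirical accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, c in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, c))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(c for _, c in members) / len(members)
        ece += len(members) / len(confidences) * abs(avg_conf - accuracy)
    return ece

# Hypothetical model: 90% confident on five predictions but right on
# only four, so it is overconfident by 0.1.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9, 0.9],
                                 [1, 1, 1, 1, 0]))
```

A perfectly calibrated model scores 0; under dataset shift the paper finds that accuracy and calibration both degrade, and post-hoc recalibration fitted in-distribution does not fix the gap.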