D. Sculley

I'm currently interested in massive-scale machine learning problems for online advertising. My work includes both novel research and applied engineering.

For more details, see my home page.

Authored Publications
    Adversarial Nibbler: A DataPerf Challenge for Text-to-Image Models
    Hannah Kirk
    Jessica Quaye
    Charvi Rastogi
    Max Bartolo
    Oana Inel
    Meg Risdal
    Will Cukierski
    Vijay Reddy
    Online (2023)
    Machine learning progress has been strongly influenced by the data used for model training and evaluation. Only recently, however, have development teams shifted their focus more to the data. This shift has been triggered by the numerous reports about biases and errors discovered in AI datasets. Thus, the data-centric AI movement introduced the notion of iterating on the data used in AI systems, as opposed to the traditional model-centric AI approach, which typically treats the data as a given static artifact in model development. With the recent advancement of generative AI, the role of data is even more crucial for successfully developing more factual and safe models. DataPerf challenges follow up on recent successful data-centric challenges drawing attention to the data used for training and evaluation of machine learning models. Specifically, Adversarial Nibbler focuses on data used for safety evaluation of generative text-to-image models. A typical bottleneck in safety evaluation is achieving a representative diversity and coverage of different types of examples in the evaluation set. Our competition aims to gather a wide range of long-tail and unexpected failure modes for text-to-image models in order to identify as many new problems as possible and use various automated approaches to expand the dataset to be useful for training, fine-tuning, and evaluation.
    Plex: Towards Reliability using Pretrained Large Model Extensions
    Du Phan
    Mark Patrick Collier
    Zi Wang
    Zelda Mariet
    Clara Huiyi Hu
    Neil Band
    Tim G. J. Rudner
    Karan Singhal
    Joost van Amersfoort
    Andreas Christian Kirsch
    Rodolphe Jenatton
    Honglin Yuan
    Kelly Buchanan
    Yarin Gal
    ICML 2022 Pre-training Workshop (2022)
    A recent trend in artificial intelligence (AI) is the use of pretrained models for language and vision tasks, which has achieved extraordinary performance but also puzzling failures. Examining tasks that probe the model's abilities in diverse ways is therefore critical to the field. In this paper, we explore the reliability of models, where we define a reliable model as one that not only achieves strong predictive performance but also performs well consistently over many decision-making tasks such as uncertainty (e.g., selective prediction, open set recognition), robust generalization (e.g., accuracy and scoring rules such as log-likelihood on in- and out-of-distribution datasets), and adaptation (e.g., active learning, few-shot learning). We devise 11 types of tasks over 36 datasets in order to evaluate different aspects of reliability on both vision and language domains. To improve reliability, we developed ViT-Plex and T5-Plex, pretrained large-model extensions (henceforth abbreviated as Plex) for vision and language modalities. Plex greatly improves the state-of-the-art across tasks, and as a pretrained model Plex unifies the traditional protocol of designing and tuning one model for each reliability task. We demonstrate scaling effects over model sizes and pretraining dataset sizes up to 4 billion examples. We also demonstrate Plex's capabilities on new tasks including zero-shot open set recognition, few-shot uncertainty, and uncertainty in conversational language understanding.
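    As a concrete illustration of one reliability task named above, selective prediction, here is a minimal Python sketch (not code from the paper; the threshold value and toy probabilities are assumptions): the model predicts only when its top-class probability clears a confidence threshold and abstains otherwise.

        import numpy as np

        def selective_predict(probs, threshold=0.8):
            """Abstain when the top-class probability is below `threshold`.

            probs: (n_examples, n_classes) array of predicted class probabilities.
            Returns predicted class indices, with -1 marking abstentions.
            """
            confidence = probs.max(axis=1)
            preds = probs.argmax(axis=1)
            preds[confidence < threshold] = -1  # abstain on low-confidence examples
            return preds

        # Toy usage: the second example falls below the threshold and is abstained on.
        probs = np.array([[0.95, 0.05], [0.55, 0.45], [0.10, 0.90]])
        print(selective_predict(probs))  # -> [ 0 -1  1]
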
    Chapter 1B, "Data Management Principles," in Reliable Machine Learning: Applying SRE Principles to ML in Production
    Cathy Chen
    Kranti Parisa
    Niall Richard Murphy
    Todd Underwood
    Reliable Machine Learning: Applying SRE Principles to ML in Production, O'Reilly (2022)
    Machine learning is rapidly becoming a vital tool for many organizations today. It's used to increase revenue, optimise decision making, understand customer behaviour (and influence it), and solve problems across a very wide set of domains, in some cases at performance levels significantly superior to human ones. Machine learning touches billions of people multiple times a day. Yet, industry-wide, the state of how organizations implement ML is, frankly, very poor. There isn't even a framework describing how best to do it; most people are just making it up as they go along. There are many consequences to this, including poorer-quality outcomes for both user and organization, lost revenue opportunities, legal exposure, and so on. Even worse is the fact that data, key to the success of ML, has become both a vitally important asset and a critical liability: organizations have not internalized how to manage it. For all these reasons and more, the industry needs a framework: a way of understanding the issues around running actual, reliable, production-quality ML systems, and a collection of practical and conceptual approaches to "reliable ML for everyone". That makes it natural to reach for the conceptual framework provided by the Site Reliability Engineering discipline to provide that understanding. Bringing SRE approaches to running production systems helps them to be reliable, to scale well, and to be well monitored, well managed, and useful for customers; analogously, SRE approaches (including the Dickerson hierarchy, SLOs and SLIs, effective data handling, and so on) help machine learning systems accomplish the same ends. Yet SRE approaches are not the totality of the story. We provide guidance for model developers, data scientists, and business owners to perform the nuts and bolts of their day-to-day jobs while also keeping the bigger picture in mind. In other words, this book applies an SRE mindset to machine learning and shows how to run an effective, efficient, and reliable ML system, whether you are a small startup or a planet-spanning megacorp. It describes what to do whether you are starting from a completely blank slate or already have significant scale. It describes operational approaches, data-centric ways of thinking about production systems, and ethical guidelines, which are increasingly important in today's world.
    Large-scale machine learning-based phenotyping significantly improves genomic discovery for optic nerve head morphology
    Babak Alipanahi
    Babak Behsaz
    Zachary Ryan Mccaw
    Emanuel Schorsch
    Lizzie Dorfman
    Sonia Phene
    Andrew Walker Carroll
    Anthony Khawaja
    American Journal of Human Genetics (2021)
    Genome-wide association studies (GWAS) require accurate cohort phenotyping, but expert labeling can be costly, time-intensive, and variable. Here we develop a machine learning (ML) model to predict glaucomatous features from color fundus photographs. We used the model to predict vertical cup-to-disc ratio (VCDR), a diagnostic parameter and cardinal endophenotype for glaucoma, in 65,680 Europeans in the UK Biobank (UKB). A GWAS of ML-based VCDR identified 299 independent genome-wide significant (GWS; P ≤ 5×10⁻⁸) hits in 156 loci. The ML-based GWAS replicated 62 of 65 GWS loci from a recent VCDR GWAS in the UKB for which two ophthalmologists manually labeled images for 67,040 Europeans. The ML-based GWAS also identified 93 novel loci, significantly expanding our understanding of the genetic etiologies of glaucoma and VCDR. Pathway analyses support the biological significance of the novel hits to VCDR, with select loci near genes involved in neuronal and synaptic biology or known to cause severe Mendelian ophthalmic disease. Finally, the ML-based GWAS results significantly improve polygenic prediction of VCDR in independent datasets.
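    As a small aside on the significance threshold quoted above, here is a hedged Python sketch of filtering hypothetical GWAS summary statistics at the conventional genome-wide level of 5×10⁻⁸; the column names and values are made up for illustration and are not from the study.

        import pandas as pd

        GWS_THRESHOLD = 5e-8  # conventional genome-wide significance level
        summary = pd.DataFrame({
            "variant": ["rs1", "rs2", "rs3"],
            "p_value": [3e-9, 1e-6, 4.9e-8],
        })
        hits = summary[summary["p_value"] <= GWS_THRESHOLD]
        print(hits)  # rs1 and rs3 pass the threshold
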
    Severe speech impairments limit the precision and range of producible speech sounds. As a result, generic automatic speech recognition (ASR) and keyword spotting (KWS) systems are unable to accurately recognize the utterances produced by individuals with severe speech impairments. This paper describes an approach in which simple speech sounds, namely isolated open vowels (e.g., /a/), are used in lieu of more motorically demanding keywords. A neural network (NN) is trained to detect these isolated open vowels uttered by individuals with speech impairments against background noise. The NN is trained with a two-phase approach. The pre-training phase uses samples from unimpaired speakers along with samples of background noises and unrelated speech; the fine-tuning phase then uses vowel samples collected from individuals with speech impairments. This model can be built into an experimental mobile app that allows users to activate preconfigured actions such as alerting caregivers. Preliminary user testing indicates the model has the potential to be a useful and flexible emergency communication channel for motor- and speech-impaired individuals.
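    The two-phase training recipe described above can be sketched in a few lines; the architecture, input shape, learning rates, and dataset objects below are placeholders for illustration, not the system's actual implementation.

        import tensorflow as tf

        def build_vowel_detector(num_classes=2):
            # Small convolutional net over log-mel spectrogram patches (shape is an assumption).
            return tf.keras.Sequential([
                tf.keras.layers.Input(shape=(96, 64, 1)),
                tf.keras.layers.Conv2D(16, 3, activation="relu"),
                tf.keras.layers.GlobalAveragePooling2D(),
                tf.keras.layers.Dense(num_classes, activation="softmax"),
            ])

        model = build_vowel_detector()
        model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                      loss="sparse_categorical_crossentropy")
        # Phase 1: pre-train on unimpaired speech plus background noise and unrelated speech.
        # model.fit(pretrain_ds, epochs=10)           # `pretrain_ds` is a placeholder dataset
        # Phase 2: fine-tune on vowels from speakers with impairments, at a lower learning rate.
        # model.optimizer.learning_rate.assign(1e-4)
        # model.fit(finetune_ds, epochs=5)            # `finetune_ds` is a placeholder dataset
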
    The use of black-box optimization for the design of new biological sequences is an emerging research area with potentially revolutionary impact. The cost and latency of wet-lab experiments require methods that find good sequences in few experimental rounds of large batches of sequences, a setting that off-the-shelf black-box optimization methods are ill-equipped to handle. We find that the performance of existing methods varies drastically across optimization tasks, posing a significant obstacle to real-world applications. To improve robustness, we propose population-based optimization (PBO), which generates batches of sequences by sampling from an ensemble of methods. The number of sequences sampled from any method is proportional to the quality of sequences it previously proposed, allowing PBO to combine the strengths of individual methods while hedging against their innate brittleness. Adapting the population of methods online using evolutionary optimization further improves performance. Through extensive experiments on in-silico optimization tasks, we show that PBO outperforms any single method in its population, proposing both higher-quality single sequences as well as more diverse batches. By its robustness and ability to design diverse, high-quality sequences, PBO is shown to be a new state-of-the-art approach to the batched black-box optimization of biological sequences.
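    The allocation rule at the heart of the approach described above (sample each method in proportion to the quality of sequences it previously proposed) can be sketched as follows; the method interface, quality scores, and batch size are assumptions for illustration, not the paper's implementation.

        import numpy as np

        def propose_batch(methods, past_scores, batch_size, rng=None):
            """Allocate a batch of sequence proposals across an ensemble of methods.

            methods: list of callables, each returning one candidate sequence per call.
            past_scores: non-negative mean quality of each method's earlier proposals.
            """
            rng = rng or np.random.default_rng(0)
            scores = np.asarray(past_scores, dtype=float)
            weights = scores / scores.sum()              # proportional-to-quality allocation
            counts = rng.multinomial(batch_size, weights)
            batch = []
            for method, n in zip(methods, counts):
                batch.extend(method() for _ in range(n))
            return batch

        # Toy usage with stand-in "methods" that emit random 8-mer DNA strings.
        rng = np.random.default_rng(1)
        random_seq = lambda: "".join(rng.choice(list("ACGT"), size=8))
        batch = propose_batch([random_seq, random_seq, random_seq],
                              past_scores=[0.5, 1.0, 2.0], batch_size=12)
        print(len(batch))  # -> 12
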
    Underspecification Presents Challenges for Credibility in Modern Machine Learning
    Dan Moldovan
    Ben Adlam
    Babak Alipanahi
    Alex Beutel
    Christina Chen
    Jon Deaton
    Shaobo Hou
    Neil Houlsby
    Ghassen Jerfel
    Yian Ma
    Diana Mincu
    Akinori Mitani
    Andrea Montanari
    Christopher Nielsen
    Thomas Osborne
    Rajiv Raman
    Kim Ramasamy
    Martin Gamunu Seneviratne
    Shannon Sequeira
    Harini Suresh
    Victor Veitch
    Steve Yadlowsky
    Journal of Machine Learning Research (2020)
    ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.
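    A minimal sketch of the phenomenon described above (not the paper's experimental setup): train several copies of the same pipeline that differ only in random seed, confirm they look equivalent on held-out i.i.d. data, then compare them on a shifted "deployment" set. The synthetic data and model below are assumptions chosen only to make the comparison runnable.

        import numpy as np
        from sklearn.neural_network import MLPClassifier

        rng = np.random.default_rng(0)

        def make_data(n, shift=0.0):
            # Labels depend only on the first two features; `shift` moves the input distribution.
            X = rng.normal(loc=shift, size=(n, 20))
            y = (X[:, 0] + X[:, 1] > 0).astype(int)
            return X, y

        X_train, y_train = make_data(2000)
        X_iid, y_iid = make_data(500)
        X_shift, y_shift = make_data(500, shift=1.5)  # stand-in for a deployment-domain shift

        for seed in range(3):  # identical pipeline, different random seeds
            model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=seed)
            model.fit(X_train, y_train)
            print(seed, model.score(X_iid, y_iid), model.score(X_shift, y_shift))
        # In an underspecified pipeline, the i.i.d. scores match across seeds while the
        # shifted-set scores can differ from seed to seed.
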
    Modern machine learning methods, including deep learning, have achieved great success in predictive accuracy for supervised learning tasks, but may still fall short in giving useful estimates of their predictive uncertainty. Quantifying uncertainty is especially critical in real-world settings, which often involve distributions that are skewed from the training distribution due to a variety of factors including sample bias and non-stationarity. In such settings, well-calibrated uncertainty estimates convey information about when a model's output should (or should not) be trusted. Many probabilistic deep learning methods, including Bayesian and non-Bayesian methods, have been proposed in the literature for quantifying predictive uncertainty, but to our knowledge there has not previously been a rigorous large-scale empirical comparison of these methods under conditions of distributional skew. We present a large-scale benchmark of existing state-of-the-art methods on classification problems and investigate the effect of distributional skew on accuracy and calibration. We find that traditional post-hoc calibration falls short and some Bayesian methods are intractable for very large data. However, methods that marginalize over models give surprisingly strong results across a broad spectrum.
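    One metric commonly used in calibration benchmarks of this kind is expected calibration error (ECE). Below is a minimal sketch of the standard equal-width-bin version; the toy confidences and labels are made up, and this is not code from the benchmark itself.

        import numpy as np

        def expected_calibration_error(confidences, correct, n_bins=10):
            """Equal-width-bin ECE: mean |accuracy - confidence| gap, weighted by bin size."""
            confidences = np.asarray(confidences, dtype=float)
            correct = np.asarray(correct, dtype=float)
            bins = np.linspace(0.0, 1.0, n_bins + 1)
            ece = 0.0
            for lo, hi in zip(bins[:-1], bins[1:]):
                mask = (confidences > lo) & (confidences <= hi)
                if mask.any():
                    ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
            return ece

        # Toy example: an overconfident model shows a large gap between confidence and accuracy.
        conf = np.array([0.95, 0.90, 0.92, 0.97])
        hits = np.array([1, 0, 1, 0])
        print(expected_calibration_error(conf, hits))
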
    The Inclusive Images Competition
    Igor Ivanov
    Miha Skalic
    Pallavi Baljekar
    Pavel Ostyakov
    Roman Solovyev
    Weimin Wang
    Yoni Halpern
    Springer Series (2019)
    Popular large image classification datasets that are drawn from the web present Eurocentric and Americentric biases that negatively impact the generalizability of models trained on them. In order to encourage the development of modeling approaches that generalize well to images drawn from locations and cultural contexts that are unseen or poorly represented at the time of training, we organized the Inclusive Images competition in association with Kaggle and the NeurIPS 2018 Competition Track Workshop. In this chapter, we describe the motivation and design of the competition, present reports from the top three competitors, and provide high-level takeaways from the competition results.
    Understanding the relationship between amino acid sequence and protein function is a long-standing problem in molecular biology with far-reaching scientific implications. Despite six decades of progress, state-of-the-art techniques cannot annotate roughly one-third of microbial protein sequences, hampering our ability to exploit sequences collected from diverse organisms. To address this, we report a deep learning model that learns the relationship between unaligned amino acid sequences and their functional classification across all 17,929 families of the Pfam database. Using the Pfam seed sequences, we establish a rigorous benchmark assessment and find that a dilated convolutional model reduces the error of state-of-the-art BLASTp and pHMM models by a factor of nine. With 80% of the full Pfam database, we train a protein family predictor that is more accurate and over 200 times faster than BLASTp, while learning sequence features such as structural disorder and transmembrane helices. Our model co-locates sequences from unseen families in embedding space far from existing families, allowing sequences from novel families to be classified. We anticipate that deep learning models will be a core component of future general-purpose protein function prediction tools.
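    A hedged sketch of a dilated 1-D convolutional classifier over amino-acid sequences, in the spirit of the model described above; the layer sizes, vocabulary size, and maximum sequence length are placeholders rather than the published architecture (only the output dimension matches the 17,929 Pfam families mentioned in the abstract).

        import tensorflow as tf

        VOCAB_SIZE = 25       # amino-acid tokens plus padding (an assumption)
        NUM_FAMILIES = 17929  # Pfam families, per the abstract
        MAX_LEN = 512         # placeholder maximum sequence length

        inputs = tf.keras.layers.Input(shape=(MAX_LEN,), dtype="int32")
        x = tf.keras.layers.Embedding(VOCAB_SIZE, 64)(inputs)
        for rate in (1, 2, 4, 8):  # stacked dilated convolutions widen the receptive field
            x = tf.keras.layers.Conv1D(128, kernel_size=3, dilation_rate=rate,
                                       padding="same", activation="relu")(x)
        x = tf.keras.layers.GlobalMaxPooling1D()(x)
        outputs = tf.keras.layers.Dense(NUM_FAMILIES, activation="softmax")(x)
        model = tf.keras.Model(inputs, outputs)
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
        model.summary()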