Jump to Content
D. Sculley

D. Sculley

I'm currently interested in massive scale machine learning problems for online advertising. My work includes both novel research and applied engineering.

For more details, see my home page.

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Adversarial Nibbler: A DataPerf Challenge for Text-to-Image Models
    Hannah Kirk
    Jessica Quaye
    Charvi Rastogi
    Max Bartolo
    Oana Inel
    Meg Risdal
    Will Cukierski
    Vijay Reddy
    Lora Aroyo
    Online (2023)
    Preview abstract Machine learning progress has been strongly influenced by the data used for model training and evaluation. Only recently however, have development teams shifted their focus more to the data. This shift has been triggered by the numerous reports about biases and errors discovered in AI datasets. Thus, the data-centric AI movement introduced the notion of iterating on the data used in AI systems, as opposed to the traditional model-centric AI approach, which typically treats the data as a given static artifact in model development. With the recent advancement of generative AI, the role of data is even more crucial for successfully developing more factual and safe models. DataPerf challenges follow up on recent successful data- centric challenges drawing attention to the data used for training and evaluation of machine learning model. Specifically, Adversarial Nibbler focuses on data used for safety evaluation of generative text-to-image models. A typical bottleneck in safety evaluation is achieving a representative diversity and coverage of different types of examples in the evaluation set. Our competition aims to gather a wide range of long-tail and unexpected failure modes for text-to-image models in order to identify as many new problems as possible and use various automated approaches to expand the dataset to be useful for training, fine tuning, and evaluation. View details
    Chapter 1B "Data Management Principles" _Reliable Machine Learning: Applying SRE Principles to ML in Production_
    Cathy Chen
    Kranti Parisa
    Niall Richard Murphy
    Todd Underwood
    Reliable Machine Learning: Applying SRE Principles to ML in Production, O'Reilly (2022)
    Preview abstract Machine learning is rapidly becoming a vital tool for many organizations today. It’s used to increase revenue, optimise decision making, understand customer behaviour (and influence it), and solve problems across a very wide set of domains, in some cases at performance levels significantly superior to human ones. Machine learning touches billions of people multiple times a day. Yet, industry-wide, the state of how organizations implement ML is, frankly, very poor. There isn’t even a framework describing how best to do it - most people are just making it up as they go along. There are many consequences to this, including poorer quality outcomes for both user and organization, lost revenue opportunities, legal exposures, and so on. Even worse is the fact that data, key to the success of ML, has become both a vitally important asset and a critical liability: organizations have not internalized how to manage it. For all these reasons and more, the industry needs a framework -- a way of understanding the issues around running actual, reliable, production-quality ML systems, and a collection of the actual practical and conceptual approaches to “reliable ML for everyone”. That makes it natural to reach for the conceptual framework provided by the Site Reliability Engineering discipline to provide that understanding. Bringing SRE approaches to running production systems helps them to be reliable, to scale well, to be well monitored, managed, and useful for customers; analogously, SRE approaches (including the Dickerson hierarchy, SLO & SLIs, effective data handling, and so on) for machine learning help to accomplish the same ends. Yet SRE approaches are not the totality of the story. We provide guidance for model developers, data scientists, and business owners to perform the nuts and bolts of their day to day jobs, while also keeping the bigger picture in mind. In other words, this book applies an SRE mindset to machine learning, and shows how to run an effective, efficient, and reliable ML system, whether you are a small startup or a planet-spanning megacorp. It will describe what to do whether you are starting from a completely blank slate, or have significant scale already. It will describe operational approaches, data-centric ways of thinking about production systems, and ethical guidelines - increasing important in today’s world. View details
    Plex: Towards Reliability using Pretrained Large Model Extensions
    Dustin Tran
    Du Phan
    Mark Patrick Collier
    Zi Wang
    Zelda Mariet
    Clara Huiyi Hu
    Neil Band
    Tim G. J. Rudner
    Karan Singhal
    Joost van Amersfoort
    Andreas Christian Kirsch
    Rodolphe Jenatton
    Honglin Yuan
    Kelly Buchanan
    Yarin Gal
    Zoubin Ghahramani
    Jasper Roland Snoek
    ICML 2022 Pre-training Workshop (2022)
    Preview abstract A recent trend in artificial intelligence (AI) is the use of pretrained models for language and vision tasks, which has achieved extraordinary performance but also puzzling failures. Examining tasks that probe the model’s abilities in diverse ways is therefore critical to the field. In this paper, we explore the \emph{reliability} of models, where we define a reliable model as one that not only achieves strong predictive performance but also performs well consistently over many decision-making tasks such as uncertainty (e.g., selective prediction, open set recognition), robust generalization (e.g., accuracy and scoring rules such as log-likelihood on in- and out-of-distribution datasets), and adaptation (e.g., active learning, few-shot learning). We devise 11 types of tasks over 36 datasets in order to evaluate different aspects of reliability on both vision and language domains. To improve reliability, we developed ViT-Plex and T5-Plex, \emph{p}retrained \emph{l}arge-model \emph{ex}tensions (henceforth abbreviated as \emph{plex}) for vision and language modalities. Plex greatly improves the state-of-the-art across tasks, and as a pretrained model Plex unifies the traditional protocol of designing and tuning one model for each reliability task. We demonstrate scaling effects over model sizes and pretraining dataset sizes up to 4 billion examples. We also demonstrate Plex’s capabilities on new tasks including zero-shot open set recognition, few-shot uncertainty, and uncertainty in conversational language understanding. View details
    Large-scale machine learning-based phenotyping significantly improves genomic discovery for optic nerve head morphology
    Babak Alipanahi
    Babak Behsaz
    Zachary Ryan Mccaw
    Emanuel Schorsch
    Lizzie Dorfman
    Sonia Phene
    Naama Hammel
    Andrew Walker Carroll
    Anthony Khawaja
    American Journal of Human Genetics (2021)
    Preview abstract Genome-wide association studies (GWAS) require accurate cohort phenotyping, but expert labeling can be costly, time-intensive, and variable. Here we develop a machine learning (ML) model to predict glaucomatous features from color fundus photographs. We used the model to predict vertical cup-to-disc ratio (VCDR), a diagnostic parameter and cardinal endophenotype for glaucoma, in 65,680 Europeans in the UK Biobank (UKB). A GWAS of ML-based VCDR identified 299 independent genome-wide significant (GWS; P≤5×10-8) hits in 156 loci. The ML-based GWAS replicated 62 of 65 GWS loci from a recent VCDR GWAS in the UKB for which two ophthalmologists manually labeled images for 67,040 Europeans. The ML-based GWAS also identified 93 novel loci, significantly expanding our understanding of the genetic etiologies of glaucoma and VCDR. Pathway analyses support the biological significance of the novel hits to VCDR, with select loci near genes involved in neuronal and synaptic biology or known to cause severe Mendelian ophthalmic disease. Finally, the ML-based GWAS results significantly improve polygenic prediction of VCDR in independent datasets. View details
    Preview abstract Severe speech impairments limit the precision and range of producible speech sounds. As a result, generic automatic speech recognition (ASR) and keyword spotting (KWS) systems are unable to accurately recognize the utterances produced by individuals with severe speech impairments. This paper describes an approach in which simple speech sounds, namely isolated open vowels (e.g., /a/), are used in lieu of more motorically-demanding keywords. A neural network (NN) is trained to detect these isolated open vowels uttered by individuals with speech impairments against background noise. The NN is trained with a two-phase approach. The pre-training phase uses samples from unimpaired speakers along with samples of background noises and unrelated speech; then the fine-tuning stage uses samples of vowel samples collected from individuals with speech impairments. This model can be built into an experimental mobile app that allows users to activate preconfigured actions such as alerting caregivers. Preliminary user testing indicates the model has the potential to be a useful and flexible emergency communication channel for motor- and speech-impaired individuals. View details
    Preview abstract The use of black-box optimization for the design of new biological sequences is an emerging research area with potentially revolutionary impact. The cost and latency of wet-lab experiments requires methods that find good sequences in few experimental rounds of large batches of sequences --- a setting that off-the-shelf black-box optimization methods are ill-equipped to handle. We find that the performance of existing methods varies drastically across optimization tasks, posing a significant obstacle to real-world applications. To improve robustness, we propose population-based optimization (PBO), which generates batches of sequences by sampling from an ensemble of methods. The number of sequences sampled from any method is proportional to the quality of sequences it previously proposed, allowing PBO to combine the strengths of individual methods while hedging against their innate brittleness. Adapting the population of methods online using evolutionary optimization further improves performance. Through extensive experiments on in-silico optimization tasks, we show that PBO outperforms any single method in its population, proposing both higher quality single sequences as well as more diverse batches. By its robustness and ability to design diverse, high-quality sequences, PBO is shown to be a new state-of-the art approach to the batched black-box optimization of biological sequences. View details
    Underspecification Presents Challenges for Credibility in Modern Machine Learning
    Alexander Nicholas D'Amour
    Dan Moldovan
    Babak Alipanahi
    Alex Beutel
    Christina Chen
    Jon Deaton
    Jacob Eisenstein
    Shaobo Hou
    Ghassen Jerfel
    Yian Ma
    Akinori Mitani
    Andrea Montanari
    Christopher Nielsen
    Thomas Osborne
    Rajiv Raman
    Kim Ramasamy
    Rory Abbott Sayres
    Martin Gamunu Seneviratne
    Shannon Sequeira
    Harini Suresh
    Victor Veitch
    Journal of Machine Learning Research (2020)
    Preview abstract ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain. View details
    TensorFlow.js: Machine Learning for the Web and Beyond
    Daniel Smilkov
    Nikhil Thorat
    Yannick Assogba
    Ann Yuan
    Nick Kreeger
    Ping Yu
    Kangyi Zhang
    Eric Nielsen
    Stan Bileschi
    Charles Nicholson
    Sandeep N. Gupta
    Sarah Sirajuddin
    Rajat Monga
    SysML, Palo Alto, CA, USA (2019)
    Preview abstract TensorFlow.js is a library for building and executing machine learning algorithms in JavaScript. TensorFlow.js models run in a web browser and in the Node.js environment. The library is part of the TensorFlow ecosystem, providing a set of APIs that are compatible with those in Python, allowing models to be ported between the Python and JavaScript ecosystems. TensorFlow.js has empowered a new set of developers from the extensive JavaScript community to build and deploy machine learning models and enabled new classes of on-device computation. This paper describes the design, API, and implementation of TensorFlow.js, and highlights some of the impactful use cases. View details
    The Inclusive Images Competition
    Igor Ivanov
    Miha Skalic
    Pallavi Baljekar
    Pavel Ostyakov
    Roman Solovyev
    Weimin Wang
    Yoni Halpern
    Springer Series (2019)
    Preview abstract Popular large image classification datasets that are drawn from the web present Eurocentric and Americentric biases that negatively impact the generalizability of models trained on them . In order to encourage the development of modeling approaches that generalize well to images drawn from locations and cultural contexts that are unseen or poorly represented at the time of training, we organized the Inclusive Images competition in association with Kaggle and the NeurIPS 2018 Competition Track Workshop. In this chapter, we describe the motivation and design of the competition, present reports from the top three competitors, and provide high-level takeaways from the competition results. View details
    AutoGraph: Imperative-style Coding with Graph-based Performance
    Dan Moldovan
    James Decker
    Fei Wang
    Andrew Johnson
    Brian Lee
    Tiark Rompf
    Alexander B Wiltschko
    SysML (2019)
    Preview abstract Traditionally there has been a perceived trade-off between machine learning code that is easy to write and machine learning code that fast, scalable, or easy to distribute, with platforms like TensorFlow, Theano, PyTorch, and Autograd inhabiting different points along this tradeoff curve. PyTorch and Autograd offer the coding benefits of imperative programming style and accept the computational tradeoff of interpretive overhead. TensorFlow and Theano give the benefit of whole-program optimization based on defined computation graphs, with the trade-off of potentially cumbersome graph-based semantics and associated developer overhead, which become especially apparent for more complex model types that depend on control flow operators. We propose to capture the benefits of both paradigms, using imperative programming style while enabling high performance program optimization, by using staged programming via source code transformation to essentially compile native Python into a lower-level IR like TensorFlow graphs. A key insight is to delay all type-dependent decisions until runtime, via dynamic dispatch. We instantiate these principles in AutoGraph, a piece of software that improves the programming experience of the TensorFlow machine learning library, and demonstrate the strong usability improvements with no loss in performance compared to native TensorFlow graphs.\end{abstract} View details
    Preview abstract When confronted with a substance of unknown identity, researchers often perform mass spectrometry on the sample and compare the observed spectrum to a library of previously collected spectra to identify the molecule. While popular, this approach will fail to identify molecules that are not in the existing library. In response, we propose to improve the library’s coverage by augmenting it with synthetic spectra that are predicted from candidate molecules using machine learning. We contribute a lightweight neural network model that quickly predicts mass spectra for small molecules, averaging 5 ms per molecule with a recall-at-10 accuracy of 91.8%. Achieving high-accuracy predictions requires a novel neural network architecture that is designed to capture typical fragmentation patterns from electron ionization. We analyze the effects of our modeling innovations on library matching performance and compare our models to prior machine-learning-based work on spectrum prediction. View details
    Preview abstract Understanding the relationship between amino acid sequence and protein function is a long-standing problem in molecular biology with far-reaching scientific implications. Despite six decades of progress, state-of-the-art techniques cannot annotate $\sim1/3$ of microbial protein sequences, hampering our ability to exploit sequences collected from diverse organisms. To address this, we report a deep learning model that learns the relationship between unaligned amino acid sequences and their functional classification across all 17929 families of the PFam database. Using the Pfam seed sequences we establish a rigorous benchmark assessment and find that a dilated convolutional model reduces the error of state of the art BLASTp and pHMM models by a factor of nine. With 80\% of the full Pfam database we train a protein family predictor that is more accurate and over 200 times faster than BLASTp, while learning sequence features such as structural disorder and transmembrane helices. Our model co-locates sequences from unseen families in embedding space far from existing families, allowing sequences from novel families to be classified. We anticipate that deep learning models will be a core component of future general-purpose protein function prediction tools. View details
    Preview abstract Modern machine learning methods including deep learning have achieved great success in predictive accuracy for supervised learning tasks, but may still fall short in giving useful estimates of their predictive {\em uncertainty}. Quantifying uncertainty is especially critical in real-world settings, which often involve distributions that are skewed from the training distribution due to a variety of factors including sample bias and non-stationarity. In such settings, well calibrated uncertainty estimates convey information about when a model's output should (or should not) be trusted. Many probabilistic deep learning methods, including Bayesian-and non-Bayesian methods, have been proposed in the literature for quantifying predictive uncertainty, but to our knowledge there has not previously been a rigorous large-scale empirical comparison of these methods under conditions of distributional skew. We present a large-scale benchmark of existing state-of-the-art methods on classification problems and investigate the effect of distributional skew on accuracy and calibration. We find that traditional post-hoc calibration falls short and some Bayesian methods are intractable for very large data. However, methods that marginalize over models give surprisingly strong results across a broad spectrum. View details
    BriarPatches: Pixel-Space Interventions for Inducing Demographic Parity
    Alexander Nicholas D'Amour
    Yoni Halpern
    Neural Information Processing Systems: Workshop on Ethical, Social and Governance Issues in AI (2018)
    Preview abstract We introduce the BriarPatch, a pixel-space intervention that obscures sensitive attributes from representations encoded in pre-trained classifiers. The patches encourage internal model representations not to encode sensitive information, which has the effect of pushing downstream predictors towards exhibiting demographic parity with respect to the sensitive information. The net result is that these BriarPatches provide an intervention mechanism available at user level, and complements prior research on fair representations that were previously only applicable by model developers and ML experts. View details
    Learning to count mosquitoes for the Sterile Insect Technique
    Yaniv Ovadia
    Yoni Halpern
    Dilip Krishnan
    Daniel Newburger
    Proceedings of the 23nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2017)
    Preview abstract Mosquito-borne illnesses such as dengue, chikungunya, and Zika are major global health problems, which are not yet addressable with vaccines and must be countered by reducing mosquito popula- tions. The Sterile Insect Technique (SIT) is a promising alternative to pesticides; however, effective SIT relies on minimal releases of female insects. This paper describes a multi-objective convolutional neural net to significantly streamline the process of counting male and female mosquitoes released from a SIT factory and provides a statistical basis for verifying strict contamination rate limits from these counts despite measurement noise. These results are a promis- ing indication that such methods may dramatically reduce the cost of effective SIT methods in practice. View details
    Preview abstract Data cleaning and feature engineering are both common practices when developing machine learning (ML) models. However, developers are not always aware of best practices for preparing or transforming data for a given model type, which can lead to suboptimal representations of input features. To address this issue, we introduce the data linter, a new class of ML tool that automatically inspects ML data sets to 1) identify potential issues in the data and 2) suggest potentially useful feature transforms, for a given model type. As with traditional code linting, data linting automatically identifies potential issues or inefficiencies; codifies best practices and educates end-users about these practices through tool use; and can lead to quality improvements. In this paper, we provide a detailed description of data linting, describe our initial implementation of a data linter for deep neural networks, and report results suggesting the utility of using a data linter during ML model design. View details
    Bayesian Optimization for a Better Dessert
    Subhodeep Moitra
    Proceedings of the 2017 NIPS Workshop on Bayesian Optimization, December 9, 2017, Long Beach, USA (to appear)
    Preview abstract We present a case study on applying Bayesian Optimization to a complex real-world system; our challenge was to optimize chocolate chip cookies. The process was a mixed-initiative system where both human chefs, human raters, and a machine optimizer participated in 144 experiments. This process resulted in highly rated cookies that deviated from expectations in some surprising ways -- much less sugar in California, and cayenne in Pittsburgh. Our experience highlights the importance of incorporating domain expertise and the value of transfer learning approaches. View details
    Preview abstract Any sufficiently complex system acts as a black box when it becomes easier to experiment with than to understand. Hence, black-box optimization has become increasingly important as systems have become more complex. In this paper we describe Google Vizier, a Google-internal service for performing black-box optimization that has become the de facto parameter tuning engine at Google. Google Vizier is used to optimize many of our machine learning models and other systems, and also provides core capabilities to Google's Cloud Machine Learning HyperTune subsystem. We discuss our requirements, infrastructure design, underlying algorithms, and advanced features such as transfer learning and automated early stopping that the service provides. View details
    TensorFlow Estimators: Managing Simplicity vs. Flexibility in High-Level Machine Learning Frameworks
    Cassandra Xia
    Clemens Mewald
    George Roumpos
    Illia Polosukhin
    Jamie Alexander Smith
    Jianwei Xie
    Lichan Hong
    Mustafa Ispir
    Philip Daniel Tucker
    Yuan Tang
    Proceedings of the 23th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, Canada (2017)
    Preview abstract We present a framework for specifying, training, evaluating, and deploying machine learning models. Our focus is to simplify writing cutting edge machine learning models in a way that enables bringing those models into production. Recognizing the fast evolution of the field of deep learning, we make no attempt to capture the design space of all possible model architectures in a DSL or similar configuration. We allow users to write code to define their models, but provide abstractions that guide developers to write models in ways conducive to productionization, as well as providing a unifying Estimator interface, a unified interface making it possible to write downstream infrastructure (distributed training, hyperparameter tuning, …) independent of the model implementation. We balance the competing demands for flexibility and simplicity by offering APIs at different levels of abstraction, making common model architectures available “out of the box”, while providing a library of utilities designed to speed up experimentation with model architectures. To make out of the box models flexible and usable across a wide range of problems, these canned Estimators are parameterized not only over traditional hyperparameters, but also using feature columns, a declarative specification describing how to interpret input data. We discuss our experience in using this framework in research and production environments, and show the impact on code health, maintainability, and development speed. View details
    Preview abstract Creating reliable, production-level machine learning systems brings on a host of concerns not found in small toy examples or even large offline research experiments. Testing and monitoring are key considerations for ensuring the production-readiness of an ML system, and for reducing technical debt of ML systems. But it can be difficult to formulate specific tests, given that the actual prediction behavior of any given model is difficult to specify a priori. In this paper, we present 28 specific tests and monitoring needs, drawn from experience with a wide range of production ML systems to help quantify these issues and present an easy to follow road-map to improve production readiness and pay down ML technical debt. View details
    Preview abstract Modern machine learning systems such as image classifers rely heavily on large scale data sets for training. Such data sets are costly to create, thus in practice a small number of freely available, open source data sets are widely used. Such strategies may be particularly important for ML applications in the developing world, where resources may be constrained and the cost of creating suitable large scale data sets may be a blocking factor. However, we suggest that examining the {\em geo-diversity} of open data sets is critical before adopting a data set for such use cases. In particular, we analyze two large, publicly available image data sets to assess geo-diversity and find that these data sets appear to exhibit a observable amerocentric and eurocentric representation bias. Further, we perform targeted analysis on classifiers that use these data sets as training data to assess the impact of these training distributions, and find strong differences in the relative performance on images from different locales. These results emphasize the need to ensure geo-representation when constructing data sets for use in the developing world. View details
    AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech
    Yannis Agiomyrgiannakis
    NIPS 2016 End-to-end Learning for Speech and Audio Processing Workshop (to appear)
    Preview abstract Developers of text-to-speech synthesizers (TTS) often make use of human raters to assess the quality of synthesized speech. We demonstrate that we can model human raters' mean opinion scores (MOS) of synthesized speech using a deep recurrent neural network whose inputs consist solely of a raw waveform. Our best models provide utterance-level estimates of MOS only moderately inferior to sampled human ratings, as shown by Pearson and Spearman correlations. When multiple utterances are scored and averaged, a scenario common in synthesizer quality assessment, we achieve correlations comparable to those of human raters. This model has a number of applications, such as the ability to automatically explore the parameter space of a speech synthesizer without requiring a human-in-the-loop. We explore a method of probing what the models have learned. View details
    TensorFlow Debugger: Debugging Dataflow Graphs for Machine Learning
    Eric Nielsen
    Michael Salib
    Proceedings of the Reliable Machine Learning in the Wild - NIPS 2016 Workshop (2016)
    Preview abstract Debuggability is important in the development of machine-learning (ML) systems. Several widely-used ML libraries, such as TensorFlow and Theano, are based on dataflow graphs. While offering important benefits such as facilitating distributed training, the dataflow graph paradigm makes the debugging of model issues more challenging compared to debugging in the more conventional procedural paradigm. In this paper, we present the design of the TensorFlow Debugger (tfdbg), a specialized debugger for ML models written in TensorFlow. tfdbg provides features to inspect runtime dataflow graphs and the state of the intermediate graph elements ("tensors"), as well as simulating stepping on the graph. We will discuss the application of this debugger in development and testing use cases. View details
    What’s your ML test score? A rubric for ML production systems
    Eric Nielsen
    Michael Salib
    Reliable Machine Learning in the Wild - NIPS 2016 Workshop (2016)
    Preview abstract Using machine learning in real-world production systems is complicated by a host of issues not found in small toy examples or even large offline research experiments. Testing and monitoring are key considerations for assessing the production-readiness of an ML system. But how much testing and monitoring is enough? We present an ML Test Score rubric based on a set of actionable tests to help quantify these issues. View details
    Machine Learning: The High Interest Credit Card of Technical Debt
    Eugene Davydov
    Dietmar Ebner
    Vinay Chaudhary
    Michael Young
    SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop)
    Preview abstract Machine learning offers a fantastically powerful toolkit for building complex systems quickly. This paper argues that it is dangerous to think of these quick wins as coming for free. Using the framework of technical debt, we note that it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning. The goal of this paper is highlight several machine learning specific risk factors and design patterns to be avoided or refactored where possible. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, changes in the external world, and a variety of system-level anti-patterns. View details
    Ad Click Prediction: a View from the Trenches
    Michael Young
    Dietmar Ebner
    Julian Grady
    Lan Nie
    Eugene Davydov
    Sharat Chikkerur
    Dan Liu
    Arnar Mar Hrafnkelsson
    Tom Boulos
    Jeremy Kubica
    Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) (2013)
    Preview abstract Predicting ad click--through rates (CTR) is a massive-scale learning problem that is central to the multi-billion dollar online advertising industry. We present a selection of case studies and topics drawn from recent experiments in the setting of a deployed CTR prediction system. These include improvements in the context of traditional supervised learning based on an FTRL-Proximal online learning algorithm (which has excellent sparsity and convergence properties) and the use of per-coordinate learning rates. We also explore some of the challenges that arise in a real-world system that may appear at first to be outside the domain of traditional machine learning research. These include useful tricks for memory savings, methods for assessing and visualizing performance, practical methods for providing confidence estimates for predicted probabilities, calibration methods, and methods for automated management of features. Finally, we also detail several directions that did not turn out to be beneficial for us, despite promising results elsewhere in the literature. The goal of this paper is to highlight the close relationship between theoretical advances and practical engineering in this industrial setting, and to show the depth of challenges that appear when applying traditional machine learning methods in a complex dynamic system. View details
    Large-Scale Learning with Less RAM via Randomization
    Michael Young
    Proceedings of the 30 International Conference on Machine Learning (ICML) (2013), pp. 10
    Preview abstract We reduce the memory footprint of popular large-scale online learning methods by projecting our weight vector onto a coarse discrete set using randomized rounding. Compared to standard 32-bit float encodings, this reduces RAM usage by more than 50% during training and by up to 95% when making predictions from a fixed model, with almost no loss in accuracy. We also show that randomized counting can be used to implement per-coordinate learning rates, improving model quality with little additional RAM. We prove these memory-saving methods achieve regret guarantees similar to their exact variants. Empirical evaluation confirms excellent performance, dominating standard approaches across memory versus accuracy tradeoffs. View details
    Detecting Adversarial Advertisements in the Wild
    Michael Pohl
    Bridget Spitznagel
    John Hainsworth
    Yunkai Zhou
    Proceedings of the 17th ACM SIGKDD International Conference on Data Mining and Knowledge Discovery, KDD (2011)
    Preview abstract In a large online advertising system, adversaries may attempt to profit from the creation of low quality or harmful advertisements. In this paper, we present a large scale data mining effort that detects and blocks such adversarial advertisements for the benefit and safety of our users. Because both false positives and false negatives have high cost, our deployed system uses a tiered strategy combining automated and semi-automated methods to ensure reliable classification. We also employ strategies to address the challenges of learning from highly skewed data at scale, allocating the effort of human experts, leveraging domain expert knowledge, and independently assessing the effectiveness of our system. View details
    Going Mini: Extreme Lightweight Spam Filters
    Gordon V. Cormack
    CEAS 2009: Proceedings of the Sixth Conference on Email and Anti-Spam
    Predicting Bounce Rates in Sponsored Search Advertisements
    Robert Malkin
    Roberto J. Bayardo
    Proc. of the 15th International ACM-SIGKDD Conference on Knowledge Discovery and Data Mining, ACM (2009), pp. 1325-1334
    Large Scale Learning to Rank
    NIPS 2009 Workshop on Advances in Ranking
    Hidden Technical Debt in Machine Learning Systems
    Gary Holt
    Eugene Davydov
    Dietmar Ebner
    Vinay Chaudhary
    Michael Young
    Jean-François Crespo
    NIPS (2015), pp. 2503-2511