Chris Welty

Dr. Chris Welty is a Senior Research Scientist at Google in New York. His main area of interest is the interaction between structured knowledge (e.g. knowledge graphs such as Freebase), unstructured knowledge (e.g. natural language text), and human knowledge (e.g. crowdsourcing). His latest work focuses on understanding the continuous nature of truth in the presence of a diversity of perspectives, and he has been working with the Google Maps team to better understand user contributions that often disagree. He is most active in the Crowdsourcing and Human Computation community, as well as in The Web Conference, AKBC, CIKM (Information and Knowledge Management), and AAAI.

His first project at Google launched as Explore in Google Docs; he then worked on improving the quality and expanding the coverage of price-level labels on Maps using user signals. Before Google, Dr. Welty was a member of the technical leadership team for IBM's Watson, the question-answering computer that defeated the all-time best Jeopardy! champions in a widely televised contest. He appeared on the broadcast discussing the technology behind Watson, and in many articles in the popular and scientific press. His proudest moment was being interviewed for StarTrek.com about the project. He is a recipient of the AAAI Feigenbaum Prize for his work.

Welty has played a seminal role in the development of the Semantic Web and ontologies, and co-developed OntoClean, the first formal methodology for evaluating ontologies. He is on the editorial boards of AI Magazine, the Journal of Applied Ontology, the Journal of Web Semantics, and the Semantic Web Journal. He currently edits the AI Magazine column "AI Bookies," which fosters scientific bets on the progress of AI. He published many papers before those shown below; see his Google Scholar entry.

Authored Publications
    Abstract: We tackle the problem of providing accurate, rigorous p-values for comparisons between the results of two evaluated systems whose evaluations are based on a crowdsourced “gold” reference standard. While this problem has been studied before, we argue that the null hypotheses used in previous work have been based on a common fallacy of equality of probabilities, as opposed to the standard null hypothesis that two sets are drawn from the same distribution. We propose using the standard null hypothesis, that two systems’ responses are drawn from the same distribution, and introduce a simulation-based framework for determining the true p-value for this null hypothesis. We explore how to estimate the true p-value from a single test set under different metrics, tests, and sampling methods, and call particular attention to the role of response variance, which exists in crowdsourced annotations as a product of genuine disagreement, in system predictions as a product of stochastic training regimes, and in generative models as an expected property of the outputs. We find that response variance is a powerful tool for estimating p-values, and present results for the metrics, tests, and sampling methods that make the best p-value estimates in a simple machine learning model comparison.
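    As a rough illustration of simulation under this null hypothesis, here is a minimal paired permutation test over per-item scores; the function name, arguments, and choice of metric are assumptions of the sketch, not the paper's framework.
```python
import numpy as np

def permutation_p_value(scores_a, scores_b, metric=np.mean, n_sim=10_000, seed=0):
    """Estimate a p-value for the null hypothesis that two systems'
    per-item scores are drawn from the same distribution: randomly
    swap each item's pair of scores and re-measure the metric gap."""
    rng = np.random.default_rng(seed)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    observed = abs(metric(scores_a) - metric(scores_b))
    hits = 0
    for _ in range(n_sim):
        swap = rng.random(scores_a.shape[0]) < 0.5   # swap each pair with prob 0.5
        sim_a = np.where(swap, scores_b, scores_a)
        sim_b = np.where(swap, scores_a, scores_b)
        if abs(metric(sim_a) - metric(sim_b)) >= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)                  # smoothed empirical p-value
```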
    Annotator Response Distributions as a Sampling Frame
    Christopher Homan
    LREC Workshop on Perspectivist NLP (2022)
    Abstract: Annotator disagreement is often dismissed as noise or the result of poor annotation process quality. Others have argued that it can be meaningful. But lacking a rigorous statistical foundation, the analysis of disagreement patterns can resemble a high-tech form of tea-leaf reading. We contribute a framework for analyzing the variation of per-item annotator response distributions to data for humans-in-the-loop machine learning. We provide visualizations for, and use the framework to analyze the variance in, a crowdsourced dataset of hard-to-classify examples from the OpenImages archive.
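    A minimal sketch of the basic object such a framework analyzes, a per-item distribution over annotator responses, with entropy as one simple disagreement summary; the helper names and toy labels are assumptions, and the paper's framework and visualizations go well beyond this.
```python
import math
from collections import Counter

def response_distribution(responses):
    """Normalized label distribution for one item's annotator responses."""
    counts = Counter(responses)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def response_entropy(responses):
    """Shannon entropy of the response distribution; 0 means the
    annotators were unanimous, higher means more disagreement."""
    return -sum(p * math.log2(p) for p in response_distribution(responses).values())

# Toy example: two of three annotators say "cat".
print(response_entropy(["cat", "cat", "dog"]))  # ~0.918 bits
```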
    Abstract: Search engines including Google are beginning to support local-dining queries such as "At which nearby restaurants can I order the Indonesian salad gado-gado?". Given the low coverage of online menus worldwide, with only 30% of restaurants even having a website, this remains a challenge. Here we leverage the power of the crowd: online users who are willing to answer questions about dish availability at restaurants visited. While motivated users are happy to contribute knowledge for free, they are much less likely to respond to "silly" or embarrassing questions (e.g., "Does Pizza Hut serve pizza?" or "Does Mike's Vegan Restaurant serve hamburgers?"). In this paper, we study the problem of Vexation-Aware Active Learning, where judiciously selected questions are targeted towards improving restaurant-dish model prediction, subject to a limit on the percentage of "unsure" answers or "dismissals" (e.g., swiping the app closed) used to measure vexation. We formalize the problem as an integer linear program and solve it efficiently using a distributed solution that scales linearly with the number of candidate questions. Since our algorithm relies on precise estimation of the unsure-dismiss rate (UDR), we give a regression model that provides accurate results compared to baselines including collaborative filtering. Finally, we demonstrate in a live system that our proposed vexation-aware strategy performs competitively against classical (margin-based) active learning approaches while not exceeding UDR bounds.
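    A toy sketch of the trade-off the paper optimizes, substituting a greedy heuristic for the paper's integer linear program: pick high-value questions while keeping the average predicted unsure-dismiss rate under a limit. All names and data shapes here are assumptions.
```python
def select_questions(candidates, udr_limit, budget):
    """Greedy stand-in for the ILP: take questions in order of model-improvement
    value per unit of predicted vexation, keeping the running average
    predicted UDR of the chosen set at or below udr_limit.

    candidates: iterable of (question_id, value, predicted_udr) tuples.
    """
    chosen, udr_sum = [], 0.0
    ranked = sorted(candidates, key=lambda c: c[1] / (c[2] + 1e-9), reverse=True)
    for qid, value, udr in ranked:
        if len(chosen) >= budget:
            break
        if (udr_sum + udr) / (len(chosen) + 1) <= udr_limit:
            chosen.append(qid)
            udr_sum += udr
    return chosen

# Toy example: question 2 is valuable but too vexing to ask.
print(select_questions([(1, 0.9, 0.1), (2, 1.0, 0.6), (3, 0.4, 0.05)],
                       udr_limit=0.2, budget=2))  # [1, 3]
```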
    Abstract: Successful knowledge graphs (KGs) solved the historical knowledge acquisition bottleneck by supplanting an expert focus with a simple, crowd-friendly one: KG nodes represent popular people, places, organizations, etc., and the graph arcs represent common sense relations like affiliations, locations, etc. Techniques for more general, categorical KG curation do not seem to have made the same transition: the KG research community is still largely focused on logic-based methods that belie the common-sense characteristics of successful KGs. In this paper, we propose a simple yet novel approach to acquiring class-level attributes from the crowd that represent broad common sense associations between categories, and can be used with the classic knowledge-base default & override technique (e.g. Reiter, 1978) to address the early label sparsity problem faced by machine learning systems for problems that lack data for training. We demonstrate the effectiveness of our acquisition and reasoning approach on a pair of very real industrial-scale problems: how to augment an existing KG of places and offerings (e.g. stores and products, restaurants and dishes) with associations between them indicating the availability of the offerings at those places, which would enable the KG to provide answers to questions like, "Where can I buy milk nearby?" This problem has several practical challenges, but for this paper we focus mostly on the label sparsity. Less than 30% of physical places worldwide (i.e. brick & mortar stores and restaurants) have a website, and less than half of those list their product catalog or menus, leaving a large acquisition gap to be filled by methods other than information extraction (IE). Label sparsity is a general problem, not specific to these use cases, that prevents modern AI and machine learning techniques from applying to many applications for which labeled data is not readily available. As a result, the study of how to acquire the knowledge and data needed for AI to work is as much a problem today as it was in the 1970s and 80s during the advent of expert systems (e.g. MYCIN). The class-level attributes approach presented here is based on a KG-inspired intuition that a lot of the knowledge people need to understand where to go to buy a product they need, or where to find the dishes they want to eat, is categorical and part of their general common sense: everyone knows grocery stores sell milk and don't sell asphalt, Chinese restaurants serve fried rice and not hamburgers, etc. We acquired a mixture of instance- and class-level pairs (e.g. ⟨Ajay Mittal Dairy, milk⟩ and ⟨GroceryStore, milk⟩, respectively) from a novel 3-tier crowdsourcing method, and demonstrate the scalability advantages of the class-level approach. Our results show that crowdsourced class-level knowledge can provide rapid scaling of knowledge acquisition in shopping and dining domains. The acquired common sense knowledge also has long-term value in the KG. The approach was a critical part of enabling a worldwide local search capability on Google Maps, with which users can find products and dishes that are available in most places on earth.
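    A minimal sketch of the default & override idea the abstract refers to: an instance-level fact about a specific place wins, and otherwise the class-level default applies. The dictionaries and names below are toy assumptions, not the paper's data model.
```python
# Toy knowledge, for illustration only.
class_defaults = {("GroceryStore", "milk"): True, ("GroceryStore", "asphalt"): False}
instance_facts = {("Ajay Mittal Dairy", "milk"): True}
instance_class = {"Ajay Mittal Dairy": "GroceryStore"}

def sells(place, product):
    """Default & override: a fact asserted for the specific place takes
    precedence; otherwise fall back to the class-level common-sense default."""
    if (place, product) in instance_facts:
        return instance_facts[(place, product)]
    return class_defaults.get((instance_class.get(place), product))

print(sells("Ajay Mittal Dairy", "milk"))     # True, from the instance-level fact
print(sells("Ajay Mittal Dairy", "asphalt"))  # False, from the class-level default
```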
    Abstract: Successful knowledge graphs (KGs) solved the historical knowledge acquisition bottleneck by supplanting an expert focus with a simple, crowd-friendly one: KG nodes represent popular people, places, organizations, etc., and the graph arcs represent common sense relations like affiliations, locations, etc. Techniques for more general, categorical KG curation do not seem to have made the same transition: the KG research community is still largely focused on methods that belie the common-sense characteristics of successful KGs. In this paper, we propose a simple approach to acquiring and reasoning with class-level attributes from the crowd that represent broad common sense associations between categories. We pick a very real industrial-scale data set and problem: how to augment an existing knowledge graph of places and products with associations between them indicating the availability of the products at those places, which would enable a KG to provide answers to questions like, "Where can I buy milk nearby?" This problem has several practical challenges, not least of which is that only 30% of physical stores (i.e. brick & mortar stores) have a website, and fewer list their product inventory, leaving a large acquisition gap to be filled by methods other than information extraction (IE). Based on a KG-inspired intuition that a lot of the class-level pairs are part of people's general common sense, e.g. everyone knows grocery stores sell milk and don't sell asphalt, we acquired a mixture of instance- and class-level pairs (e.g. ⟨Ajay Mittal Dairy, milk⟩ and ⟨GroceryStore, milk⟩, respectively) from a novel 3-tier crowdsourcing method, and demonstrate the scalability advantages of the class-level approach. Our results show that crowdsourced class-level knowledge can provide rapid scaling of knowledge acquisition in this and similar domains, as well as long-term value in the KG.
    Empirical methodology for crowdsourcing ground truth
    Anca Dumitrache
    Benjamin Timmermans
    Oana Inel
    Semantic Web Journal, 12(3) (2021)
    Abstract: The process of gathering ground truth data through human annotation is a major bottleneck in the use of information extraction methods for populating the Semantic Web. Crowdsourcing-based approaches are gaining popularity in the attempt to solve the issues related to volume of data and lack of annotators. Typically these practices use inter-annotator agreement as a measure of quality. However, in many domains, such as event detection, there is ambiguity in the data, as well as a multitude of perspectives on the information examples. We present an empirically derived methodology for efficiently gathering ground truth data in a diverse set of use cases covering a variety of domains and annotation tasks. Central to our approach is the use of CrowdTruth metrics that capture inter-annotator disagreement. We show that measuring disagreement is essential for acquiring a high-quality ground truth. We achieve this by comparing the quality of the data aggregated with CrowdTruth metrics against majority vote, over a set of diverse crowdsourcing tasks: Medical Relation Extraction, Twitter Event Identification, News Event Extraction, and Sound Interpretation. We also show that an increased number of crowd workers leads to growth and stabilization in the quality of annotations, going against the usual practice of employing a small number of annotators.
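    For flavor, here is a heavily simplified, disagreement-aware aggregation in the spirit of CrowdTruth, contrasted with majority vote: workers are weighted by how often they agree with their co-annotators, and items get weighted label scores instead of a single winning label. This is an assumption-laden sketch, not the actual CrowdTruth metrics.
```python
from collections import defaultdict

def crowd_aggregate(votes):
    """Disagreement-aware aggregation sketch (much simpler than CrowdTruth):
    weight each worker by their mean agreement with co-annotators, then
    score each item's labels by total worker weight rather than raw counts.

    votes: list of (worker, item, label) triples.
    """
    by_item = defaultdict(list)
    for worker, item, label in votes:
        by_item[item].append((worker, label))
    # Worker quality: average fraction of co-annotators who gave the same label.
    agree, seen = defaultdict(float), defaultdict(int)
    for annotations in by_item.values():
        for worker, label in annotations:
            others = [l for w, l in annotations if w != worker]
            if others:
                agree[worker] += sum(l == label for l in others) / len(others)
                seen[worker] += 1
    quality = {w: agree[w] / seen[w] for w in seen}
    # Item scores: quality-weighted label fractions (vs. majority counting).
    scores = {}
    for item, annotations in by_item.items():
        totals = defaultdict(float)
        for worker, label in annotations:
            totals[label] += quality.get(worker, 0.0)
        norm = sum(totals.values()) or 1.0
        scores[item] = {label: t / norm for label, t in totals.items()}
    return scores
```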
    AI Bookie: Betting on Bets
    Kurt Bollacker
    Praveen Kumar Paritosh
    AI Magazine, 42(3), Fall 2021
    Abstract: The AI bookies have spent a lot of time and energy collecting bets from AI researchers, and have met with universal approval of the idea of scientific betting, and nearly universal silence in the acquisition of bets. We have collected a few in this column over the past two years; in the first column we published the “will voice interfaces become the standard” bet, as well as a set of 10 predictions from Eric Horvitz that we proposed as bets awaiting challengers. No challengers have emerged. In this article we review the methods we've used to collect bets and conclude that people need ideas for bets to make. We propose five new bets and solicit participants in them.
    Embedding Semantic Taxonomies
    Alyssa Whitlock Lees
    Jacek Korycki
    Sara Mc Carthy
    COLING 2020
    Abstract: A common step in developing an understanding of a vertical domain, e.g. shopping, dining, movies, medicine, etc., is curating a taxonomy of categories specific to the domain. These human-created artifacts have been the subject of research in embeddings that attempt to encode aspects of the partial-ordering property of taxonomies. We compare Box Embeddings, a natural containment representation of category taxonomies, to partial-order embeddings and a baseline Bayes Net, in the context of representing the Medical Subject Headings (MeSH) taxonomy given a set of 300K PubMed articles with subject labels from MeSH. We explore in depth the experimental properties of training box embeddings, including preparation of the training data, sampling ratios and class balance, and initialization strategies, and we propose a fix to the original box objective. We then present first results in using these techniques for representing a bipartite learning problem (i.e. collaborative filtering) in the presence of taxonomic relations within each partition, inferring disease (anatomical) locations from their use as subject labels in journal articles. Our box model substantially outperforms all baselines for taxonomic reconstruction and bipartite relationship experiments. This performance improvement is observed both in overall accuracy and in the weighted spread by true taxonomic depth.
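    To make the containment intuition concrete, here is a minimal sketch with hard axis-aligned boxes: a child category IS-A parent when the child's box lies (mostly) inside the parent's. Trained box embeddings use smoothed volumes and learned corners; the names and toy boxes here are assumptions.
```python
import numpy as np

def box_volume(lo, hi):
    """Volume of an axis-aligned box; zero if any side is non-positive."""
    return float(np.prod(np.clip(hi - lo, 0.0, None)))

def containment(child, parent):
    """Fraction of the child box inside the parent box:
    vol(child ∩ parent) / vol(child). Values near 1.0 encode IS-A."""
    (clo, chi), (plo, phi) = child, parent
    inter = box_volume(np.maximum(clo, plo), np.minimum(chi, phi))
    vol = box_volume(clo, chi)
    return inter / vol if vol > 0 else 0.0

# Toy 2-D boxes: a "Neoplasms" box nested inside a "Diseases" box.
diseases  = (np.array([0.0, 0.0]), np.array([1.0, 1.0]))
neoplasms = (np.array([0.2, 0.2]), np.array([0.5, 0.5]))
print(containment(neoplasms, diseases))  # 1.0, i.e. Neoplasms IS-A Diseases
```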
    Discovering User Bias in Ordinal Voting Systems
    Alyssa Whitlock Lees
    SAD-2019: Workshop on Subjectivity, Ambiguity and Disagreement
    Abstract: Crowdsourcing systems increasingly rely on users to provide more subjective ground truth for intelligent systems, e.g. ratings, aspects of quality, and perspectives on how expensive or lively a place feels. We focus on the ubiquitous implementation of online user ordinal voting (e.g. 1-5, or 1 star to 4 stars) on some aspect of an entity, to extract a relative truth, measured by a selected metric such as vote plurality or mean. We argue that this methodology can aggregate results that yield little information to the end user. In particular, ordinal user rankings often converge to an indistinguishable rating. This is demonstrated by the trend in certain cities for the majority of restaurants to have a 4-star rating. Similarly, the rating of an establishment can be significantly affected by a few users. User bias in voting is not spam, but rather a preference that can be harnessed to provide more information to users. We explore notions of both global skew and user bias. Leveraging these bias and preference concepts, the paper suggests explicit models for better personalization and more informative ratings.
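    One minimal way to operationalize "user bias as signal, not spam" is to measure each user's mean signed deviation from the average rating of the items they rated; the sketch below assumes toy data shapes and is not the paper's model.
```python
from collections import defaultdict

def user_biases(ratings):
    """Per-user bias: mean signed deviation from each rated item's average.
    Positive means systematically generous, negative means systematically
    harsh; either can be used to debias or personalize displayed ratings.

    ratings: list of (user, item, score) triples.
    """
    by_item = defaultdict(list)
    for _, item, score in ratings:
        by_item[item].append(score)
    item_mean = {item: sum(s) / len(s) for item, s in by_item.items()}
    devs = defaultdict(list)
    for user, item, score in ratings:
        devs[user].append(score - item_mean[item])
    return {user: sum(d) / len(d) for user, d in devs.items()}

# Toy example: user "a" consistently rates above the item averages.
print(user_biases([("a", "x", 5), ("b", "x", 4), ("a", "y", 4), ("b", "y", 3)]))
# {'a': 0.5, 'b': -0.5}
```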
    What is Fair? Exploring Pareto-Efficiency for Fairness Constraint Classifiers
    Alyssa Whitlock Lees
    Ananth Balashankar
    Lakshminarayanan Subramanian
    arXiv (2019), 10 pp.
    Abstract: The potential for learned models to amplify existing societal biases has been broadly recognized. Fairness-aware classifier constraints, which apply equality metrics of performance across subgroups defined on sensitive attributes such as race and gender, seek to rectify inequity but can yield non-uniform degradation in performance for skewed datasets. In certain domains, imbalanced degradation of performance can yield another form of unintentional bias. In the spirit of constructing fairness-aware algorithms as a societal imperative, we explore an alternative: Pareto-Efficient Fairness (PEF). PEF identifies the operating point on the Pareto curve of subgroup performances closest to the fairness hyperplane, maximizing multiple subgroup accuracies. Empirically, we demonstrate that PEF increases performance of all subgroups in several UCI datasets.
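    A toy sketch of the selection rule described here: among candidate operating points (vectors of subgroup accuracies), keep the Pareto frontier and return the frontier point closest to the hyperplane where all subgroup accuracies are equal. How candidates are generated and trained is out of scope; everything named below is an assumption of the sketch.
```python
import numpy as np

def pareto_efficient_fair_point(points):
    """Pick the Pareto-efficient subgroup-accuracy vector nearest the
    'all subgroups equal' hyperplane.

    points: list of per-subgroup accuracy vectors, one per operating point.
    """
    pts = np.asarray(points, dtype=float)
    # Pareto frontier: points not dominated (>= everywhere, > somewhere).
    frontier = [p for p in pts
                if not any(np.all(q >= p) and np.any(q > p) for q in pts)]
    # Distance to the equal-accuracy line is the spread around the mean.
    return min(frontier, key=lambda p: np.linalg.norm(p - p.mean()))

# Toy candidates for two subgroups: [0.80, 0.80] is dominated; of the rest,
# [0.85, 0.84] is both Pareto-efficient and nearly fair.
print(pareto_efficient_fair_point([[0.90, 0.70], [0.85, 0.84], [0.80, 0.80]]))
```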