Tal Schuster

Tal Schuster

Tal Schuster is a Staff Research Scientist at Google AI working on Machine Learning and Natural Language Processing. He is developing robust and efficient models that leverage uncertainty-aware methods, and focusing on information-seeking applications.

For more details and full list of publications see his Personal Website and Google Scholar profile.

Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract Recently proposed long-form question answering (QA) systems, supported by large language models (LLMs), have shown promising capabilities. Yet, attributing and verifying their generated abstractive answers can be difficult, and automatically evaluating their accuracy remains an ongoing challenge. In this paper, we introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion. Specifically, Semi-extractive Multi-source QA (SEMQA) requires models to output a comprehensive answer while mixing between factual quoted spans---copied verbatim from given input sources---and non-factual free-text connectors that glue these spans together into a single cohesive passage. This setting bridges the gap between the outputs of well-grounded but constrained extractive QA systems and more fluent but harder to attribute fully abstractive answers. Particularly, it enables a new mode for language models that leverages their advanced language generation capabilities, while also producing fine in-line attributions by-design that are easy to verify, interpret, and evaluate. To study this task, we create the first dataset of this kind with human-written semi-extractive answers to natural and generated questions, and define text-based evaluation metrics. Experimenting with several LLMs in various settings, we find this task to be surprisingly challenging, demonstrating the importance of our work for developing and studying such consolidation capabilities. View details
    Attribute First, then Generate: Locally-attributable Grounded Text Generation
    Aviv Slobodkin
    Eran Hirsch
    Arie Cattan
    Ido Dagan
    ACL(2024) (to appear)
    Preview abstract Recent efforts to address hallucinations in Large Language Models (LLMs) have focused on attributed text generation, which supplements generated texts with citations of supporting sources for post-generation fact-checking and corrections. Yet, these citations often point to entire documents or paragraphs, burdening users with extensive verification work. In this paper, we introduce a locally-attributable text generation approach, prioritizing concise attributions. Our method, named ``Attribute First, then Generate'', breaks down the conventional end-to-end generation process into three intuitive steps: content selection, sentence planning, and sequential sentence generation. By initially identifying relevant source segments (``select first'') and then conditioning the generation process on them (``then generate''), we ensure these segments also act as the output's fine-grained attributions (``select'' becomes ``attribute''). Tested on Multi-document Summarization and Long-form Question-answering, our method not only yields more concise citations than the baselines but also maintains - and in some cases enhances - both generation quality and attribution accuracy. Furthermore, it significantly reduces the time required for fact verification by human assessors. View details
    Conformal Risk Control
    Anastasios N. Angelopoulos
    Stephen Bates
    Adam Fisch
    Lihua Lei
    ICLR(2024)
    Preview abstract We extend conformal prediction to control the expected value of any monotone loss function. The algorithm generalizes split conformal prediction together with its coverage guarantee. Like conformal prediction, the conformal risk control procedure is tight up to an O(1/n) factor. Worked examples from computer vision and natural language processing demonstrate the usage of our algorithm to bound the false negative rate, graph distance, and token-level F1-score. View details
    Conformal Language Modeling
    Victor Quach
    Adam Fisch
    Adam Yala
    Jae Ho Sohn
    Tommi Jaakkola
    Regina Barzilay
    ICLR(2024)
    Preview abstract In this paper, we propose a novel approach to conformal prediction (CP) that is adapted to generative, large language models (LLMs). Conformal prediction is a popular technique for deriving prediction sets from machine learning models that have rigorous, statistical performance guarantees. We extend conformal techniques to a broad class of language models that sample from a conditional distribution over the combinatorial, unbounded space of possible text outputs, given some input prompt. Specifically, we translate the process of constructing prediction sets into calibrating a \emph{stopping rule}, under which we draw diverse samples from our model until we are confident that the growing set of candidate answers includes at least one high-quality response. At the same time, we calibrate a \emph{rejection rule} to selectively discard low-quality or redundant responses to reduce sample noise. Under minimal assumptions, we theoretically prove that our resulting output sets contain at least one high-quality answer with some desired probability that a user can set (such as $90\%$), while still remaining empirically precise on average. Furthermore, within this set of sampled candidate answers, we show that we can also accurately identify subsets of individual components (e.g., phrases or sentences) that are each independently correct (e.g., that are not ``hallucinations'')---again, with provably high probability. We demonstrate the effectiveness of our approach on multiple types of large language models applied to tasks in open-domain question answering, text summarization, and radiology report generation. View details
    Preview abstract The widely studied task of Natural Language Inference (NLI) requires a system to recognize whether one piece of text is textually entailed by another, i.e. whether the entirety of its meaning can be inferred from the other. In current NLI corpora and models, the textual entailment relation is typically defined on the sentence- or paragraph- level. However, even a simple sentence often contains multiple propositions, i.e. distinct units of meaning conveyed by the sentence. These propositions can carry different truth values in the context of a given premise, and we argue for the need to identify such fine-grained textual entailment relations. To facilitate the study on proposition-level segmentation and entailment, we propose PropSegmEnt, a corpus of over 35K propositions annotated by trained expert annotators. Our dataset structure resembles the tasks of (1) segmenting sentences within a document to the set of propositions, and (2) classifying the entailment relation of each proposition with respect to a different yet topically-aligned document, i.e. documents describing the same event or entity. We establish strong baselines for the segmentation and entailment tasks. We demonstrate that our conceptual framework is potentially useful for understanding and explaining the compositionality of NLI labels. View details
    Preview abstract Transformer encoders contextualize token representations by attending to all other tokens at each layer, leading to quadratic increase in compute effort with the input length. In practice, however, the input text of many NLP tasks can be seen as a sequence of related segments (e.g., the sequence of sentences within a passage, or the hypothesis and premise in NLI). While attending across these segments is highly beneficial for many tasks, we hypothesize that this interaction can be delayed until later encoding stages. To this end, we introduce Layer-adjustable Interactions in Transformers (LAIT). Within LAIT, segmented inputs are first encoded independently, and then jointly. This partial two-tower architecture bridges the gap between a Dual Encoder's ability to pre-compute representations for segments and a fully self-attentive Transformer's capacity to model cross-segment attention. Also, LAIT can be introduced only when finetuning, effectively converting an existing pretrained Transformer into the hybrid of the two aforementioned architectures, and providing an intuitive control over the performance-efficiency tradeoff. Experimenting on a wide range of NLP tasks, we find LAIT to significantly improve efficiency while preserving accuracy. View details
    Preview abstract Existing pre-trained models are generally geared towards a particular class of problems. To date, there seems to be still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes with pre-training objectives – two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pretraining objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method pushes the Pareto-frontier by outperforming T5 and/or GPT-like models across multiple diverse setups. Finally, by scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised NLP tasks ranging from language generation (with automated and human evaluation), language understanding, text classification, question answering, commonsense reasoning, long text reasoning, structured knowledge grounding and information retrieval. Our model also achieve strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on oneshot summarization. Finally, we show that UL2 20B works well with chain-ofthought prompting and reasoning tasks, making it an appealing choice for research into reasoning at a small to medium scale of 20B parameters. We publicly release Flax-based T5X model checkpoints for the 20B model. View details
    Preview abstract Social and behavioral determinants of health (SDOH) play a significant role in shaping health outcomes, and extracting these determinants from clinical notes is a first step to help healthcare providers systematically identify opportunities to provide appropriate care and address disparities. Progress on using NLP methods for this task has been hindered by the lack of high-quality public datasets, largely due to the privacy and regulatory constraints on the use of real patient data. This paper introduces a new dataset, SDOH-NLI, that is based on fully public data. We formulate SDOH extraction as a natural language inference task, and provide binary textual entailment labels obtained from human raters for a cross product of a set of social history snippets as premises and SDOH factors as hypotheses. Our dataset differs from standard NLI benchmark in that our premises and hypotheses are obtained independently. We evaluate both "off-the-shelf" entailment models as well as models fine-tuned on our data, and highlight the ways in which our dataset appears more challenging than commonly used NLI datasets. View details
    Preview abstract The predictions of question answering (QA) systems are typically evaluated against manually annotated finite sets of one or more answers. This leads to a coverage limitation that results in underestimating the true performance of systems, and is typically addressed by extending over exact match (EM) with predefined rules or with the token-level F1 measure. In this paper, we present the first systematic conceptual and data-driven analysis to examine the shortcomings of token-level equivalence measures. To this end, we define the asymmetric notion of answer equivalence (AE), accepting answers that are equivalent to or improve over the reference, and publish over 23k human judgments for candidates produced by multiple QA systems on SQuAD. Through a careful analysis of this data, we reveal and quantify several concrete limitations of the F1 measure, such as a false impression of graduality, or missing dependence on the question. Since collecting AE annotations for each evaluated model is expensive, we learn a BERT matching (BEM) measure to approximate this task. Being a simpler task than QA, we find BEM to provide significantly better AE approximations than F1, and to more accurately reflect the performance of systems. Finally, we demonstrate the practical utility of AE and BEM on the concrete application of minimal accurate prediction sets, reducing the number of required answers by up to ×2.6. View details
    Preview abstract Accurate estimates of uncertainty are important for many difficult or sensitive prediction tasks in natural language processing (NLP). Though large-scale pre-trained models have vastly improved the accuracy of applied machine learning models throughout the field, there still are many instances in which they fail. The ability to precisely quantify uncertainty while handling the challenging scenarios that modern models can face when deployed in the real world is critical for reliable, consequential-decision making. This tutorial is intended for both academic researchers and industry practitioners alike, and provides a comprehensive introduction to uncertainty estimation for NLP problems---from fundamentals in probability calibration, Bayesian inference, and confidence set (or interval) construction, to applied topics in modern out-of-distribution detection and selective inference. Tutorial website View details