Google Research

PropSegmEnt: A Large-Scale Corpus for Proposition-Level Segmentation and Entailment Recognition

Findings of ACL 2023 (to appear)


The widely studied task of Natural Language Inference (NLI) requires a system to recognize whether one piece of text is textually entailed by another, i.e., whether the entirety of its meaning can be inferred from the other. In current NLI corpora and models, the textual entailment relation is typically defined at the sentence or paragraph level. However, even a simple sentence often contains multiple propositions, i.e., distinct units of meaning conveyed by the sentence. These propositions can carry different truth values in the context of a given premise, and we argue for the need to identify such fine-grained textual entailment relations.

To facilitate the study of proposition-level segmentation and entailment, we propose PropSegmEnt, a corpus of over 35K propositions annotated by trained expert annotators. The dataset structure mirrors two tasks: (1) segmenting the sentences of a document into a set of propositions, and (2) classifying the entailment relation of each proposition with respect to a different yet topically aligned document, i.e., a document describing the same event or entity. We establish strong baselines for the segmentation and entailment tasks, and we demonstrate that our conceptual framework is potentially useful for understanding and explaining the compositionality of NLI labels.
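
To make the idea of compositional NLI labels concrete, the following is a minimal sketch, not the authors' method or the corpus's actual data format. It assumes the standard three-way NLI label set and a simple, hypothetical composition rule: a sentence is contradicted if any of its propositions is contradicted, entailed only if all of them are entailed, and neutral otherwise. The names `Label`, `Proposition`, and `compose_sentence_label`, as well as the example sentence, are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    # Assumed three-way NLI label set (not taken from the corpus schema).
    ENTAILMENT = "entailment"
    NEUTRAL = "neutral"
    CONTRADICTION = "contradiction"


@dataclass(frozen=True)
class Proposition:
    # One unit of meaning segmented from a hypothesis sentence,
    # with its label relative to a premise document.
    text: str
    label: Label


def compose_sentence_label(propositions):
    """Derive a sentence-level label from proposition-level labels.

    Hypothetical rule: any contradiction wins; entailment requires
    every proposition to be entailed; everything else is neutral.
    """
    if not propositions:
        raise ValueError("a sentence needs at least one proposition")
    labels = {p.label for p in propositions}
    if Label.CONTRADICTION in labels:
        return Label.CONTRADICTION
    if labels == {Label.ENTAILMENT}:
        return Label.ENTAILMENT
    return Label.NEUTRAL


# Illustrative sentence: "The bridge, which opened in 1932, was closed on Monday."
example = [
    Proposition("The bridge opened in 1932.", Label.NEUTRAL),
    Proposition("The bridge was closed on Monday.", Label.ENTAILMENT),
]
sentence_label = compose_sentence_label(example)
```

Under this rule the example sentence as a whole is labeled neutral, even though one of its propositions is entailed by the premise. That lost detail is the kind of fine-grained information that proposition-level labels retain.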
