Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models

Sameer Singh; Amarnag Subramanya; Fernando Pereira; Andrew McCallum

Association for Computational Linguistics (ACL) (2011)

Abstract

Cross-document coreference, the task of grouping all the mentions of each entity in a document collection, arises in information extraction and automated knowledge base construction. For large collections, it is impractical to consider all possible groupings of mentions into distinct entities. To address this problem we propose two ideas: (a) a distributed inference technique that uses parallelism to enable large-scale processing, and (b) a hierarchical model of coreference that represents uncertainty over multiple granularities of entities to facilitate more effective approximate inference. To evaluate these ideas, we constructed a labeled corpus of 1.5 million disambiguated mentions in Web pages by selecting link anchors referring to Wikipedia entities. We show that the combination of the hierarchical model with distributed inference quickly obtains high accuracy (with an error reduction of 38%) on this large dataset, demonstrating the scalability of our approach.
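
To make concrete why enumerating every grouping of mentions is impractical, note that the number of ways to partition n mentions into entities is the Bell number B_n, which grows faster than exponentially. The short sketch below is an illustration added here, not code from the paper; the function name `bell_numbers` is hypothetical. It computes B_n with the standard Bell triangle recurrence.

```python
def bell_numbers(n_max):
    """Return [B_0, ..., B_n_max], where B_n counts the partitions of n items."""
    bells = [1]   # B_0 = 1: the empty set has exactly one partition
    row = [1]     # first row of the Bell triangle
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row;
        # every later entry adds the entry to its left and the one above that.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


if __name__ == "__main__":
    for n, b in enumerate(bell_numbers(10)):
        print(f"{n:2d} mentions -> {b} possible groupings")
```

Ten mentions already admit 115975 distinct groupings, so for a corpus with over a million mentions exhaustive search is out of the question and approximate inference is required.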

Research Areas

Natural language processing
