50,000 Lessons on How to Read: a Relation Extraction Corpus

April 11, 2013

Posted by Dave Orr, Product Manager, Google Research

One of the most difficult tasks in NLP is called relation extraction. It’s an example of information extraction, one of the goals of natural language understanding. A relation is a semantic connection between (at least) two entities. For instance, you could say that Jim Henson was in a spouse relation with Jane Henson (and in a creator relation with many beloved characters and shows).

The goal of relation extraction is to learn relations from unstructured natural language text. The relations can be used to answer questions (“Who created Kermit?”), to learn which proteins interact according to the biomedical literature, or to build a database of hundreds of millions of entities and billions of relations to help people explore the world’s information.

To help researchers investigate relation extraction, we’re releasing a human-judged dataset of two relations about public figures on Wikipedia: nearly 10,000 examples of “place of birth”, and over 40,000 examples of “attended or graduated from an institution”. Each of these was judged by at least 5 raters, and can be used to train or evaluate relation extraction systems. We also plan to release more relations of new types in the coming months. (Update: you can find additional relations here.)

Each relation is in the form of a triple: the relation in question, called a predicate; the subject of the relation; and the object of the relation. In the relation “Stephen Hawking graduated from Oxford University,” Stephen Hawking is the subject, graduated from is the predicate, and Oxford University is the object. Subjects and objects are represented by their Freebase MIDs, and the predicate is defined as a Freebase property. So in this case, the triple would be represented as:


Just having the triples is interesting enough if you want a database of entities and relations, but doesn’t make much progress towards training or evaluating a relation extraction system. So we’ve also included the evidence for the relation, in the form of a URL and an excerpt from the web page that our raters judged. We’re also including examples where the evidence does not support the relation, so you have negative examples for use in training better extraction systems. Finally, we included the IDs and actual judgments of individual raters, so that you can filter triples by agreement.

Gory Details

The corpus itself, extracted from Wikipedia, can be found here: https://code.google.com/p/relation-extraction-corpus/

The files are in JSON format. Each line is a triple with the following fields:

  • pred: predicate of a triple
  • sub: subject of a triple
  • obj: object of a triple
  • evidences: an array of evidences for this triple
    • url: the web page from which this evidence was obtained
    • snippet: short piece of text supporting the triple
  • judgments: an array of judgments from human annotators
    • rater: hash code of the identity of the annotator
    • judgment: judgment of the annotator. It can take the values "yes" or "no"
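
As a rough illustration of the field list above, here is one way a record could be modeled in Python. This is only a sketch, not an official loader: the class names and the parse_triple helper are invented for this example, and the annotator key is read as "rater", which is how the corpus example record spells it.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Evidence:
    url: str       # web page from which the evidence was obtained
    snippet: str   # short piece of text supporting the triple


@dataclass
class Judgment:
    rater: str     # hash code identifying the annotator
    judgment: str  # "yes" or "no"


@dataclass
class Triple:
    pred: str      # Freebase property, e.g. /people/person/place_of_birth
    sub: str       # Freebase MID of the subject
    obj: str       # Freebase MID of the object
    evidences: List[Evidence]
    judgments: List[Judgment]


def parse_triple(line: str) -> Triple:
    """Parse one line of a corpus file (one JSON object) into a Triple."""
    raw = json.loads(line)
    return Triple(
        pred=raw["pred"],
        sub=raw["sub"],
        obj=raw["obj"],
        evidences=[Evidence(e["url"], e["snippet"]) for e in raw["evidences"]],
        judgments=[Judgment(j["rater"], j["judgment"]) for j in raw["judgments"]],
    )
```

Because each line of a file is a complete JSON object, a file can be read line by line and passed to parse_triple without loading everything into memory.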

Here’s an example:

{"pred":"/people/person/place_of_birth","sub":"/m/026_tl9","obj":"/m/02_286","evidences":[{"url":"http://en.wikipedia.org/wiki/Morris_S._Miller","snippet":"Morris Smith Miller (July 31, 1779 -- November 16, 1824) was a United States Representative from New York. Born in New York City, he graduated from Union College in Schenectady in 1798. He studied law and was admitted to the bar. Miller served as private secretary to Governor Jay, and subsequently, in 1806, commenced the practice of his profession in Utica. He was president of the village of Utica in 1808 and judge of the court of common pleas of Oneida County from 1810 until his death."}],"judgments":[{"rater":"11595942516201422884","judgment":"yes"},{"rater":"16169597761094238409","judgment":"yes"},{"rater":"1014448455121957356","judgment":"yes"},{"rater":"16651790297630307764","judgment":"yes"},{"rater":"1855142007844680025","judgment":"yes"}]}
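
Since every record carries the individual rater judgments, filtering triples by agreement takes only a few lines. The sketch below is one possible approach, not a method prescribed by the corpus: the function names and the 0.8 agreement threshold are assumptions chosen for illustration. It labels a triple as positive when enough raters said "yes", as negative when enough said "no", and skips it otherwise.

```python
import json
from typing import Dict, Iterable, Iterator, Tuple


def count_votes(record: Dict) -> Tuple[int, int]:
    """Return (number of "yes" votes, number of "no" votes) for a record."""
    votes = [j["judgment"] for j in record["judgments"]]
    return votes.count("yes"), votes.count("no")


def label_by_agreement(
    lines: Iterable[str],
    min_agreement: float = 0.8,  # assumed threshold, tune to taste
) -> Iterator[Tuple[Dict, bool]]:
    """Yield (record, label) pairs for records whose raters mostly agree.

    label is True when the evidence was judged to support the triple and
    False when it was judged not to. Records with no judgments, or with
    too much disagreement, are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        yes, no = count_votes(record)
        total = yes + no
        if total == 0:
            continue  # nothing to vote on
        if yes / total >= min_agreement:
            yield record, True
        elif no / total >= min_agreement:
            yield record, False
```

With five raters and a threshold of 0.8, a triple is kept only when at least four raters gave the same answer. The example record above, with five "yes" judgments, would come out as a positive example.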

The web is chock full of information, put there to be read and learned from. Our hope is that this corpus is a small step towards computational understanding of the wealth of relations to be found everywhere you look.

This dataset is licensed by Google Inc. under the Creative Commons Attribution-ShareAlike 3.0 license.

Thanks to Shaohua Sun, Ni Lao, and Rahul Gupta for putting this dataset together.

Thanks also to Michael Ringgaard, Fernando Pereira, Amar Subramanya, Evgeniy Gabrilovich, and John Giannandrea for making this data release possible.