Scalable Attribute-Value Extraction from Semi-Structured Text

Yuk Wah Wong; Dominic Widdows; Tom Lokovic; Kamal Nigam

ICDM Workshop on Large-scale Data Mining: Theory and Applications (2009)

Abstract

This paper describes a general methodology for extracting attribute-value pairs from web pages. It consists of two phases: candidate generation, in which syntactically likely attribute-value pairs are annotated; and candidate filtering, in which semantically improbable annotations are removed. We describe three types of candidate generators and two types of candidate filters, all of which are designed to be massively parallelizable. Our methods can handle 1 billion web pages in less than 6 hours with 1,000 machines. The best generator and filter combination achieves 70% F-measure compared to a hand-annotated corpus.
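The two-phase structure described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the three generators or two filters work, so everything concrete here is an assumption made for illustration: the generator is a hypothetical one that treats "Attribute: value" lines as candidates, and the filter is a hypothetical one that keeps only attribute names seen on at least `min_pages` pages. The function names `generate_candidates`, `filter_candidates`, and `extract` are likewise invented for this example.

```python
import re
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]

# Assumed syntactic pattern (not from the paper): a short alphabetic
# attribute name, a colon, then a non-empty value on the same line.
PAIR_RE = re.compile(r"^\s*([A-Za-z][A-Za-z ]{0,30}?)\s*:\s*(\S.*?)\s*$")


def generate_candidates(page_text: str) -> List[Pair]:
    """Phase 1: annotate syntactically likely attribute-value pairs.

    Each page is processed independently, so this step could be run
    as a parallel map over pages.
    """
    pairs = []
    for line in page_text.splitlines():
        match = PAIR_RE.match(line)
        if match:
            pairs.append((match.group(1).lower(), match.group(2)))
    return pairs


def filter_candidates(
    pages_pairs: List[List[Pair]], min_pages: int = 2
) -> List[List[Pair]]:
    """Phase 2: remove improbable annotations.

    Stand-in heuristic (an assumption): an attribute name is kept only
    if it occurs on at least `min_pages` distinct pages.
    """
    support = Counter()
    for pairs in pages_pairs:
        # Count each attribute name at most once per page.
        support.update({attr for attr, _ in pairs})
    return [
        [(attr, value) for attr, value in pairs if support[attr] >= min_pages]
        for pairs in pages_pairs
    ]


def extract(pages: List[str]) -> List[List[Pair]]:
    """Run candidate generation, then candidate filtering."""
    candidates = [generate_candidates(page) for page in pages]
    return filter_candidates(candidates)


if __name__ == "__main__":
    pages = [
        "Acme Kettle\nColor: red\nWeight: 2 kg\nNote: ships Monday",
        "Zen Teapot\nColor: blue\nWeight: 1 kg",
    ]
    for page_pairs in extract(pages):
        print(page_pairs)
```

In this toy run, "note" appears on only one page and is dropped by the filter, while "color" and "weight" survive. A real system would replace both stand-ins with the generators and filters the paper describes.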
