Collocating Structured Web Tables with News Articles
Abstract
With the increased frequency of critical headline updates and news information published on the web, it can be overwhelming for users to understand and verify articles in a larger context. In particular, crucial statistics or facts within a news story may be difficult to parse without comparison to other entities. Structured web tables, as found in a source such as Wikipedia, offer users an important resource for understanding the news. Displaying tables or charts alongside news documents can assist in comprehension of a document's content. However, automatically finding relevant and meaningful connections between web tables and a given news article is a challenging research problem.
We are not aware of any research directly addressing this problem; however, one can treat the news article as a (long) keyword query and apply information-retrieval or question-answering techniques.
Previous work in this area used embeddings of knowledge-base (KB) entities together with different semantic-similarity metrics for table lookup.
Our experiments show that applying these baseline approaches in a straightforward way to this task yields spurious results that are inappropriate or irrelevant to a reader.
In this paper, we build on prior efforts, focusing specifically on the task of matching Wikipedia web tables to news articles. Our contribution includes a survey of existing techniques applied to the news-to-table matching problem. Building on these baselines, we propose a new model that leverages recent advances in bidirectional transformer language models along with entity-based table embeddings. Specifically, our technique comprises three components. First, we construct a training dataset built from news-article follow-up queries to Wikipedia articles, aggregated over a large number of users. Second, we extract unique web-table categories from Google's Knowledge Graph that describe Wikipedia table columns. Third, we fine-tune a Bidirectional Encoder Representations from Transformers (BERT) model, pre-trained on news corpus data.
Using human curators to create a gold-standard evaluation set, we show that our approach significantly outperforms the baselines.