Abstract
With the increased frequency of critical headline updates and news information published on the web, it can be overwhelming for users to understand and verify articles in a larger context. In particular, crucial statistics or facts within a news story may be difficult to interpret without comparison to other entities. Structured web tables, such as those found in Wikipedia, offer users an important resource for understanding the news. Displaying tables or charts alongside news documents can assist in comprehension of a document's content. However, automatically finding relevant and meaningful connections between web tables and a given news article is a challenging research problem.
We are not aware of any research directly addressing this problem;
however, one can think of the news article as a (long) keyword query and apply information retrieval or question-answering techniques.
Previous work in this area used embeddings of knowledge-base (KB) entities and a variety of semantic-similarity metrics for table lookup.
Our experiments show that applying these baseline approaches directly to this task yields spurious results that are inappropriate or irrelevant for a reader.
In this paper, we build on prior efforts, focusing specifically on the task of matching Wikipedia web tables to news articles. Our contributions include a survey of existing techniques applied to the news-to-web-table matching problem. Building on these baselines, we propose a new model that leverages recent advances in bidirectional transformer language models along with entity-based table embeddings. Specifically, our technique consists of three technical components. First, we construct a training dataset built from news-article follow-up queries to Wikipedia articles, aggregated over a large population of users. Second, we extract unique web-table categories from Google's Knowledge Graph that describe the columns of Wikipedia tables. Third, we fine-tune a Bidirectional Encoder Representations from Transformers (BERT) model, pre-trained on news corpus data.
Using a human-curated evaluation set as a gold standard, our approach significantly outperforms the baselines.