Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion
Abstract
Recent years have witnessed a proliferation of large-scale
knowledge bases, including Wikipedia, Freebase, YAGO, Microsoft’s
Satori, and Google’s Knowledge Graph. To increase
the scale even further, we need to explore automatic
methods for constructing knowledge bases. Previous approaches
have primarily focused on text-based extraction,
which can be very noisy. Here we introduce Knowledge
Vault, a Web-scale probabilistic knowledge base that combines
extractions from Web content (obtained via analysis of
text, tabular data, page structure, and human annotations)
with prior knowledge derived from existing knowledge repositories.
We employ supervised machine learning methods for
fusing these distinct information sources. The Knowledge
Vault is substantially bigger than any previously published
structured knowledge repository, and features a probabilistic
inference system that computes calibrated probabilities
of fact correctness. We report the results of multiple studies
that explore the relative utility of the different information
sources and extraction methods.