WHAD: Wikipedia historical attributes data
Abstract
This paper describes the generation of temporally anchored infobox
attribute data from the Wikipedia revision history. By mining (attribute, value)
pairs from the revision history of the English Wikipedia, we are able to collect a
comprehensive knowledge base that contains data on how attributes change over
time. When dealing with the Wikipedia edit history, vandalism and erroneous edits
are a concern for data quality. We present a study of vandalism identification in
Wikipedia edits that uses only features from the infoboxes, and show that we can
obtain, on this dataset, an accuracy comparable to a state-of-the-art vandalism
identification method that is based on the whole article. Finally, we discuss different
characteristics of the extracted dataset, which we make available for further study.