| Yet in each word some concept there must be... |
| — from Goethe's Faust (Part I, Scene III) |
Human language is both rich and ambiguous. When we hear or read words, we resolve meanings to mental representations, for example recognizing and linking names to the intended persons, locations or organizations. Bridging words and meaning — from turning search queries into relevant results to suggesting targeted keywords for advertisers — is also Google's core competency, and important for many other tasks in information retrieval and natural language processing. We are happy to release a resource, spanning 7,560,141 concepts and 175,100,788 unique text strings, that we hope will help everyone working in these areas.
How do we represent concepts? Our approach piggybacks on the unique titles of entries from an encyclopedia, which are mostly proper and common noun phrases. We consider each individual Wikipedia article as representing a concept (an entity or an idea), identified by its URL. Text strings that refer to concepts were collected using the publicly available hypertext of anchors (the text you click on in a web link) that point to each Wikipedia page, thus drawing on the vast link structure of the web. For every English article we harvested the strings associated with its incoming hyperlinks from the rest of Wikipedia, the greater web, and also anchors of parallel, non-English Wikipedia pages. Our dictionaries are cross-lingual, and any concept deemed too fine can be broadened to a desired level of generality using Wikipedia's groupings of articles into hierarchical categories.
The data set contains triples, each consisting of (i) text, a short, raw natural language string; (ii) url, a related concept, represented by an English Wikipedia article's canonical location; and (iii) count, an integer indicating the number of times text has been observed connected with the concept's url. Our database thus includes weights that measure degrees of association. For example, the top two entries for football indicate that it is an ambiguous term, which is almost twice as likely to refer to what we in the US call soccer:
An inverted index can be used to perform reverse look-ups, identifying salient terms for each concept. Some of the highest-scoring strings — including synonyms and translations — for both sports, are listed below:
| concept | highest-scoring strings |
| “soccer” | football and Football; Soccer and soccer; Association football; fútbol and Fútbol; footballer; Futbol and futbol; Fußball; futebol; futbolista; サッカー; 축구; footballeur; Fußballspieler; sepak bola; 足球; فوتبال; футболист; כדורגל; piłkarz; voetbalclub; ฟุตบอล; bóng đá; voetbal; Foutbaal; futebolista; لعبة كرة القدم; fotbal |
| “football” | American football; football and Football; fútbol americano; football américain; アメリカンフットボール; American football rules; futebol americano; فوتبال آمریکایی; 美式足球; football americano; Amerikan futbolu; Le Football Américain; football field; อเมริกันฟุตบอล; פוטבול; كرة القدم الأمريكية; Futbol amerykański; 미식축구; futbolu amerykańskiego; football team; американского футбола; Amerikai futball; sepak bola Amerika; football player; američki fudbal; 反則; كرة القدم الأميركية |
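The two look-up directions can be sketched in a few lines of Python. This is a minimal illustration, not the released tooling: the triples and their counts are invented for the example, and the helper names `build_indexes` and `top_entries` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy triples (text, url, count). The counts are invented for
# illustration and are NOT taken from the released data.
TRIPLES = [
    ("football", "Association_football", 200),
    ("football", "American_football", 110),
    ("soccer", "Association_football", 150),
    ("fútbol", "Association_football", 40),
    ("American football", "American_football", 90),
]


def build_indexes(triples):
    """Return (text -> {url: count}, url -> {text: count})."""
    forward = defaultdict(lambda: defaultdict(int))
    inverted = defaultdict(lambda: defaultdict(int))
    for text, url, count in triples:
        forward[text][url] += count   # words-to-concepts direction
        inverted[url][text] += count  # concept-to-words (reverse look-up)
    return forward, inverted


def top_entries(index, key, k=3):
    """Return up to k (item, count) pairs for key, highest count first."""
    entries = index.get(key, {})
    return sorted(entries.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


forward, inverted = build_indexes(TRIPLES)
print(top_entries(forward, "football"))               # concepts for a string
print(top_entries(inverted, "Association_football"))  # strings for a concept
```

Keeping both maps in memory is only practical for small samples; at the scale of the full release, an on-disk index would likely be a better fit.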
Associated counts can easily be turned into percentages. The following table illustrates the concept-to-words dictionary direction — which may be useful for paraphrasing, summarization and topic modeling — for the idea of soft drink, restricted to English (and normalized for punctuation, pluralization and capitalization differences):
| rank (url=Soft_drink) | text | normalized variants | % |
| 1. | soft drink | (and soft-drinks) | 28.6 |
| 2. | soda | (and sodas) | 5.5 |
| 3. | soda pop | | 0.9 |
| 4. | fizzy drinks | | 0.6 |
| 5. | carbonated beverages | (and beverage) | 0.3 |
| 6. | non-alcoholic | | 0.2 |
| 7. | soft | | 0.1 |
| 8. | pop | | 0.1 |
| 9. | carbonated soft drink | (and drinks) | 0.1 |
| 10. | aerated water | | 0.1 |
| 11. | non-alcoholic drinks | (and drink) | 0.1 |
| 12. | soft drink controversy | | 0.0 |
| 13. | citrus-flavored soda | | 0.0 |
| 14. | carbonated | | 0.0 |
| 15. | soft drink topics | | 0.0 |
| ⋮ | | |
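Turning counts into percentages for one concept might look like the sketch below. It assumes a deliberately crude normalization (lowercasing and treating hyphens as spaces) that only approximates the punctuation, pluralization and capitalization handling mentioned above; the counts and the function names `normalize` and `to_percentages` are hypothetical.

```python
from collections import Counter

# Hypothetical raw counts of strings linking to one concept (invented numbers).
RAW_COUNTS = {
    "soft drink": 50,
    "Soft drink": 10,
    "soft-drink": 5,
    "soda": 25,
    "pop": 10,
}


def normalize(text):
    """Crude normalization (an assumption): lowercase, hyphens become spaces."""
    return text.lower().replace("-", " ")


def to_percentages(raw_counts):
    """Merge variants that normalize to the same key, then scale to percent."""
    merged = Counter()
    for text, count in raw_counts.items():
        merged[normalize(text)] += count
    total = sum(merged.values())
    if total == 0:
        return {}
    return {text: 100.0 * count / total for text, count in merged.items()}


percentages = to_percentages(RAW_COUNTS)
for text, pct in sorted(percentages.items(), key=lambda kv: -kv[1]):
    print(f"{text}\t{pct:.1f}")
```

The denominator here is the total over the strings passed in, so restricting the input (for example, to English strings only) changes the resulting percentages.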
The words-to-concepts dictionary direction can disambiguate senses and link entities, which are often highly ambiguous, since people, places and organizations can (nearly) all be named after each other. The next table shows the top concepts meant by the string Stanford, which refers to all three (and other) types:
The database that we are providing was designed for recall. It is large and noisy, incorporating 297,073,139 distinct string-concept pairs, aggregated over 3,152,091,432 individual links, many of them referencing non-existent articles. For technical details, see our paper (to be presented at LREC 2012) and the README file accompanying the data.
We hope that this release will fuel numerous creative applications that haven't been previously thought of!
Produced by Angel X. Chang and Valentin I. Spitkovsky; parts of this work are descended from an earlier collaboration between University of Basque Country's Ixa Group's Eneko Agirre and Stanford's NLP Group, including Eric Yeh, presently of SRI International, and our Ph.D. advisors, Christopher D. Manning and Daniel Jurafsky.