To converse naturally, an agent needs to produce grammatically correct and eloquent sentences. To reach this goal, we use templatic systems with linguistically aware specifications to generate idiomatic utterances, coupled with annotated lexical entities. The morphosyntactic features of these lexical entities are crucial for rendering grammatical and natural-sounding sentences.
Existing electronic resources, such as dictionaries and thesauri, lack wide-scale information about referential expressions (i.e., proper names). In this work, we focus on the creation of a large-scale lexicon of such referential expressions, relying on n-gram models, morphosyntactic parsing, and non-linguistic knowledge. We describe the linguistic information we collect and the techniques we use to extract it automatically from large text corpora in a way that scales across languages and to millions of entities.
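
As a rough illustration of the idea, the sketch below infers one morphosyntactic feature of a proper name (whether it takes a definite article in English) from n-gram counts, then uses that feature when filling a template. Everything in it is an assumption made for illustration: the counts are invented, and the names `NGRAM_COUNTS`, `LexicalEntry`, `infer_entry`, and `render`, the single `takes_article` feature, and the 0.5 threshold are hypothetical rather than taken from the system described here.

```python
from dataclasses import dataclass

# Invented n-gram counts for illustration only; not drawn from any real corpus.
NGRAM_COUNTS = {
    ("to", "the", "Eiffel Tower"): 90,
    ("to", "Eiffel Tower"): 10,
    ("to", "the", "Paris"): 1,
    ("to", "Paris"): 99,
}


@dataclass(frozen=True)
class LexicalEntry:
    """A proper name annotated with one hypothetical morphosyntactic feature."""
    name: str
    takes_article: bool


def infer_entry(name, counts, threshold=0.5):
    """Guess whether `name` takes a definite article from n-gram frequencies."""
    with_article = counts.get(("to", "the", name), 0)
    bare = counts.get(("to", name), 0)
    total = with_article + bare
    # With no evidence at all, fall back to the bare form.
    ratio = with_article / total if total else 0.0
    return LexicalEntry(name=name, takes_article=ratio > threshold)


def render(template, entry):
    """Fill the {entity} slot, adding the article when the entry requires it."""
    phrase = f"the {entry.name}" if entry.takes_article else entry.name
    return template.format(entity=phrase)


if __name__ == "__main__":
    template = "Here are directions to {entity}."
    for name in ("Eiffel Tower", "Paris"):
        print(render(template, infer_entry(name, NGRAM_COUNTS)))
```

A real lexicon would carry many more features (gender, number, case, and so on, depending on the language), and the inference step would need smoothing and far richer contexts than a single preposition.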