Vitaly Nikolaev
Authored Publications
Measuring Attribution in Natural Language Generation Models
Iulia Turc
Computational Linguistics, 49 (2023), pp. 777-840
With recent improvements in natural language generation (NLG) models for various applications, it has become imperative to be able to identify and evaluate whether NLG output shares only verifiable information about the external world. In this work, we present a new evaluation framework, Attributable to Identified Sources (AIS), for assessing the output of natural language generation models when that output pertains to the external world. We first define AIS and introduce a two-stage annotation pipeline that allows annotators to evaluate model output according to AIS guidelines. We empirically validate this approach on generation datasets spanning three tasks (two conversational QA datasets, a summarization dataset, and a table-to-text dataset) via human evaluation studies, which suggest that AIS could serve as a common framework for measuring whether model-generated statements are supported by underlying sources. We release guidelines for the human evaluation studies.
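As a rough illustration of how AIS-style judgments might be aggregated into a single score, here is a minimal Python sketch. The two-stage structure (interpretability first, then attribution) follows the abstract; the field names, the majority-vote rule, and the `ais_score` helper are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of aggregating two-stage AIS-style human judgments.
# Stage 1: is the output interpretable on its own?
# Stage 2: is all of its information attributable to the identified sources?
from dataclasses import dataclass
from statistics import mean

@dataclass
class Judgment:
    interpretable: bool   # stage 1 verdict from one annotator
    attributable: bool    # stage 2 verdict (only meaningful if stage 1 passes)

def ais_score(annotations: dict[str, list[Judgment]]) -> float:
    """Fraction of outputs judged interpretable and fully attributable
    by a majority of annotators (majority voting is an assumption here)."""
    def passes(judgments: list[Judgment]) -> bool:
        stage1 = mean(j.interpretable for j in judgments) > 0.5
        # Only annotators who found the output interpretable rate attribution.
        return stage1 and mean(
            j.attributable for j in judgments if j.interpretable) > 0.5
    return mean(passes(js) for js in annotations.values())

example = {
    "output-1": [Judgment(True, True), Judgment(True, True), Judgment(True, False)],
    "output-2": [Judgment(False, False), Judgment(True, False), Judgment(False, False)],
}
print(f"AIS = {ais_score(example):.2f}")  # AIS = 0.50
```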
SEAHORSE: A Dataset of Summaries Annotated with Human Ratings in Six Languages
Elizabeth Clark
Shruti Rijhwani
Sebastian Gehrmann
EMNLP 2023, Association for Computational Linguistics (2023)
We introduce Seahorse (SummariEs Annotated with Human Ratings in Six languagEs), a dataset of 96K summaries with ratings along 6 dimensions (comprehensibility, repetition, grammar, attribution, main idea(s), and conciseness). The summaries are generated from 8 different models, conditioned on source text from 4 datasets in 6 languages (German, English, Spanish, Russian, Turkish, and Vietnamese). We release the annotated summaries as a resource for developing better summarization models and automatic metrics. We present an analysis of the dataset's composition and quality, and we demonstrate the potential of this dataset for building better summarization metrics, showing that metrics finetuned with Seahorse data outperform baseline metrics.
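A minimal sketch of how SEAHORSE-style records could be turned into training data for a learned metric on one quality dimension, as the abstract describes for finetuned metrics. The `RatedSummary` schema, field names, and label encoding are assumptions for illustration, not the released format.

```python
# Sketch: turning rated summaries into (input, label) pairs for finetuning
# a text-pair quality classifier on a single dimension such as attribution.
from dataclasses import dataclass

DIMENSIONS = ("comprehensibility", "repetition", "grammar",
              "attribution", "main_ideas", "conciseness")

@dataclass
class RatedSummary:
    language: str             # one of: de, en, es, ru, tr, vi
    source_text: str
    summary: str
    ratings: dict[str, bool]  # dimension -> did the summary pass?

def training_pairs(records, dimension="attribution"):
    """Yield (input, label) pairs for one quality dimension."""
    assert dimension in DIMENSIONS
    for r in records:
        yield (f"premise: {r.source_text} hypothesis: {r.summary}",
               int(r.ratings[dimension]))

records = [RatedSummary("en", "Storms hit the coast on Monday, closing ports.",
                        "Storms closed coastal ports.",
                        {d: True for d in DIMENSIONS})]
print(list(training_pairs(records)))
```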
Planning with Learned Entity Prompts for Abstractive Summarization
Yao Zhao
Ryan McDonald
Transactions of the Association for Computational Linguistics, 9 (2021), pp. 1475-1492
We investigate the Entity Chain -- a chain of related entities in the summary -- as an intermediate summary representation to better plan and ground the generation of abstractive summaries. In particular, we achieve this by augmenting the target with an entity chain extracted from it. We experiment with Transformer-based encoder-decoder models: a transformer encoder first encodes the input, and a transformer decoder generates an intermediate summary representation in the form of an entity chain and then continues generating the summary conditioned on the entity chain and the input. We evaluate our approach on a diverse set of text summarization tasks and show that Pegasus models finetuned with entity chains clearly outperform regular finetuning in terms of entity accuracy. We further demonstrate that our simple method can easily be used to pretrain summarization models for entity-level content planning and summary generation, and we see further gains with pretraining.
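The core training-data transformation described above can be sketched in a few lines of Python. The `[ENTITYCHAIN]`/`[SUMMARY]` markers and the toy capitalization-based entity extractor below are illustrative stand-ins; a real setup would use a trained entity recognizer and the paper's own formatting.

```python
# Sketch: build a training target that puts the entity chain before the
# summary, so the decoder learns to emit a content plan first.
import re

def extract_entities(summary: str) -> list[str]:
    """Toy stand-in for an entity recognizer: runs of capitalized tokens."""
    return re.findall(r"[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*", summary)

def augment_target(summary: str) -> str:
    """Entity chain first (deduplicated, in order of mention), then summary."""
    chain = " | ".join(dict.fromkeys(extract_entities(summary)))
    return f"[ENTITYCHAIN] {chain} [SUMMARY] {summary}"

print(augment_target("Nadal beat Medvedev in Melbourne to win the Australian Open."))
# [ENTITYCHAIN] Nadal | Medvedev | Melbourne | Australian Open [SUMMARY] Nadal beat ...
```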
TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages
Eunsol Choi
Transactions of the Association for Computational Linguistics (2020)
Confidently making progress on multilingual modeling requires challenging, trustworthy evaluations. We present TyDi QA, a question answering dataset covering 11 typologically diverse languages. Until recently, most multilingual research in natural language processing has been limited to machine translation or to technical tasks such as tagging and parsing. Question answering offers a scenario that is natural in that non-technical users intuitively understand the task, allowing high-quality data collection in the absence of abundant annotators with expertise in both linguistics and the language of interest. This allows us to select languages that are diverse with regard to their typology -- the set of linguistic features that each language expresses. We expect that models that can perform well on this set will generalize across a large number of the languages in the world. To encourage a more realistic distribution, the data is collected entirely in each native language without the use of translation (human or otherwise), and question creation is performed without seeing the answers. We present a quantitative analysis of the data quality, provide example-level linguistic analyses and glosses of language phenomena that would not be found in English-only corpora, and measure the performance of baseline systems.
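For readers who want to explore the data, here is a quick sketch assuming the copy of TyDi QA hosted on the Hugging Face Hub under the `tydiqa` dataset id; the id, config name, and field names are assumptions about that mirror, not part of the paper.

```python
# Sketch: inspect the language distribution and a natively written question,
# assuming the Hugging Face mirror of TyDi QA ("primary_task" config).
from collections import Counter
from datasets import load_dataset

data = load_dataset("tydiqa", "primary_task", split="train")
print(Counter(ex["language"] for ex in data))  # examples per language
print(data[0]["question_text"])                # questions are written natively, not translated
```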
The Morpho-syntactic Annotation of Animacy for a Dependency Parser
Ali Elkahky
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan (2018), pp. 2607-2615
In this paper we present the annotation scheme and parser results for the animacy feature in Russian and Arabic, two morphologically rich languages, in the spirit of the universal dependency framework (McDonald et al., 2013; de Marneffe et al., 2014). We explain the animacy hierarchies in both languages and make the case for the existence of five animacy types. We train a morphological analyzer on the annotated data, and the results show a prediction F-measure for animacy of 95.39% for Russian and 92.71% for Arabic. We also use animacy along with other morphological tags as features to train a dependency parser, and the results show a slight improvement gained from animacy. We compare the impact of animacy on improving the dependency parser to that of other features found on nouns, namely ‘gender’, ‘number’, and ‘case’. To our knowledge, this is the first contrastive study of the impact of morphological features on the accuracy of a transition parser. A portion of our data (1,000 sentences each for Arabic and Russian, along with other languages) annotated according to the scheme described in this paper is made publicly available (https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1983) as part of the CoNLL 2017 Shared Task on Multilingual Parsing (Zeman et al., 2017).
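Since the released data follows the CoNLL-U format used by the shared task, the Animacy feature can be read directly from the FEATS column. A small self-contained sketch (the `animacy_counts` helper is illustrative):

```python
# Sketch: count Animacy feature values in a CoNLL-U treebank. FEATS is the
# 6th of the 10 tab-separated CoNLL-U columns, encoded as Key=Value pairs
# joined by "|", e.g. "Animacy=Anim|Case=Nom|Gender=Masc".
def animacy_counts(conllu_path: str) -> dict[str, int]:
    counts: dict[str, int] = {}
    with open(conllu_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and sentence-level comments
            cols = line.rstrip("\n").split("\t")
            for feat in cols[5].split("|"):
                if feat.startswith("Animacy="):
                    value = feat.split("=", 1)[1]
                    counts[value] = counts.get(value, 0) + 1
    return counts
```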
Improving Homograph Disambiguation with Supervised Machine Learning
Kyle Gorman
Gleb Mazovetskiy
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan (2018)
We describe a pre-existing rule-based homograph disambiguation system used for text-to-speech synthesis at Google, and compare it to a novel system which performs disambiguation using classifiers trained on a small amount of labeled data. An evaluation of these systems, using a new, freely available English data set, finds that hybrid systems (making use of both rules and machine learning) are significantly more accurate than either hand-written rules or machine learning alone. The evaluation also finds minimal performance degradation when the hybrid system is configured to run on limited-resource mobile devices rather than on production servers. The two best systems described here are used for homograph disambiguation on all US English text-to-speech traffic at Google.
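A minimal sketch of the hybrid idea the abstract describes: apply a hand-written rule when one fires, and back off to a trained classifier otherwise. The toy rule table, training data, and scikit-learn classifier below are illustrative stand-ins, not the production system.

```python
# Sketch: hybrid homograph disambiguation. Rules take precedence when a
# trigger word appears in the context; otherwise a learned classifier decides.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-written rules: (homograph, trigger word in context) -> pronunciation.
RULES = {("bass", "guitar"): "/beɪs/", ("bass", "fishing"): "/bæs/"}

def hybrid_disambiguate(word: str, context: str, classifier) -> str:
    for trigger, pron in ((t, p) for (w, t), p in RULES.items() if w == word):
        if trigger in context.split():
            return pron  # a rule fires: trust it
    return classifier.predict([context])[0]  # fall through to the classifier

# Train a tiny context classifier for "bass" (toy data, illustration only).
texts = ["he plays bass in a band", "caught a huge bass in the lake",
         "the bass line of the song", "bass swim in fresh water"]
labels = ["/beɪs/", "/bæs/", "/beɪs/", "/bæs/"]
clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(hybrid_disambiguate("bass", "she bought a bass guitar", clf))  # rule fires
print(hybrid_disambiguate("bass", "bass swim near the shore", clf))  # classifier decides
```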