David Vilar

David Vilar is a Research Scientist who has been working on Google Translate since 2020. He has worked on machine translation research since 2003 and co-authored the open-source Jane (developed at RWTH Aachen) and Sockeye (developed at Amazon) translation toolkits.
Authored Publications
    There's no Data Like Better Data: Using QE Metrics for MT Data Filtering
    Jan-Thorsten Peter
    Mara Finkelstein
    Jurik Juraska
    Proceedings of the Eighth Conference on Machine Translation, Association for Computational Linguistics, Singapore (2023), pp. 561–577
    Abstract: Quality Estimation (QE), the evaluation of machine translation output without the need for explicit references, has seen large improvements in recent years with the use of neural metrics. In this paper we analyze the viability of using QE metrics to filter out low-quality sentence pairs from the training data of neural machine translation (NMT) systems. While most corpus filtering methods focus on detecting noisy examples in collections of text, usually huge amounts of web-crawled data, QE models are trained to discriminate more fine-grained quality differences. We show that by selecting the highest-quality sentence pairs in the training data, we can improve translation quality while reducing the training set size by half. We also provide a detailed analysis of the filtering results, which highlights the differences between the two approaches.
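The select-the-top-half filtering recipe described in this abstract can be sketched in a few lines. Note that `qe_score` below is a toy length-ratio heuristic standing in for a trained neural QE model; only the ranking-and-truncation logic mirrors the described approach.

```python
# Sketch of QE-based corpus filtering: score every sentence pair with a
# quality-estimation metric and keep only the best-scoring half.
# NOTE: `qe_score` is a toy stand-in for a trained neural QE model.

def qe_score(source: str, target: str) -> float:
    # Toy proxy: penalize strongly mismatched sentence lengths.
    return min(len(source), len(target)) / max(len(source), len(target), 1)

def filter_top_half(pairs):
    """Keep the higher-scoring half of a parallel corpus."""
    ranked = sorted(pairs, key=lambda p: qe_score(*p), reverse=True)
    return ranked[: len(ranked) // 2]

corpus = [
    ("ein kleines Haus", "a small house"),
    ("guten Morgen", "good morning"),
    ("Hallo Welt", "spam spam spam spam spam spam"),
    ("danke", "thank you"),
]
kept = filter_top_half(corpus)  # the half of the corpus with the best scores
```

In a real pipeline the scoring model would be a neural QE metric and the retention fraction a tunable hyperparameter; the paper's finding is that keeping the top half can both shrink the data and improve quality.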
    Prompting PaLM for Translation: Assessing Strategies and Performance
    Jiaming Luo
    Viresh Ratnakar
    Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada (2023), pp. 15406–15427
    Abstract: Large language models (LLMs) trained on multilingual but not parallel text exhibit a remarkable ability to translate between languages. We probe this ability in an in-depth study of the Pathways Language Model (PaLM), which has demonstrated the strongest machine translation (MT) performance among similarly trained LLMs to date. We investigate various strategies for choosing translation examples for few-shot prompting, concluding that example quality is the most important factor. Using optimized prompts, we revisit previous assessments of PaLM's MT capabilities with more recent test sets, modern MT metrics, and human evaluation, and find that its performance, while impressive, still lags behind that of state-of-the-art supervised systems. We conclude with an analysis of PaLM's MT output, which reveals some interesting properties and prospects for future work.
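For illustration, assembling a few-shot translation prompt of the kind studied in this paper might look as follows. The template and language names are illustrative assumptions, not the exact prompt format evaluated in the paper.

```python
# Build a few-shot translation prompt from example pairs.
# The "French:/English:" template is an illustrative choice only.

def build_prompt(examples, source_sentence, src_lang="French", tgt_lang="English"):
    """Format example translations followed by the sentence to translate."""
    lines = []
    for src, tgt in examples:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    # The model is expected to continue after the final target-language cue.
    lines.append(f"{src_lang}: {source_sentence}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)

prompt = build_prompt(
    [("Bonjour.", "Hello."), ("Merci beaucoup.", "Thank you very much.")],
    "Où est la gare ?",
)
```

The paper's central point is that *which* example pairs fill this template matters more than how many: high-quality examples yield the largest gains.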
    Abstract: We address the efficient calculation of influence functions (Koh & Liang, 2017) for tracing predictions back to the training data. We propose and analyze a new approach to speeding up the inverse-Hessian calculation based on Arnoldi iteration (Arnoldi, 1951). With this improvement, we achieve, to the best of our knowledge, the first successful implementation of influence functions that scales to full-size (language and vision) Transformer models with several hundred million parameters. We evaluate our approach on image classification and sequence-to-sequence tasks with tens to hundreds of millions of training examples. Our implementation will be publicly available at https://github.com/google-research/jax-influence.
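The numerical tool named in this abstract, Arnoldi iteration, builds a small Krylov-subspace approximation of a large matrix using only matrix-vector products, which is what makes Hessian computations tractable at scale. A minimal NumPy sketch of the classical iteration (not the paper's JAX implementation) is:

```python
import numpy as np

def arnoldi(matvec, dim, n_iters, seed=0):
    """Arnoldi iteration: build an orthonormal Krylov basis Q and a small
    upper-Hessenberg matrix H such that matvec(Q[:, :k]) ~= Q @ H."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((dim, n_iters + 1))
    H = np.zeros((n_iters + 1, n_iters))
    q0 = rng.normal(size=dim)
    Q[:, 0] = q0 / np.linalg.norm(q0)
    for k in range(n_iters):
        v = matvec(Q[:, k])
        for j in range(k + 1):  # orthogonalize against the earlier basis
            H[j, k] = Q[:, j] @ v
            v = v - H[j, k] * Q[:, j]
        H[k + 1, k] = np.linalg.norm(v)
        if H[k + 1, k] < 1e-12:  # Krylov subspace exhausted
            break
        Q[:, k + 1] = v / H[k + 1, k]
    return Q, H

# Toy check: for a small diagonal "Hessian", the Ritz values
# (eigenvalues of the leading block of H) recover the spectrum.
A = np.diag([6.0, 5.0, 4.0, 3.0, 2.0, 1.0])
Q, H = arnoldi(lambda x: A @ x, dim=6, n_iters=6)
ritz = np.sort(np.linalg.eigvals(H[:6, :6]).real)
```

In the influence-function setting, `matvec` would be a Hessian-vector product of the model's loss, and only a few dominant eigenpairs would be kept to approximate the inverse Hessian.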
    A Natural Diet: Towards Improving Naturalness of Machine Translation Output
    David Grangier
    Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online (2022)
    Abstract: Machine translation (MT) evaluation often focuses on accuracy and fluency without paying much attention to translation style. This means that, even when considered accurate and fluent, MT output can still sound less natural than high-quality human translations or text originally written in the target language. Machine translation output notably exhibits lower lexical diversity and employs constructs that mirror those in the source sentence. In this work we propose a method for training MT systems to achieve a more natural style, i.e., one mirroring the style of text originally written in the target language. Our method tags parallel training data according to the naturalness of the target side by contrasting language models trained on natural and on translated data. Tagging the data allows us to put greater emphasis on target sentences originally written in the target language. Automatic metrics show that the resulting models achieve lexical richness on par with human translations, mimicking a style much closer to that of sentences originally written in the target language. Furthermore, we find that their output is preferred by human experts over the baseline translations.
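The tagging idea, contrasting two language models and marking pairs whose target side the natural-text LM prefers, can be sketched as below. The add-one unigram LMs are toy stand-ins for the neural LMs used in the paper, and the `<natural>` tag string is an illustrative assumption.

```python
import math

# Sketch of naturalness tagging: compare a language model trained on
# natural target-language text against one trained on translated text,
# and tag pairs whose target side the natural-text LM prefers.

def make_unigram_lm(corpus):
    """Return a log-probability function from an add-one unigram model."""
    counts, total = {}, 0
    for sentence in corpus:
        for token in sentence.split():
            counts[token] = counts.get(token, 0) + 1
            total += 1
    vocab = len(counts) + 1  # +1 slot for unseen tokens
    def logprob(sentence):
        return sum(math.log((counts.get(t, 0) + 1) / (total + vocab))
                   for t in sentence.split())
    return logprob

def tag_pair(src, tgt, natural_lm, translated_lm, tag="<natural>"):
    """Prepend a tag to the source side when the natural-text LM wins."""
    if natural_lm(tgt) > translated_lm(tgt):
        return f"{tag} {src}", tgt
    return src, tgt

natural_lm = make_unigram_lm(["the cat sat on the mat"])
translated_lm = make_unigram_lm(["machine output style text"])
tagged_src, _ = tag_pair("die Katze sass", "the cat sat",
                         natural_lm, translated_lm)
```

At training time the MT model then learns to associate the tag with natural-sounding output, and the tag can be supplied at inference to request that style.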
    Controlling Machine Translation for Multiple Aspects with Additive Interventions
    Andrea Schioppa
    Artem Sokolov
    Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, pp. 6676–6696
    Abstract: Fine-grained control of machine translation (MT) output along multiple aspects is critical for many modern MT applications and is a requirement for gaining users' trust. A standard approach for exerting control in MT is to prepend the input with a special tag signaling the desired output aspect. Despite its simplicity, aspect tagging has several drawbacks: continuous values must be binned into discrete categories, which is unnatural for certain applications, and interference between multiple tags is poorly understood and requires fine-tuning. We address these problems by introducing vector-valued interventions, which allow fine-grained control over multiple aspects simultaneously via a weighted linear combination of the corresponding vectors. For some aspects, our approach even allows fine-tuning a model trained without annotations to support such interventions. In experiments with three aspects (length, politeness, and monotonicity) and two language pairs (English to German and English to Japanese), our models achieve better control over a wider range of tasks compared to tagging, and translation quality does not degrade when no control is requested. Finally, we demonstrate how to enable control in an already trained model after a relatively cheap fine-tuning stage.
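The core operation, adding a weighted linear combination of learned aspect vectors to the model's input representations, is simple to sketch. The vectors, weights, and embedding size below are illustrative placeholders, not values from the paper.

```python
import numpy as np

# Additive interventions: shift every token embedding by a weighted
# linear combination of learned per-aspect vectors.

def apply_interventions(token_embeddings, aspect_vectors, weights):
    """Return embeddings shifted by sum_i weights[i] * aspect_vectors[i]."""
    intervention = sum(w * aspect_vectors[name] for name, w in weights.items())
    return token_embeddings + intervention

aspect_vectors = {
    "length": np.array([1.0, 0.0, 0.0, 0.0]),
    "politeness": np.array([0.0, 1.0, 0.0, 0.0]),
}
embeddings = np.zeros((3, 4))  # 3 tokens, embedding size 4
shifted = apply_interventions(embeddings, aspect_vectors,
                              {"length": 0.5, "politeness": 2.0})
```

Because the weights are continuous scalars, this avoids the binning that discrete tags require, and multiple aspects compose by simple vector addition.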
    Abstract: Training data for machine translation (MT) is often sourced from a multitude of large corpora that are multi-faceted in nature, e.g., containing content from multiple domains or of different levels of quality or complexity. Naturally, these facets do not occur with equal frequency, nor are they equally important for the test scenario at hand. In this work, we propose to optimize this balance jointly with the MT model parameters, relieving system developers from manual schedule design. A multi-armed bandit is trained to dynamically choose between facets in the way that is most beneficial for the MT system. We evaluate it on three different multi-facet applications: balancing translationese and natural training data, and balancing data from multiple domains or multiple language pairs. We find that bandit learning leads to competitive MT systems across tasks, and our analysis provides insights into its learned strategies and the underlying datasets.
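One classical multi-armed bandit that fits this setting is EXP3; the sketch below is an assumption for illustration (the abstract does not name a specific algorithm). Arms correspond to data facets, and the reward would in practice come from the training MT system, e.g. a dev-set improvement signal; here a toy simulation rewards one arm.

```python
import math
import random

# EXP3 bandit choosing among data "facets" (e.g. domains or language
# pairs). The reward loop is a toy simulation in which facet 0 always
# helps; a real setup would reward facets that improve the MT system.

class Exp3:
    def __init__(self, n_arms, gamma=0.1):
        self.n = n_arms
        self.gamma = gamma
        self.weights = [1.0] * n_arms

    def probs(self):
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.n
                for w in self.weights]

    def pick(self, rng):
        r, acc = rng.random(), 0.0
        for arm, p in enumerate(self.probs()):
            acc += p
            if r <= acc:
                return arm
        return self.n - 1

    def update(self, arm, reward):
        # Importance-weighted exponential update for the chosen arm.
        p = self.probs()[arm]
        self.weights[arm] *= math.exp(self.gamma * reward / (p * self.n))

rng = random.Random(0)
bandit = Exp3(n_arms=3)
for _ in range(2000):
    arm = bandit.pick(rng)
    bandit.update(arm, reward=1.0 if arm == 0 else 0.0)
```

After training, the bandit's sampling distribution concentrates on the consistently rewarding facet while the exploration term keeps every facet visitable.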
    A Statistical Extension of Byte-Pair Encoding
    Marcello Federico
    Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), Association for Computational Linguistics, Bangkok, Thailand (online), pp. 263–275
    Learning Hidden Unit Contribution for Adapting Neural Machine Translation Models
    Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Association for Computational Linguistics, New Orleans, Louisiana, pp. 500–505
    Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation
    Matt Post
    Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, pp. 1314–1324
    Sockeye: A toolkit for neural machine translation
    Felix Hieber
    Tobias Domhan
    Michael Denkowski
    Artem Sokolov
    Ann Clifton
    Matt Post
    CoRR, vol. abs/1712.05690 (2017)
    Jane: an advanced freely available hierarchical machine translation toolkit
    Daniel Stein
    Matthias Huck
    Hermann Ney
    Machine Translation, vol. 26 (2012), pp. 197–216
    Cardinality pruning and language model heuristics for hierarchical phrase-based translation
    Hermann Ney
    Machine Translation, vol. 26 (2012), pp. 217–254
    Jane: Open Source Hierarchical Translation, Extended with Reordering and Lexicon Models
    Daniel Stein
    Matthias Huck
    Hermann Ney
    Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, Association for Computational Linguistics, Uppsala, Sweden (2010), pp. 262–270
    On LM Heuristics for the Cube Growing Algorithm
    Hermann Ney
    Proceedings of the 13th Annual Conference of the European Association for Machine Translation, European Association for Machine Translation, Barcelona, Spain (2009), pp. 242–249
    Can We Translate Letters?
    Jan-Thorsten Peter
    Hermann Ney
    Proceedings of the Second Workshop on Statistical Machine Translation, Association for Computational Linguistics, Prague, Czech Republic (2007), pp. 33–39
    Error Analysis of Statistical Machine Translation Output
    Jia Xu
    Luis Fernando D'Haro
    Hermann Ney
    Proceedings of the 5th Edition of the International Conference on Language Resources and Evaluation, Genoa, Italy (2006), pp. 697–702