Publications
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
Sort By
16 - 30 of 235 publications
Dynamic Composition for Conversational Domain Exploration
Eran Ofek
Sagie Israel Pudinsky
Asaf Revach
Shimi Salant
The Web Conference, ACM (2020), 872–883
Preview abstract
We study conversational domain exploration (CODEX), where the user’s goal is to enrich her knowledge of a given domain by conversing with an informative bot. Such conversations should be well grounded in high-quality domain knowledge as well as engaging and open-ended. A CODEX bot should be proactive and introduce relevant information even if not directly asked for by the user. The bot should also appropriately pivot the conversation to undiscovered regions of the domain. To address these dialogue characteristics, we introduce a novel approach termed dynamic composition that decouples candidate content generation from the flexible composition of bot responses. This allows the bot to control the source, correctness and quality of the offered content, while achieving flexibility via a dialogue manager that selects the most appropriate contents in a compositional manner. We implemented a CODEX bot based on dynamic composition and integrated it into the Google Assistant. As an example domain, the bot conversed about the NBA basketball league in a seamless experience, such that users were not aware whether they were conversing with the vanilla system or the one augmented with our CODEX bot. Results are positive and offer insights into what makes for a good conversation. To the best of our knowledge, this is the first real user experiment of open-ended dialogues as part of a commercial
assistant system.
View details
On Identifiability in Transformers
Gino Brunner
Yang Liu
Damian Pascual Ortiz
Oliver Richter
Roger Wattenhofer
ICLR (2020)
Preview abstract
In this paper we delve deep in the Transformer architecture by investigating two of its core components: self-attention and contextual embeddings. In particular, we study the identifiability of attention weights and token embeddings, and the aggregation of context into hidden tokens. We show that, for sequences longer than the attention head dimension, attention weights are not identifiable. We propose effective attention as a complementary tool for improving explanatory interpretations based on attention. Furthermore, we show that input tokens retain to a large degree their identity across the model. We also find evidence suggesting that identity information is mainly encoded in the angle of the embeddings and gradually decreases with depth. Finally, we demonstrate strong mixing of input information in the generation of contextual embeddings by means of a novel quantification method based on gradient attribution. Overall, we show that self-attention distributions are not directly interpretable and present tools to better understand and further investigate Transformer models.
View details
Effective Multi Dialectal Arabic POS Tagging
Kareem Darwish
Hamdy Mubarak
Younes Samih
Ahmed Abdelali
Lluís Màrquez
Mohamed Eldesouki
Laura Kallmeyer
Natural Language Engineering (NLE) (2020)
Preview abstract
This work introduces robust multi-dialectal part of speech tagging trained on an annotated dataset of Arabic tweets in four major dialect groups: Egyptian, Levantine, Gulf, and Maghrebi. We implement two different sequence tagging approaches. The first uses Conditional Random Fields (CRF), while the second combines word and character-based representations in a Deep Neural Network with stacked layers of convolutional and recurrent networks with a CRF output layer. We successfully exploit a variety of features that help generalize our models, such as Brown clusters and stem templates. Also, we develop robust joint models that tag multi-dialectal tweets and outperform uni-dialectal taggers. We achieve a combined accuracy of 92.4% across all dialects, with per dialect results ranging between 90.2% and 95.4%. We obtained the results using a train/dev/test split of 70/10/20 for a dataset of 350 tweets per dialect.
View details
Open-source Multi-speaker Corpora of the English Accents in the British Isles
Clara E. Rivera
Proc. 12th Language Resources and Evaluation Conference (LREC 2020), European Language Resources Association (ELRA), 11--16 May, Marseille, France, 6532‑-6541
Preview abstract
This paper presents a dataset of transcribed high-quality audio of English sentences recorded by volunteers speaking with different accents of the British Isles. The dataset is intended for linguistic analysis as well as use for speech technologies. The recording scripts were curated specifically for accent elicitation, covering a variety of phonological phenomena and providing a high phoneme coverage. The scripts include pronunciations of global locations, major airlines and common personal names in different accents; and native speaker pronunciations of local words. Overlapping lines for all speakers were included for idiolect elicitation which include the same or similar lines with other existing resources such as the CSTR VCTK corpus and the Speech Accent Archive to allow for easy comparison of personal and regional accents. The resulting corpora include over 31 hours of recordings from 120 volunteers who self-identify as native speakers of Southern England, Midlands, Northern England, Welsh, Scottish and Irish varieties of English.
View details
Evaluating the Cross-Lingual Effectiveness of Massively Multilingual Neural Machine Translation
Aditya Siddhant
Ankur Bapna
Henry Tsai
Naveen Ari
AAAI 2020 (2020)
Preview abstract
Recently proposed Massively Multilingual Neural Machine Translation system has been shown to be capable of translating 102 languages to and from English within a single model. In this paper, we evaluate the cross-lingual effectiveness of representations from the encoder of such a model on 5 downstream classification and sequence tagging tasks spanning more than 50 languages. We compare our results to a strong multilingual baseline, BERT and show modest gains on zero-shot cross-lingual transfer in 4 out of these 5 tasks. Our results provide strong insight into how applicable the representations learned from multilingual machine translation are, across languages and tasks.
View details
TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages
Eunsol Choi
Transactions of the Association for Computational Linguistics (2020)
Preview abstract
Confidently making progress on multilingual modeling requires challenging, trustworthy evaluations. We present TyDi QA, a question answering dataset covering 11 typologically diverse languages. Until recently, most multilingual research in natural language processing has been limited to machine translation or to technical tasks such as tagging and parsing. Question answering offers a scenario that is natural in that non-technical users intuitively understand the task, allowing high quality data collection in the absence of abundant annotators with expertise in both linguistics and the language of interest. This allows us select languages that are diverse with regard to their typology -- the set of linguistic features that each language expresses. We expect that models that can perform well on this set will generalize across a large number of the languages in the world. To encourage a more realistic distribution, the data is collected entirely in each native language without the use of translation (human or otherwise) and question creation is performed without seeing the answers. We present a quantitative analysis of the data quality, we provide example-level linguistic analyses and glosses of language phenomena that would not be found in English-only corpora, and we measure the performance of baseline systems.
View details
Preview abstract
We release high quality processed text of Wikipedia for 40+ languages. We train monolingual causal language models establishing the first reported baselines for many languages. We also introduce the task of crosslingual causal modeling, we train our baseline model(transformer-xl) and report our results with varying setups. We release our data and trained models for the community to use as baseline for the further research in causal language modeling and crosslingual learning.
View details
Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation
Aditya Siddhant
Ankur Bapna
Naveen Ari
Yonghui Wu
Yuan Cao
ACL 2020 (2020)
Preview abstract
Over the last few years two promising research directions in low-resource neural machine translation (NMT) have emerged. The first focuses on utilizing high-resource languages to improve the quality of low-resource languages via multilingual NMT. The second direction employs monolingual data with self-supervision to pre-train translation models, followed by fine-tuning on small amounts of supervised data. In this work, we join these two lines of research and demonstrate the efficacy of monolingual data with self-supervision in multilingual NMT. We offer three major results: (i) Using monolingual data significantly boosts the translation quality of low-resource languages in multilingual models. (ii) Self-supervision improves zero-shot translation quality in multilingual models. (iii) Leveraging monolingual data with self-supervision provides a viable path towards adding new languages to multilingual models, getting up to 28 BLEU on ro-en translation without any parallel data or back-translation.
View details
Preview abstract
We address the problem of extractive question answering using document-level distant super-vision, pairing questions and relevant documents with answer strings. We compare previously used probability space and distant super-vision assumptions (assumptions on the correspondence between the weak answer string labels and possible answer mention spans). We show that these assumptions interact, and that different configurations provide complementary benefits. We demonstrate that a multi-objective model can efficiently combine the advantages of multiple assumptions and out-perform the best individual formulation. Our approach outperforms previous state-of-the-art models by 4.3 points in F1 on TriviaQA-Wiki and 1.7 points in Rouge-L on NarrativeQA summaries.
View details
GoEmotions: A Dataset of Fine-Grained Emotions
Dorottya Demszky
Alan Cowen
Gaurav Nemade
Sujith Ravi
ACL (2020) (to appear)
Preview abstract
Understanding emotion expressed in language has a wide range of applications, from building empathetic chatbots to detecting harmful online behavior. Advancement in this area can be improved using large-scale datasets with a fine-grained typology, adaptable to multiple downstream tasks. We introduce GoEmotions, the largest manually annotated dataset of 58k English Reddit comments, labeled for 27 emotion categories or Neutral. We demonstrate the high quality of the annotations via Principal Preserved Component Analysis. We conduct transfer learning experiments with existing emotion benchmarks to show that our dataset generalizes well to other domains and different emotion taxonomies. Our BERT-based model achieves an average F1-score of .46 across our proposed taxonomy, leaving much room for improvement.
View details
Open-Source High Quality Speech Datasets for Basque, Catalan and Galician
Alena Butryna
Clara E. Rivera
Proc. of 1st Joint Spoken Language Technologies for Under-Resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL) Workshop (SLTU-CCURL 2020), European Language Resources Association (ELRA), 11--12 May, Marseille, France, pp. 21-27
Preview abstract
This paper introduces three new open speech datasets for Basque, Catalan and Galician, which are languages of Spain, where Catalan is furthermore the official language of the Principality of Andorra. The datasets consist of high-quality multi-speaker recordings of the three languages along with the associated transcriptions. The resulting corpora include over 33 hours of crowd-sourced recordings
of 132 male and female native speakers. The recording scripts also include material for elicitation of global and local place names, personal and business names. The datasets are released under a permissive license and are available for free download for commercial, academic and personal use. The high-quality annotated speech datasets described in this paper can be used to, among other things, build text-to-speech systems, serve as adaptation data in automatic speech recognition and provide useful phonetic and phonological insights in corpus linguistics.
View details
Phonotactic Complexity and its Trade-offs
Tiago Pimentel
Ryan Cotterell
Transactions of the Association for Computational Linguistics (TACL), 8 (2020), pp. 1-18
Preview abstract
We present methods for calculating a measure of phonotactic complexity—bits per phoneme—that permits a straightforward cross-linguistic comparison. When given a word, represented as a sequence of phonemic segments such as symbols in the international phonetic alphabet, and a statistical model trained on a sample of word types from the language, we can approximately measure bits per phoneme using the negative log-probability of that word under the model. This simple measure allows us to compare the entropy across languages, giving insight into how complex a language’s phonotactics is. Using a collection of 1016 basic concept words across 106 languages, we demonstrate a very strong negative correlation of −0.74 between bits per phoneme and the average length of words.
View details
Preview abstract
The evaluation and exchange of large lexicon databases remains a challenge in many NLP applications. Despite the existence of commonly accepted standards for the format and the features used in a lexicon, there is still a lack of precise and interoperable requirement specifications about how lexical entries of a particular language should look like, both in terms of the numbers of forms and in terms of features associated with these forms. This paper presents the notion of "lexical masks", a powerful tool used to evaluate and exchange lexicon databases in many languages.
View details
Burmese Speech Corpus, Finite-State Text Normalization and Pronunciation Grammars with an Application to Text-to-Speech
Yin May Oo
Chen Fang Li
Pasindu De Silva
Supheakmungkol Sarin
Knot Pipatsrisawat
Martin Jansche
Proc. 12th Language Resources and Evaluation Conference (LREC 2020), European Language Resources Association (ELRA), 11--16 May, Marseille, France, pp. 6328-6339
Preview abstract
This paper introduces an open-source crowd-sourced multi-speaker speech corpus along with the comprehensive set of finite-state transducer (FST) grammars for performing text normalization for the Burmese (Myanmar) language. We also introduce the open-source finite-state grammars for performing grapheme-to-phoneme (G2P) conversion for Burmese. These three components are necessary (but not sufficient) for building a high-quality text-to-speech (TTS) system for Burmese, a tonal Southeast Asian language from the Sino-Tibetan family which presents several linguistic challenges. We describe the corpus acquisition process and provide the details of our finite statebased approach to Burmese text normalization and G2P. Our experiments involve building a multispeaker TTS system based on long short term memory (LSTM) recurrent neural network (RNN) models, which were previously shown to perform well for other languages in a low-resource setting. Our results indicate that the data and grammars that we are announcing are sufficient to build
reasonably high-quality models comparable to other systems. We hope these resources will facilitate speech and language research on the Burmese language, which is considered by many to be lowresource due to the limited availability of free linguistic data.
View details
Eidos: An Open-Source Auditory Periphery Modeling Toolkit and Evaluation of Cross-Lingual Phonemic Contrasts
Proc. of 1st Joint Spoken Language Technologies for Under-Resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL) Workshop (SLTU-CCURL 2020), European Language Resources Association (ELRA), 11--12 May, Marseille, France, pp. 9-20
Preview abstract
Many analytical models that mimic, in varying degree of detail, the basic auditory processes involved in human hearing have been developed over the past decades. While the auditory periphery mechanisms responsible for transducing the sound pressure wave into the auditory nerve discharge are relatively well understood, the models that describe them are usually very complex because they try to faithfully simulate the behavior of several functionally distinct biological units involved in hearing. Because of this, there is a relative scarcity of toolkits that support combining publicly-available auditory models from multiple sources. We address this shortcoming by presenting an open-source auditory toolkit that integrates multiple models of various stages of human auditory processing into a simple and easily configurable pipeline, which supports easy switching between ten available models. The auditory representations that the pipeline produces can serve as machine learning features and provide analytical benchmark for comparing against auditory filters learned from the data. Given a low- and high-resource language pair, we evaluate several auditory representations on a simple multilingual phonemic contrast task to determine whether contrasts that are meaningful within a language are also empirically robust across languages.
View details