Aliaksei Severyn

Authored Publications
    Text-editing models have recently become a prominent alternative to seq2seq models for monolingual natural language generation (NLG) tasks such as grammatical error correction, text simplification, and style transfer. These tasks exhibit a large amount of textual overlap between the source and target texts. Text-editing models take advantage of this trait and learn to generate the output by predicting edit operations applied to the source sequence, in contrast to seq2seq models, which generate the output from scratch. Text-editing models provide several benefits over seq2seq models, including faster inference speed, higher sample efficiency, and better control and interpretability of the outputs. This tutorial provides a comprehensive overview of text-editing approaches and current state-of-the-art models, analyzing the pros and cons of different methods. We discuss challenges related to productionization and how these models can help mitigate hallucination and bias, both pressing challenges in the field of text generation.
    We propose a new model for grammatical error correction (GEC) which builds on a very large multilingual masked language model, covering 101 languages. To adapt our model for the GEC task, we design an unsupervised, language-agnostic pretraining objective that mimics corrections typically contained in labeled data. After finetuning on gold data, we surpass the previous state-of-the-art results on the four evaluated languages (Czech, English, German and Russian). This approach shows the power of large multilingual language models. Due to these models being non-trivial to run on non-cluster infrastructure, we employ our model to clean up the labels in the popular yet noisy Lang-8 dataset. We release this dataset and hope that the community will find it useful for further advancement of GEC.
    Leveraging Pre-trained Checkpoints for Sequence Generation Tasks
    Transactions of the Association for Computational Linguistics, vol. 8 (2020), pp. 264-280
    Pre-trained neural networks have become widely successful in Natural Language Processing. Training these large models on unsupervised data is costly and often not feasible. We therefore concentrate on publicly available checkpoints. While most of them improve Natural Language Understanding, we investigate initializing Transformer-based sequence-to-sequence models with these pre-trained models for both Natural Language Understanding and Generation. Using these pre-trained models, we achieve new state-of-the-art results on Machine Translation, Summarization, and Sentence Splitting/Fusion.
    We present FELIX, a flexible text-editing approach for generation, designed to derive maximum benefit from the ideas of decoding with bi-directional contexts and self-supervised pre-training. In contrast to conventional sequence-to-sequence (seq2seq) models, FELIX is efficient in low-resource settings and fast at inference time, while being capable of modeling flexible input-output transformations. We achieve this by decomposing the text-editing task into two sub-tasks: tagging, to decide on the subset of input tokens and their order in the output text, and insertion, to in-fill the missing tokens in the output not present in the input. The tagging model employs a novel Pointer mechanism, while the insertion model is based on a Masked Language Model. Both of these models are chosen to be non-autoregressive to guarantee faster inference. FELIX performs favourably when compared to recent text-editing methods and strong seq2seq baselines when evaluated on four NLG tasks: Sentence Fusion, Machine Translation Automatic Post-Editing, Summarization, and Text Simplification.
    We propose MASKER, an unsupervised text-editing method for style transfer. To tackle cases when no parallel source–target pairs are available, we train masked language models (MLMs) for both the source and the target domain. Then we find the text spans where the two models disagree the most in terms of likelihood. This allows us to identify the source tokens to delete to transform the source text to match the style of the target domain. The deleted tokens are replaced with the target MLM, and by using a padded MLM variant, we avoid having to predetermine the number of inserted tokens. Our experiments on sentence fusion and sentiment transfer demonstrate that MASKER performs competitively in a fully unsupervised setting. Moreover, in low-resource settings, it improves supervised methods’ accuracy by over 10 percentage points when pre-training them on silver training data generated by MASKER.
    Using Audio Transformations to Improve Comprehension in Voice Question Answering
    Johanne R. Trippas
    Hanna Silen
    Damiano Spina
    Crestani F. et al. (eds) Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2019, Springer, Cham, pp. 164-170
    Many popular form factors of digital assistants (such as Amazon Echo, Apple Homepod, or Google Home) enable the user to hold a conversation with these systems based only on the speech modality. The lack of a screen presents unique challenges. To satisfy the information need of a user, the presentation of the answer needs to be optimized for such voice-only interactions. In this paper, we propose a task of evaluating the usefulness of audio transformations (i.e., prosodic modifications) for voice-only question answering. We introduce a crowdsourcing setup where we evaluate the quality of our proposed modifications along multiple dimensions corresponding to the informativeness, naturalness, and ability of the user to identify key parts of the answer. We offer a set of prosodic modifications that highlight potentially important parts of the answer using various acoustic cues. Our experiments show that some of these modifications lead to better comprehension at the expense of only slightly degraded naturalness of the audio.
    We propose LaserTagger, a sequence tagging approach that casts text generation as a text editing task. Target texts are reconstructed from the inputs using three main edit operations: keeping a token, deleting it, and adding a phrase before the token. To predict the edit operations, we propose a novel model, which combines a BERT encoder with an autoregressive Transformer decoder. This approach is evaluated on English text on four tasks: sentence fusion, sentence splitting, abstractive summarization, and grammar correction. LaserTagger achieves new state-of-the-art results on three of these tasks, performs comparably to a set of strong seq2seq baselines with a large number of training examples, and outperforms them when the number of examples is limited. Furthermore, we show that at inference time tagging can be more than two orders of magnitude faster than comparable seq2seq models, making it more attractive for running in a live environment.
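The three edit operations named in the abstract above (keep a token, delete it, add a phrase before it) can be illustrated with a small sketch. This is not the LaserTagger implementation: the tag representation, the function name `apply_edit_tags`, and the example sentence are assumptions made for illustration, and the tags are written by hand rather than predicted by a model.

```python
from typing import List, Optional, Tuple

# Hypothetical tag format: (base_op, added_phrase).
# base_op is "KEEP" or "DELETE"; added_phrase, if present,
# is emitted before the token the tag belongs to.
Tag = Tuple[str, Optional[str]]


def apply_edit_tags(tokens: List[str], tags: List[Tag]) -> str:
    """Rebuild a target text from source tokens and one edit tag per token."""
    if len(tokens) != len(tags):
        raise ValueError("need exactly one tag per source token")
    output: List[str] = []
    for token, (base_op, phrase) in zip(tokens, tags):
        if phrase:
            # "Add a phrase before the token" applies to KEEP and DELETE alike.
            output.append(phrase)
        if base_op == "KEEP":
            output.append(token)
        elif base_op != "DELETE":
            raise ValueError(f"unknown edit operation: {base_op}")
    return " ".join(output)


# Sentence fusion example with hand-written tags.
source = "Turing was born in 1912 . Turing died in 1954 .".split()
keep: Tag = ("KEEP", None)
tags = [keep] * 5 + [("DELETE", None), ("DELETE", "and")] + [keep] * 4
fused = apply_edit_tags(source, tags)
```

Here the first full stop and the second "Turing" are deleted, and "and" is added in front of the deleted second "Turing", so most of the output is copied from the input rather than generated.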
    The softmax function on top of a final linear layer is the de facto method to output probability distributions in neural networks. In many applications such as language models or text generation, these models have to produce distributions over large output vocabularies. Recently, this has been shown to have limited representational capacity due to its connection with the rank bottleneck in matrix factorization. However, little is known about the limitations of linear-softmax for quantities of practical interest such as cross entropy or mode estimation, a direction explored theoretically and empirically in this paper. As an efficient and effective solution to alleviate this issue, we propose to learn parametric monotonic functions on top of the logits. Theoretically, we show that such monotonic functions are likely to increase the rank of a matrix to its full rank. Empirically, our method improves over the traditional softmax-linear layer both in synthetic and real language model experiments with negligible time or memory overhead, while being comparable to the more computationally expensive mixture of softmax distributions.
    In this paper we study various flavors of variational autoencoders, address methodological issues in current neural text generation research, and close some gaps by answering a few natural questions raised by studies already published.
    In this paper we explore the effect of architectural choices on Variational Autoencoder models for text. In contrast to the previously introduced VAE model for text where both the encoder and decoder are RNNs, we propose a novel hybrid architecture that blends a fully feed-forward convolutional and deconvolutional component with a recurrent language model. This architecture exhibits several attractive properties such as fast run time, ability to better handle long sequences and, more importantly, we demonstrate that our model helps to avoid some of the major difficulties posed by training VAE models on textual data.
    Neural Ranking Models with Weak Supervision
    Hamed Zamani
    Jaap Kamps
    W. Bruce Croft
    Proceedings of The 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM (2017)
    Despite the impressive improvements achieved by unsupervised deep neural networks in computer vision, natural language processing, and speech recognition tasks, such improvements have not generally been observed in ranking for information retrieval. The reason might be related to the complexity of the ranking problem, in the sense that it is not obvious how to learn from queries and documents when no supervised signal is available. Hence, in this paper, we propose to train a neural ranking model from a weak supervision signal, which is a training signal that can be obtained automatically without human labeling or any external resources (e.g., click data). To this aim, we use the output of a known unsupervised ranking model, such as BM25, as a weak supervision signal. We further train a set of simple yet effective ranking models based on feed-forward neural networks. We study their effectiveness under various learning scenarios (point-wise and pair-wise models) and using different input representations (i.e., from encoding query-document pairs into dense/sparse vectors to using word embedding representation). We train our network on 5 million unique queries obtained from the publicly available AOL query logs and two standard collections: a homogeneous news collection (Robust) and a heterogeneous large-scale web collection (ClueWeb). Our experiments indicate that feeding raw data to the networks and letting them learn representations for the input data leads to an impressive performance, with over 13% and 35% MAP improvements compared to the BM25 model on the Robust and the ClueWeb collections, respectively. Our findings suggest that neural ranking models can greatly benefit from large amounts of weakly labeled data that can be easily obtained from unsupervised IR models.
    In this paper, we propose a method for training neural networks when we have a large set of data with weak labels and a small amount of data with true labels. In our proposed model, we train two neural networks: a target network (the learner) and a confidence network (the meta-learner). The target network is optimized to perform a given task and is trained using a large set of unlabeled data that are weakly annotated. We propose to control the magnitude of the gradient updates to the target network using the scores provided by the second confidence network, which is trained on a small amount of supervised data. This prevents weight updates computed from noisy labels from harming the quality of the target network model.
    The use of weak or noisy signals, such as the output of heuristic methods or user click-through data, for training deep neural networks is increasing, in particular for tasks where an adequate amount of data with true labels is not available. In a semi-supervised setting, we can use a large set of data with weak labels to pretrain a neural network and fine-tune the parameters with a small amount of data with true labels. However, these two independent stages do not leverage the full capacity of clean information from true labels during pretraining. In this paper, we propose a semi-supervised learning method where we train two neural networks in a multi-task fashion: a target network and a confidence network. The target network is optimized to perform a given task and is trained using a large set of unlabeled data that are weakly annotated. We propose to weight the gradient updates to the target network using the scores provided by the second confidence network, which is trained on a small amount of supervised data. This prevents weight updates computed from noisy labels from harming the quality of the target network model. We evaluate our learning strategy on two different tasks: document ranking and sentiment classification. The results demonstrate that our approach not only enhances the performance compared to the baselines but also speeds up the learning process from weak labels.
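The idea of weighting gradient updates by a confidence score can be sketched on a toy problem. The sketch below is an assumption-laden illustration, not the paper's method: the target "network" is a one-parameter linear model with squared loss, the function name `weighted_sgd_step` is hypothetical, and the confidence scores are supplied by hand, whereas in the paper they come from a second network trained on a small amount of supervised data.

```python
from typing import List, Tuple


def weighted_sgd_step(
    w: float,
    examples: List[Tuple[float, float]],  # (input x, weak label y)
    confidences: List[float],             # one score in [0, 1] per example
    lr: float,
) -> float:
    """One SGD step for the model y_hat = w * x with squared loss,
    scaling each example's gradient by its confidence score."""
    if len(examples) != len(confidences):
        raise ValueError("need exactly one confidence score per example")
    grad = 0.0
    for (x, y), c in zip(examples, confidences):
        # d/dw (w*x - y)^2 = 2 * (w*x - y) * x, scaled by confidence c.
        grad += c * 2.0 * (w * x - y) * x
    grad /= len(examples)
    return w - lr * grad


# The second weak label is far off; a confidence of 0 removes its influence.
batch = [(1.0, 2.0), (1.0, -10.0)]
w_trusting = weighted_sgd_step(0.0, batch, [1.0, 1.0], lr=0.5)
w_weighted = weighted_sgd_step(0.0, batch, [1.0, 0.0], lr=0.5)
```

With equal confidence the noisy label drags the weight to -4.0, while down-weighting it moves the weight to 1.0, toward the trusted label.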
    Recurrent Dropout without Memory Loss
    Stanislau Semeniuta
    Erhardt Barth
    ArXiv (2016)
    This paper presents a novel approach to recurrent neural network (RNN) regularization. Unlike the widely adopted dropout method, which is applied to forward connections of feed-forward architectures or RNNs, we propose to drop neurons directly in recurrent connections in a way that does not cause loss of long-term memory. Our approach is as easy to implement and apply as the regular feed-forward dropout and we demonstrate its effectiveness for the most popular recurrent networks: vanilla RNNs, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. Our experiments on three NLP benchmarks show consistent improvements even when combined with conventional feed-forward dropout.
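One way to read "dropping neurons in recurrent connections without losing long-term memory" for an LSTM is to apply dropout only to the candidate update that is added to the cell state, so the previous cell state itself is never zeroed. The sketch below follows that reading under stated assumptions: the gate values are passed in as plain lists instead of being computed by a network, and the function name `lstm_cell_update` is hypothetical.

```python
import random
from typing import List


def lstm_cell_update(
    c_prev: List[float],   # previous cell state
    f_gate: List[float],   # forget gate activations
    i_gate: List[float],   # input gate activations
    g_cand: List[float],   # candidate update values
    drop_prob: float,
    rng: random.Random,
    training: bool = True,
) -> List[float]:
    """Compute c_t = f * c_prev + i * dropout(g), element-wise.

    Dropout touches only the candidate update g, so the memory carried
    in c_prev passes through the forget gate untouched.
    """
    keep = 1.0 - drop_prob
    c_new = []
    for c, f, i, g in zip(c_prev, f_gate, i_gate, g_cand):
        if training and drop_prob > 0.0:
            # Inverted dropout: rescale kept units so the expectation is unchanged.
            g = g / keep if rng.random() < keep else 0.0
        c_new.append(f * c + i * g)
    return c_new


rng = random.Random(0)
# Even when every update is dropped, the stored memory survives.
all_dropped = lstm_cell_update([2.0], [1.0], [1.0], [5.0], 1.0, rng)
# At inference time dropout is disabled and the full update is applied.
inference = lstm_cell_update([2.0], [1.0], [1.0], [5.0], 0.5, rng, training=False)
```

Dropping the whole cell state instead would erase whatever the unit had stored from earlier time steps, which is the memory loss the abstract refers to.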
    We introduce a globally normalized transition-based neural network model that achieves state-of-the-art part-of-speech tagging, dependency parsing and sentence compression results. Our model is a simple feed-forward neural network that operates on a task-specific transition system, yet achieves comparable or better accuracies than recurrent models. We discuss the importance of global as opposed to local normalization: a key insight is that the label bias problem implies that globally normalized models can be strictly more expressive than locally normalized models.
    Opinion Mining on YouTube
    Olga Uryupina
    Barbara Plank
    Alessandro Moschitti
    Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL'14) (2014), pp. 1252-1261